Graduate admission with Machine Learning.

May 17, 2021 6-minute read

Graduate admission using Machine Learning

The graduate admission can be really tough for most of the students. As a student that want to form part of a graduate school, i wonder myself, what are my chance to get into one of this schools?

In the webpage Kaggle, i found this incredible dataset that collect data from undergraduate students, such as the University Rating (the University place where study their undergraduate), their CGPA, their TOEFL Score and others. The most important thing that this dataset has is the Chance of Admit from every student based on their qualifications.

With all this data, it’s possible to use Multiple Linear Regression (since we had several variables) to create a model to predict the chance od admission of new students that aren’t in the dataset. So let get started!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
from jupyterthemes import jtplot
jtplot.style(theme='monokai')

Read and visualizate the data

Using pandas, you can read the csv file and know things like how many rows and columns had.

data=pd.read_csv('Admission_Predict.csv')

students=pd.DataFrame(data)
students

	Serial No.	GRE Score	TOEFL Score	University Rating	SOP	LOR	CGPA	Research	Chance of Admit
0	1	337	118	4	4.5	4.5	9.65	1	0.92
1	2	324	107	4	4	4.5	8.87	1	0.76
2	3	316	104	3	3	3.5	8	1	0.72
3	4	322	110	3	3.5	2.5	8.67	1	0.8
4	5	314	103	2	2	3	8.21	0	0.65

Modify the data.

Right now, the Serial No. and the GRE Score sections are useless for our analysis, since the Serial No. just is for indexing and in our times, due to COVID-19, most universities doesn’t require GRE Score for graduate admissions.

students=students.drop(['Serial No.','GRE Score'],axis=1)
students

	TOEFL Score	University Rating	SOP	LOR	CGPA	Research	Chance of Admit
0	118	4	4.5	4.5	9.65	1	0.92
1	107	4	4	4.5	8.87	1	0.76
2	104	3	3	3.5	8	1	0.72
3	110	3	3.5	2.5	8.67	1	0.8
4	103	2	2	3	8.21	0	0.65

Now, let’s see how are the average students that came for all our university rating.

students_by_university=students.groupby(by='University Rating').mean()
students_by_university

University Rating	TOEFL Score	SOP	LOR	CGPA	Research	Chance of Admit
1	99.0769	1.88462	2.21154	7.74577	0.192308	0.548077
2	103.523	2.70561	2.92523	8.18374	0.299065	0.625981
3	106.887	3.36466	3.40226	8.55226	0.533835	0.71188
4	111.824	4.10811	4.00676	9.02162	0.797297	0.818108
5	113.667	4.5	4.35833	9.29117	0.866667	0.888167

In the data, we can see that there is two sections, that is:

SOP=Statement of Purpose
LOR=Letter of Recomendation Strength

That two sections had a value that goes from 0 to 5. In this case, Research had only a value 0 (if the student isn’t do research) or 1 (if the student do some research during his undergraduate).

The university rating is 1 to 5, depending on the reputation of the university where you came from.

Let’s have a look of the data with pairplot from seaborn. That gonna give us a pairs plot.

Explorstory Analysis

But what are a pairs plot? Well, it’s a couple of plots from one section to another. This gives me the oportunity to check if exist some relationship between two variables.

sns.pairplot(students)

We can see that the histograms are the distribution from a single variable (to see how it works). Also, you can check that, for example, in Chance of Admit and TOEFL Score had a relationship, while anothers isn’t had to much relationship.

It’s a fantastic idea check the correlation of the variables to see what can affect more our student profile.

fig, axs=plt.subplots(figsize=(10,10))
sns.heatmap(students.corr(),annot=True)
plt.savefig('heatmap.png')

From this heatmap, it’s easy to see that the principal features that affect the Chance of Admit are:

University Rating (0.71)
TOELF Score (0.79)
CGPA (0.87)

So, we can study the correlation between CGPA and the chance of Admit. I create a function that can plot this relation.

def analysis(section):
    global students
    fig,axs=plt.subplots(1,2,figsize=(20,6))
    #First plot, check a histogram
    sns.distplot(students[section],ax=axs[0])
    axs[0].spines['top'].set_visible(False)
    axs[0].spines['right'].set_visible(False)
    axs[0].grid(False)
    #Now, check the correlation of the variables
    sns.regplot(students[section],students['Chance of Admit'],ax=axs[1])
    axs[1].spines['top'].set_visible(False)
    axs[1].spines['right'].set_visible(False)
    axs[1].grid(False)
    plt.show()

analysis('CGPA')

It’s really easy to note that CGPA had a really strong correlation with the chance of admission of the students.

What about the TOEFL score?

analysis('TOEFL Score')

Also had a strong correlation, but it less than the CGPA.

And now, the last variable, the University Rating.

analysis('University Rating')

This doesn’t show too much correlation, we can see it in the plot from the right.

But with the plot from the left, we can look that the mayority of the universities in this data had a rating of 3.

Okey, so now it’s time to start the work.

Preparing Data from Machine Learning

Now that we understand the data, we can implement some machine learning methods to give a prediccion of future applicant’s chance of admission.

chances=students['Chance of Admit']
sections=students.drop('Chance of Admit',axis=1)

X_train, X_test, Y_train, Y_test=train_test_split(sections, chances, test_size=0.2, random_state=42)

You can see that I divide my DataFrame in two, one that only includes the Chance of Admit column and other that had all the others columns.

Like i’m gonna use Linear Regression for this problem, I had the following equation:

\begin{equation} y=b_0+b_1x_1+b_2x_2+b_3x_3+\cdots \end{equation}

where $x_1,x_2,\cdots,x_n$ are the variables. In this case, all the columns (TOELF Score, CGPA and others). All that variables can affect to the result, our target $y$, that is the column Chance of Admit.

I can reescribe my equation to:

\begin{equation} y=b_0+\sum_{i=1}^nb_nx_n \end{equation}

So, $y=chances$ and $x_n$ are the sections. But i’m not gonna use all my dataset, i’m gonna split my data to get four variables. This variables are $x_{\text{train}},x_{\text{test}},y_{\text{train}},y_{\text{test}}$. In the command, I use $80%$ of my data to train my model and i’m gonna test it with the restant $20%$ to see of accurate it’s the model.

Now it’s time to work with Multiple Linear Regression.

Machine Learning

Now we’ll implement the algorithms that let us predict the chance of admission from future students.

reg= LinearRegression()
reg.fit(X_train,Y_train)
Y_predict=reg.predict(X_test)
reg_score=(reg.score(X_test,Y_test))*100
print('The model has an accuracy of {0:.2f}%'.format(reg_score))

\begin{equation} \text{The model has an accuracy of }81.92% \end{equation}

That is a good accuracy! Now, you can play with this model and also you can predict your own chance of admition. It’s really cool! Using the following data from a imaginary student

TOELF Score = 90
University Rating = 3
SOP = 4.5
LOR = 4.5
CGPA = 8.75
Research = 1

what is he chance of admit?

#Test values
test_student=np.array([90,3,4.5,4.5,8.75,1])
test_student=test_student.reshape(1,-1)


predict=(reg.predict(test_student)[0]*100)
print('Your chance of admit is {0:.2f}%'.format(predict))

\begin{equation} \text{Your chance of admit is }70.31% \end{equation}

That’s all for this post! You can check the jupyter notebook from this problem here.