Predict Students' Dropout and Academic Success

In this project we analyze data from the Students Dropout Repository, collected at a higher education institution and covering students from various undergraduate programs. The dataset includes academic history, demographic information, and socio-economic factors. Our goal is to predict whether a student will fail or succeed.

Exploratory Data Analysis (EDA)

How does the distribution of enrollment ages vary across different outcomes?

admission histogram

We can see that the dropout rate among students aged 25 to 40 is higher than in the other age groups. This may suggest that the likelihood of graduating decreases as age increases.
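One way to check this kind of observation is to compare outcome shares per age band rather than raw counts. The following is a minimal sketch on a small made-up table: the column names `age_at_enrollment` and `target`, the age bands, and the rows themselves are all assumptions for illustration, not the dataset's actual schema or values.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "age_at_enrollment": [18, 19, 20, 22, 26, 30, 35, 38, 45, 19],
    "target": ["Graduate", "Graduate", "Dropout", "Enrolled", "Dropout",
               "Dropout", "Graduate", "Dropout", "Graduate", "Graduate"],
})

# Group ages into bands so that rates are compared instead of raw counts.
# With the default right-closed intervals these are (16, 25], (25, 40], (40, 70].
bins = [16, 25, 40, 70]
labels = ["17-25", "26-40", "41+"]
df["age_band"] = pd.cut(df["age_at_enrollment"], bins=bins, labels=labels)

# Share of each outcome within each age band (every row sums to 1).
outcome_share = pd.crosstab(df["age_band"], df["target"], normalize="index")
print(outcome_share.round(2))
```

Normalizing by row matters here because the younger bands contain far more students, so raw counts alone would hide differences in the dropout rate.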

Are there any noticeable differences in the courses taken between students who drop out and those who successfully graduate?

course bar plot

We observe distinct trends across different courses. For instance, Informatics Engineering (9119) and Equinculture (9130) show a higher likelihood of student dropout, whereas Nursing (9500) and Social Service (9238) exhibit a greater probability of student graduation.

How do the distributions of admission grades differ between students who drop out, those who remain enrolled, and those who graduate?

admission grade histogram

Although the distributions appear similar, the Dropout group shows a higher concentration at the minimum grade.

Are there any interesting patterns when plotting multiple demographic variables (e.g., gender, nationality, parental qualifications) against each other in relation to student outcomes?

gender barplot

As we can see, a considerably larger share of the Male group falls in the dropout group.

mother occupation barplot

After filtering for the most common occupations, it is notable that 'Student' (0) and 'Other situation' (90) have a larger representation in the dropout group.

father occupation barplot

Once again, 'Student' (0) and 'Other situation' (90) show a higher presence in the dropout group.

How does the distribution of all curricular credits look depending on the student category?

units 1st semester histograms

units 2nd semester histograms

As observed, the dropout group has a significant number of individuals with zero completed curricular units.

Modeling

We have implemented three machine learning models. The first, a Decision Tree Classifier, serves as our baseline model, providing a simple and interpretable approach that helps us evaluate the performance of more advanced models. The second, Logistic Regression, is a more robust model, particularly effective for classification when the relationship between features and the outcome is approximately linear. Finally, the Random Forest Classifier offers a more sophisticated solution by combining multiple decision trees to improve accuracy and reduce overfitting.
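As a rough illustration of this setup, the sketch below trains the same three model types with scikit-learn and collects an accuracy and a confusion matrix for each. It is not the project's actual code: the data is synthetic (generated with `make_classification`), and the hyperparameters, the train/test split, and the scaling step in front of Logistic Regression are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the student features and outcomes.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The three model families described above; settings are illustrative assumptions.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

accuracies = {}
matrices = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)
    # Rows are true classes, columns are predicted classes.
    matrices[name] = confusion_matrix(y_test, y_pred)
    print(f"{name}: accuracy = {accuracies[name]:.2f}")
```

Evaluating all three models on the same held-out split keeps the comparison fair; the numbers printed here come from synthetic data and say nothing about the real dataset.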

Below are the confusion matrices for the three models.

decision tree

Global Accuracy of the model: 0.68

logistic regression

Global Accuracy of the model: 0.75

random forest

Global Accuracy of the model: 0.77

Overall, the Random Forest Classifier delivers better performance compared to the other models. However, as seen in all confusion matrices, there is a significant imbalance in the "Graduate" and "Enrolled" groups. In other words, while the models perform well when predicting dropouts, their accuracy goes down when predicting the other categories. This discrepancy is likely due to the data imbalance itself, which skews the models' predictions.
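Global accuracy can hide this kind of per-class difference, so it helps to read recall for each class off the confusion matrix. The sketch below does that in plain Python. The matrix is made up purely for illustration and does not reproduce the project's results; rows are assumed to be true classes and columns predicted classes.

```python
# Made-up confusion matrix for illustration only; these are NOT the project's results.
# Rows are true classes, columns are predicted classes.
classes = ["Dropout", "Enrolled", "Graduate"]
confusion = [
    [80, 10, 10],
    [20, 30, 50],
    [5, 5, 90],
]

# Global accuracy: correctly classified samples (the diagonal) over all samples.
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
accuracy = correct / total

# Recall per class: correct predictions divided by all true samples of that class.
recall = {
    name: confusion[i][i] / sum(confusion[i])
    for i, name in enumerate(classes)
}

print(f"accuracy: {accuracy:.2f}")
for name, value in recall.items():
    print(f"recall[{name}]: {value:.2f}")
```

In this toy matrix one class has much lower recall than the global accuracy would suggest, which is the pattern worth checking for in the real confusion matrices.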

Despite this imbalance, we have developed a reliable model to analyze student situations. This is valuable for institutions, as it enables them to take proactive measures when addressing the risk of student dropouts.

Thanks for reading! 🧑‍💻💕