by Jinhong Wu @.jinwu / 4:30 PM EDT, September 19, 2024.
In 2023, BusyQA and the Ministry of Citizenship & Multiculturalism entered into a partnership to support Black youth across Ontario looking to gain employment in innovative technology sectors such as AI, Data Science, Cybersecurity, Cloud, Web Development, and Test Automation. With the announcement came a flood of applicants from across Ontario wanting a spot in the program: with only 30 spots available, we received over 1,200 applications.
With two full-time admission officers working on a one-month schedule, we knew we had our work cut out for us in reviewing and selecting the right participants. To be more efficient, we decided to lean on AI and set up data points across the admission process. The admission and student performance data were used to build and train our admission AI bot, Taylor. She is a green-eyed machine that knows a lot about BusyQA students and their varying levels of ambition.
We hired Jinhong Wu, an AI Consultant, to lead a team that included many of our Data Science graduates in building and training Taylor, our first BusyQA AI bot. This blog describes the technical approach taken by Jinhong and the outcome predictions from Taylor. Happy reading!
The Leap Admission Project
Jinhong Wu is an AI Consultant who has worked with the Royal Bank of Canada and the University of Toronto. She led a team of six BusyQA data science engineers and reported to the head of the AI & Data Science program.
1.0 Project Objectives
BusyQA is revolutionizing its admission process for the Leap Program by tapping into the power of Artificial Intelligence. The goal of this Leap admission project is to pinpoint the crucial pre-entry characteristics that drive student success and refine BusyQA's admission criteria accordingly. Using advanced Machine Learning models, we will assign higher marks to the key pre-entry traits, ensuring we select the most suitable candidates. In the end, we will answer the following questions based on the results generated from the Machine Learning models:
What pre-entry features are essential for driving student success?
Based on the insights from the Machine Learning models, how would you recommend the BusyQA team redistribute marks for each pre-entry feature?
2.0 Data Overview
The data for this project was collected from several Excel files containing information on
student attendance, final grades, and demographic details across various courses. The specific files used were:
AI & Prompt Engr_Student Attendance.xlsx
Final Grade Report - AI & Prompt Engineering.xlsx
Data Science & Machine Learning_ Student Attendance.xlsx
Final Grade Report - Data Science & Machine Learning.xlsx
Test Automation_Student Attendance.xlsx
Final Grade Report - Test Automation Engineering.xlsx
Cloud Computing_Student Attendance.xlsx
Final Score Report - Cloud Computing.xlsx
Student Demographic info.xlsx
The data was taken from the first cohort of the Leap program and covers various details about each student, from personal characteristics to performance metrics in their respective classes. There are 26 students and 18 pre-entry features. A master sheet combining all of these tables was created, and a unique identifier (UID) was generated for each student based on the course they enrolled in and their ID in that class, as shown in the table below.
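As a rough illustration of that step, the sketch below shows how a master sheet and UID of this kind could be assembled with pandas. The column names (such as "Student ID") and the merge keys are assumptions for illustration, not the actual workbook headers.

```python
import pandas as pd

# Load one course's workbooks (file names from the list above) plus demographics.
attendance = pd.read_excel("AI & Prompt Engr_Student Attendance.xlsx")
grades = pd.read_excel("Final Grade Report - AI & Prompt Engineering.xlsx")
demographics = pd.read_excel("Student Demographic info.xlsx")

# Assumed layout: each sheet identifies students by an in-class "Student ID".
course_df = attendance.merge(grades, on="Student ID", how="outer")
course_df["Course"] = "AI & Prompt Engineering"

# Build a UID from the course name and the in-class student ID,
# e.g. "AI & Prompt Engineering_3".
course_df["UID"] = course_df["Course"] + "_" + course_df["Student ID"].astype(str)

# Repeating this for the other three courses, concatenating the results, and
# joining the demographic sheet yields the master sheet used below.
master = course_df.merge(demographics, on="Student ID", how="left")
```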
3.0 Data Pre-processing Methodology
Data preprocessing is essential in machine learning to improve data quality, enhance model performance, and ensure compatibility with algorithms. It involves tasks like handling missing values, encoding categorical data and scaling features, ultimately leading to more accurate and efficient models.
3.1 Student pre-entry data pre-processing:
For data cleaning, the team replaced NaN values with "No info" instead of removing rows with missing values. This prevents data loss and preserves the sample size, which matters because the number of pre-entry features (18) is close to the number of students (26).
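A minimal sketch of this step, assuming the master sheet is loaded in a pandas DataFrame named master:

```python
# Fill missing values in the categorical pre-entry columns with the literal
# string "No info" instead of dropping the affected rows.
categorical_cols = master.select_dtypes(include="object").columns
master[categorical_cols] = master[categorical_cols].fillna("No info")
```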
The team used encoding methods to convert categorical columns into numerical values so that the machine learning models could understand them. Both Label Encoder and One Hot Encoder were used since some columns, like income, have an inherent order, while others, like gender, do not. These transformations ensure that categorical variables are properly represented in a numerical format compatible with most algorithms.
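A hedged sketch of the encoding step with pandas and scikit-learn. The column names are illustrative, and which columns are label-encoded versus one-hot encoded is an assumption based on the description above; the one-hot style feature names seen in Section 5 (e.g. Gender_Male) come from the second path.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Ordered categories (e.g. income brackets) -> integer codes.
# Note: LabelEncoder assigns codes alphabetically, so an explicit mapping or
# OrdinalEncoder(categories=[...]) is needed if the true order must be preserved.
le = LabelEncoder()
master["Personal Annual Income"] = le.fit_transform(
    master["Personal Annual Income"].astype(str)
)

# Unordered categories (e.g. gender) -> one-hot indicator columns,
# producing features such as "Gender_Male" and "Gender_Female".
master = pd.get_dummies(master, columns=["Gender", "Country of Birth"])
```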
The team used Min-Max scaling, or normalization, to transform numerical features to a fixed range, typically between 0 and 1, while preserving relative differences between them. This ensures consistency, facilitates improved convergence of machine learning algorithms, maintains relationships between data points, and is applicable to features with clear minimum and maximum values.
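And a short sketch of the scaling step, again with illustrative column names:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale numerical pre-entry features (e.g. Age, Number of Dependants)
# to the [0, 1] range while preserving their relative differences.
numeric_cols = ["Age", "Number of Dependants"]
scaler = MinMaxScaler()
master[numeric_cols] = scaler.fit_transform(master[numeric_cols])
```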
3.2 Student post-entry data pre-processing:
Attendance Percentage was calculated by dividing the total number of classes attended by the total number of classes for each student, as shown below:
Attendance Percentage = (classes attended ÷ total classes) × 100
Overall Course Percentage represents the proportion of course material a student
completed on the website.
Adjusted Grade Score was calculated by dividing the marks a student obtained by the total possible marks, with a small penalty applied if the total marks were less than those of other students in the class (e.g., if a student missed an assignment, midterm, or final).
A final weighted score to measure a student's success in their class was calculated using the following formula: Weighted Score = 70% * Adjusted Grade Score + 20% * Attendance Percentage + 10% * Overall Course Percentage. The team considers the adjusted grade score to be the most important performance measure, followed by attendance and the overall course percentage, which reflects how much of the course material on the website a student has completed.
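A short sketch of how that weighted score can be computed once the three components are available (assuming all three are on a 0-100 scale and stored under these illustrative column names):

```python
# Weighted Score = 70% Adjusted Grade Score + 20% Attendance + 10% Course Percentage
master["Weighted Score"] = (
    0.70 * master["Adjusted Grade Score"]
    + 0.20 * master["Attendance Percentage"]
    + 0.10 * master["Overall Course Percentage"]
)
```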
4.0 Exploratory Data Analysis
Below is a detailed analysis of the data with visualizations:
Firstly, we look at the age ranges in each of the courses. The age range for the entire cohort is 18 to 34. We see that the youngest members of this cohort overwhelmingly chose the AI & Prompt Engineering course, while the older members were more drawn to the Cloud Computing course.
Another relationship we looked at was the students' country of birth. It is clear from this graph that the majority of students hail from Nigeria, with Canada being the second most common country. There is also one student each from Ethiopia, Kenya, and Saudi Arabia.
As shown above, the majority of students do not have dependents, while there are 5 students with dependents.
When comparing the number of students who are employed to the number who are unemployed, we found that the figures are fairly similar, though slightly skewed toward more students being unemployed.
Most students in the program consider themselves underemployed, meaning they are overqualified for their current job. According to Statistics Canada, approximately 16% of Black Canadians in the workforce are overqualified for their jobs.
The majority of the students in The Leap program are permanent residents of Canada. The second-largest group is Canadian citizens, with one student having refugee status.
Most students had already completed a master's degree before joining the program, which helps explain why so many feel underemployed. The second most selected option is a high school diploma, followed by a college diploma. There was no option for an undergraduate degree.
The majority of the students made under $40K in 2022, highlighting the very real issue of low wages and poverty in Canada.
This graph displays the marital status of all students. As shown, the vast majority of students are single (meaning unmarried), while very few are either married or in a common-law relationship.
Household income was also examined as part of the EDA, and we found that it was still under $40K. This was expected, given that most of the students are single.
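The charts in this section can be reproduced with standard plotting libraries. Below is a hedged sketch using seaborn and matplotlib on the raw (pre-encoding) master sheet, with illustrative column names such as Course and Country of Birth:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Age distribution per course.
sns.boxplot(data=master, x="Course", y="Age")
plt.xticks(rotation=30)
plt.title("Age range by course")
plt.tight_layout()
plt.show()

# Counts for a categorical demographic, e.g. country of birth.
sns.countplot(data=master, x="Country of Birth")
plt.xticks(rotation=30)
plt.title("Students by country of birth")
plt.tight_layout()
plt.show()
```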
5.0 Model Results and Interpretation
To identify the crucial pre-entry features that help BusyQA select the best candidates, the team is training machine learning models to accurately predict student performance. Based on the trained model, the team will interpret which features are significant for making performance predictions. Since this is a regression problem, the Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are chosen as the evaluation metrics. A lower MSE or RMSE, closer to 0, indicates a better fit, implying that the predictions are closer to the actual values.
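For concreteness, here is how the two metrics relate in code, using toy values that are for illustration only (not actual project results):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy actual vs. predicted weighted scores, for illustration only.
y_true = np.array([78.0, 65.5, 90.2, 81.3])
y_pred = np.array([74.1, 70.0, 85.6, 80.0])

# MSE averages the squared prediction errors; RMSE is its square root,
# expressed in the same units (points) as the weighted score.
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"MSE = {mse:.2f}, RMSE = {rmse:.2f}")
```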
5.1 Linear Regression
The linear regression model is a relatively simple model that suits our dataset well, as the number of columns (18 pre-entry features) is close to the number of rows (26 students), and more complex models would tend to overfit. Linear regression results are also easy to interpret: the coefficient on each feature indicates that feature's importance in predicting the final weighted score, and the larger the absolute value of a coefficient, the greater the impact of that feature. However, it is worth noting that a linear regression model assumes a linear relationship between a student's characteristics and their weighted score. Within this family, the team tried 3 different models: simple linear regression, ridge regression, and lasso regression.
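A hedged sketch of how these three models and the alpha sweep might be set up with scikit-learn. The train/test split, the exact alpha grid, and the column names are assumptions; only the alpha range 0.01-100 is taken from the results discussed below.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# X: encoded and scaled pre-entry features; y: the final weighted score.
X = master.drop(columns=["UID", "Weighted Score"])
y = master["Weighted Score"]

# With only 26 students the split is tiny, so results vary with random_state.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {"Linear Regression": LinearRegression()}
for alpha in [0.01, 0.1, 1, 10, 100]:
    models[f"Ridge (alpha={alpha})"] = Ridge(alpha=alpha)
    models[f"Lasso (alpha={alpha})"] = Lasso(alpha=alpha)

for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse:.2f}, RMSE = {np.sqrt(mse):.2f}")
```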
Analysis of Regression Models:
Linear Regression: MSE and RMSE are quite high, indicating that the model might not be fitting the data well.
Ridge Regression: As the alpha value increases from 0.01 to 1, both MSE and RMSE decrease, indicating an improvement in model performance. However, at alpha values of 10 and 100, the performance slightly degrades, but it is still better than Linear Regression.
Lasso Regression: The lowest MSE and RMSE are achieved at alpha = 1, which indicates the best performance among the Lasso models. However, the performance degrades significantly for alpha = 10 and 100, suggesting over-penalization.
Below is a graph showing the trend in the models as we gradually increase alpha.
Coefficient Analysis
Ridge Regression: Retains more features with non-zero coefficients, indicating it is less aggressive in feature selection and tends to include more features in the model.
Lasso Regression: More aggressive in feature selection, driving many coefficients to zero, especially for higher alpha values. This indicates a stronger ability to perform feature selection, potentially leading to a more interpretable model with fewer features.
Performance Metrics: Ridge Regression showed the lowest MSE and RMSE for the
best-performing alpha values. For example, with alpha = 10, the MSE = 1011.92 and RMSE = 31.81, which is significantly lower than the MSE and RMSE for both Linear and Lasso models.
Conclusion
Best Model: Based on MSE and RMSE, the Ridge Regression model with alpha = 10 appears to be the best model. It has the lowest MSE (1011.92) and RMSE (31.81), indicating the best predictive accuracy and model fit among all models evaluated. Its ability to handle multicollinearity by penalizing the magnitude of coefficients also leads to more stable and generalizable predictions. Lasso, while useful for feature selection, did not perform as well in this case and might risk omitting potentially relevant features.
Top 10 Important Features
For the best model (Ridge Regression with alpha = 10), the top features influencing the
weighted score, based on the magnitude of their coefficients, are:
1. Highest Level of Education_No Info - 3.894136
2. Household Income_Under 40K - 3.586457
3. Highest Level of Education_Masters - 2.884406
4. Current Job Title_Part-time sales association - 2.583920
5. Martial Status_Married - 1.980128
6. Number of Dependants - 1.671337
7. Are you Underemployed_Yes - 1.559359
8. Personal Annual Income_No Info - 1.499945
9. Gender_Male - 1.369423
10. Dependants_Yes - 1.245214
These features have the highest absolute coefficient values and thus have the most significant impact on predicting the weighted score. Below, after a brief code sketch, is a graph showing the Top 10 Features:
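A brief sketch of how such a ranking can be read off the fitted model, reusing X_train and y_train from the regression sketch above:

```python
import pandas as pd
from sklearn.linear_model import Ridge

# Fit the best-performing configuration (alpha = 10) and rank features
# by the absolute value of their coefficients.
ridge = Ridge(alpha=10).fit(X_train, y_train)
coef = pd.Series(ridge.coef_, index=X_train.columns)
print(coef.abs().sort_values(ascending=False).head(10))
```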
5.2 Decision Tree and Random Forest
Decision Trees and Random Forests are among the most common machine learning models used in industry; they can not only solve regression problems but also capture non-linear relationships in the dataset. Tree-based models also provide interpretability: the importance values reflect how much each feature contributes to the prediction of the target variable (weighted score). Here are the top 10 important features along with their importance scores:
1. Age: 0.204001
2. Current Job Title_Part-time sales association: 0.203381
3. Highest Level of Education_Masters: 0.140337
4. Highest Level of Education_No Info: 0.061107
5. Household Income_Under 40K: 0.054298
6. Martial Status_Married: 0.045639
7. Canadian Status_PR: 0.029642
8. Current Job Title_Unemployed: 0.023260
9. Current Job Title_No Info: 0.023116
10. Martial Status_Single: 0.022178
What These Results Mean
Feature Importance: The importance scores indicate which features are most influential in predicting the weighted score. Features like Age, Current Job Title, and Education Level have high importance, suggesting they play a significant role in determining the outcome.
Model Performance: The Random Forest model has provided a relatively good fit to the data, as evidenced by the RMSE value.
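A hedged sketch of this step, again reusing the train/test split from the earlier regression sketch; the hyperparameters shown are scikit-learn defaults plus a fixed random seed, not the team's final settings:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Fit a Random Forest regressor on the pre-entry features.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate with RMSE, the same metric used for the linear models.
rmse = np.sqrt(mean_squared_error(y_test, rf.predict(X_test)))
print(f"Random Forest RMSE: {rmse:.2f}")

# Impurity-based feature importances, sorted to give a top-10 ranking.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```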
In the tree-based models, pruning techniques will be applied and other hyperparameters will be tuned to achieve better model performance. The team will continue to optimize the model and reduce the MSE/RMSE values. After finalizing the tree model, they will rank the importance of the different features and decide how many marks should be assigned to each pre-entry feature to help BusyQA select the best students.
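One common way to do this kind of tuning is a cross-validated grid search; the sketch below is a generic example of that approach (the parameter grid is illustrative, not the team's actual search space):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Limiting depth and leaf size acts like pruning, which helps avoid
# overfitting on a dataset of only 26 students.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```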
If you want to learn how to build your own AI bot, or are curious about how Artificial Intelligence can make you more productive at work, consider taking a course from BusyQA. Our courses are taught by highly trained instructors and offer hands-on work experience through co-op opportunities. Our AI Consultants can also help you customize a digital AI solution to help your business improve efficiency and reduce costs.