This is an entry to an online hackathon competition detailed at datafordiplomas.devpost.com. Please refer to that site for more information.
Dropping out of high school is a critical issue in today's society that contributes to many of the nation's long-term problems. To tackle it, we need to understand the precise cause-and-effect relationships.
Our goal is to use data to derive actionable recommendations to help increase the national graduation rate to 90% by 2020.
We found that the most important factors for improving graduation rates are household stability and economic security. We also found that weather plays an important role. On the other hand, school spending and low food access may not be very important for improving graduation rates, and investments to improve them show sharply diminishing returns. Our best model reached an R^2 of 0.68 but used 150 features. Our final selected model reached an R^2 of 0.6 with only 7 features.
- Coming from an educated household was NOT a strong predictor of whether a student would graduate high school.
- We found that the percent of people living under the poverty line was a much stronger predictor than median household income. This implies that it's not necessary to raise incomes for everyone, but just for the neediest families. This helps to target policy in the right direction.
- Salaries and wages had an impact on graduation rates, but NOTHING else did. Investments need to go to the right place!
- Food access does NOT have a big impact on graduation rates.
- Weather in your area IS impactful!
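For readers unfamiliar with the R^2 figures quoted in the summary above, here is a minimal sketch of how the coefficient of determination is commonly computed from observed and predicted values. The function name and the sample numbers are made up for illustration and are not taken from our models.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up graduation rates (percent), for illustration only.
observed = [70, 80, 90, 100]
predicted = [75, 80, 85, 100]

score = r_squared(observed, predicted)
print(round(score, 2))
```

An R^2 of 1 means the predictions match the observations exactly, while 0 means the model does no better than predicting the mean for every district.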
What We Did
Data cleanup. The dataset provided, although detailed, had a major flaw: schools with cohort sizes under 60 students had graduation rates reported in broad buckets with a very large margin of error.
Integration of many third-party datasets at the tract or school district level, e.g. weather, food availability and school financials.
Deep analysis of the various hypotheses
Prediction of how specific investments in the key predictive features would affect the graduation rate, in order to identify the areas where investment has the maximum impact.
Interactive visualization tools to explore the final merged dataset, understand the key predictive features and deep dive into the analysis we performed.
How we built it
Most of the work was in the analysis and identifying actionable recommendations at the national level. In addition, we built a set of interactive visualization tools on a web platform in order to help the user understand what we accomplished.
The data required significant cleaning and deep understanding. First, school districts with cohort sizes under 60 students had graduation rates reported in large buckets. In fact, for cohorts of 6-15 students, the rate was reported only as either <50% or ≥50%. This could cause significant inaccuracies in our model, so we decided to exclude these cohorts from our modeling.
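The cohort filter described above can be sketched roughly as follows. This is only an illustration: the record layout, the field names (`cohort_size`, `grad_rate`) and the sample rows are assumptions, not the actual schema of the competition dataset.

```python
# Threshold taken from the text: cohorts under 60 students are excluded.
MIN_COHORT_SIZE = 60


def drop_small_cohorts(records, min_size=MIN_COHORT_SIZE):
    """Keep only records whose cohort has at least `min_size` students."""
    return [r for r in records if r["cohort_size"] >= min_size]


# Made-up rows with hypothetical field names, for illustration only.
districts = [
    {"district": "A", "cohort_size": 12, "grad_rate": ">=50%"},
    {"district": "B", "cohort_size": 59, "grad_rate": "80-84%"},
    {"district": "C", "cohort_size": 60, "grad_rate": "87%"},
    {"district": "D", "cohort_size": 240, "grad_rate": "91%"},
]

modeled = drop_small_cohorts(districts)
print([r["district"] for r in modeled])  # prints ['C', 'D']
```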
Next, we started from the unmerged dataset, so we needed to merge the tract-level data for each school into district-level data. The technique mentioned in the documentation treats all metrics the same way, which is incorrect. We applied the weighted average only to absolute-value metrics and recalculated the percentage metrics.
This project entailed an end-to-end data science methodology and implementation: building a comprehensive data story, deriving actionable recommendations using predictive analytics, and finally creating an interactive visualization web platform to communicate the story.
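As a rough sketch of the distinction between the two kinds of metrics, the snippet below aggregates tracts into one district. It makes several assumptions that are not spelled out in the text: that tract population is the weight, that median income stands in for an absolute-value metric, and that a percentage metric is recalculated from summed counts. All field names and numbers are hypothetical.

```python
def aggregate_tracts(tracts):
    """Combine tract-level rows into one district-level row.

    Assumption: population is used as the weight for absolute-value metrics,
    and percentage metrics are rebuilt from summed counts.
    """
    total_pop = sum(t["population"] for t in tracts)

    # Absolute-value metric: population-weighted average across tracts.
    weighted_income = (
        sum(t["median_income"] * t["population"] for t in tracts) / total_pop
    )

    # Percentage metric: recalculated from the underlying counts instead of
    # averaging the tract-level percentages.
    pct_below_poverty = (
        100.0 * sum(t["below_poverty"] for t in tracts) / total_pop
    )

    return {
        "population": total_pop,
        "median_income": weighted_income,
        "pct_below_poverty": pct_below_poverty,
    }


# Made-up tracts belonging to a single hypothetical district.
tracts = [
    {"population": 1000, "median_income": 40000, "below_poverty": 200},
    {"population": 3000, "median_income": 60000, "below_poverty": 300},
]

district = aggregate_tracts(tracts)
print(district)
```

Rebuilding a percentage from counts means the result does not depend on the chosen weight matching that percentage's own denominator.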
Recommendations (More in the notebook on the site)
Given that Household Stability and Economic Security are the biggest factors, here are a few actions to take:
1) Continued investment in community development corporations (CDCs) to strengthen communities from a financial and social perspective.
2) Initiatives like the Low Income Housing Tax Credit and the New Markets Tax Credit have brought vital services to low-income communities. Check out this article for more information: http://www.whatworksforamerica.org/ideas/the-future-of-community-development/#.VkqW59-rRE4
1) We found that more instructional spending tended to increase graduation rates, while other types of spending tended to decrease them. Thus, all else being equal, it seems that schools should allocate more money to instructional spending rather than to support or administrative spending.
1) Colder areas tend to have lower graduation rates than warmer areas. One way to design around this problem is to use alternate school schedules in colder areas. For these schools, it might be better to have a longer break during the winter months and a shorter break during the summer months.
Look into a model of graduation rates per race to understand which features make each race's graduation rate different.