In [712]:
import pandas as pd
import os
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from random import randint
import random
import seaborn as sb
import charts


# Executive Summary¶

What factors determine high school graduation rates? In this analysis, we build a random forest model using ~140 variables from census, school financials, weather, and food access data. Our results indicate that the most important factors determining high school graduation rates are economic security, household stability, and demographic variables. Importantly, we find that household stability, even when controlling for many other variables, is the most important factor next to socioeconomic status. We also find that weather plays an important role: colder areas tend to have lower graduation rates. Lastly, we find that spending on instruction (as opposed to other types of spending, such as on student support or administration) is the only type of spending that tends to improve graduation rates. However, our model predicts that our national graduation rate will not improve by increasing instructional spending more than its current level.

In terms of predictive power, our best random forest model has an out-of-bag score of 0.68, which is equivalent to an out-of-sample R2 score of 0.68.

As a result of our analysis, our recommendations are to focus more attention on creating a stable and nourishing household environment for students, finding ways for schools in colder areas to combat inclement weather, prioritizing improving quality of curriculum over increased instructional spending, and finding ways to release more precise graduation rate data without infringing on privacy rights.

Our analysis proceeds as follows: Part I outlines our theoretical foundations, hypotheses, and data sources. Part II deals with problems with the graduation rate data. Part III explores the relationship between the graduation rate data and our key variables of interest. Part IV builds and evaluates multiple predictive models. Part V ends with recommendations and conclusions.

# Part I: Theoretical foundations, hypotheses, and data sources¶

What determines high school graduation rates? Drawing from ecological systems theory, we focused on the environments that students interact with as the basis for their academic success. We then thought of specific factors within these environments that we thought might influence high school graduation rates.

The four major environments we looked at were: the household, school, neighborhood, and region. The following diagram helps to visualize this framework:

Now let's get into some of the details about each of these environments.

1) Household

Students spend more time at home then they do at school, which makes the home an important environment to dig into. We think that household structure and stability probably played important roles in a student's success.

Hypothesis 1: Students that live in more educated households will be more likely to graduate from high school since expectations of graduating high school are higher and parents probably exhibit better parenting patterns.

Variables:

• Percent of people over 25 years old that graduated high school

• Percent of people over 25 years old that graduated college

Hypothesis 2: Students that live in stabler households will be more likely to graduate from high school since they have to deal with fewer transitions and disruptions in their family lives.

Variables:

• Percent of households with married couple

• Percent of households with female only householders (no husband)

• Percent of households where the householder moved in during the past year

• Percent of people who had a different house 1 year ago

Hypothesis 3: Students that live in economically secure households will be more likely to graduate from high school because they have greater access to resources and less psychological strain.

Variables:

• Percent of people under the poverty line

• Median household income

Data: The data source and documentation for all of these variables can be found in the provided 2010 Census Data. These variables are at the tract level, which we weighted and merged with the graduation rates data.

2) School

Obviously, a student's school is a relevant environment to look at when trying to determine high school graduation rates. We focused our attention on how schools allocate financial resources across different areas, and how that allocation affects graduation rates.

Hypothesis 4: Schools that spend more money on instruction tend to have better graduation rates compared to schools that have higher support spending or adminstrative costs.

Variables:

• Total Instructional Spending, per pupil ($) - total spending dedicated to instruction • Total Instructional Spending on Salaries and Wages, per pupil ($) - salaries and wages dedicated to instruction

• Total Instructional Spending on Benefits, per pupil ($) - benefits dedicated to instruction • Total Support Spending, per pupil ($) - expenditure on on secondary services in a school, such as administration, maintenance, transportation, and social services.

• Total Support Spending on Pupil Support, per pupil ($) - expenditure on service such as social services and record keeping • Total Support Spending on Instructional Staff Support, per pupil ($) - expenditure services such as curriculum development and trainings

• Total Support Spending on General Administration, per pupil ($) - expenditure on board of education and executive administration services • Total Support Spending on School Administration, per pupil ($) - expenditure on office of principal services

Data: we used school expenditure data for the 2011-2012 school year from http://www.census.gov/govs/school/historical_data_2012.html. All of these variables are at the school district level.

For more precise defintions of the instruction and support spending variables, see the appendix of this report: http://www2.census.gov/govs/school/12f33pub.pdf

3) Neighborhood

There are many variables at the neighborhood level that we could look at, but note that many of them are captured at the household level. For example, the percent of people with college degrees can indicate how educated the average household is, but also captures the general attitude towards education in that district.

Instead, we focused on a more geographic variable in the neighborhood environment: food access.

Hypothesis 5: Students that live in districts with greater food access are healthier and perform better in school, leading to higher graduation rates.

Variables:

• Percentage of tracts in a district that have low food access. (The precise definition of this variable be found in the documentation of the dataset that we used below, but generally a low food access tract is where a certain number of people in the tract are 1 mile away from a grocery store in urban areas, or 20 miles away from a grocery store in rural areas.)

Data: The data and documentation for determining low access districts can be found here: http://www.ers.usda.gov/data-products/food-access-research-atlas/download-the-data.aspx. The data is provided for the 2010 year at the tract level, which we aggregated up to the district level.

4) Region

Following the theme of increasing environment sizes, the next environment we considered is a student's region. The particular feature we focused on in this environment were weather patterns, which we operationalized in the data using temperature data.

Hypothesis 6: Schools in extreme temperature areas tend to have lower graduation rates compared to schools in more moderate climates because of higher absentee and school cancellation rates.

Variables:

• Average daily maximum temperature during 2011 (degrees Farenheit)
• Average daily minimum temperature during 2011 (degrees Farenheit)

Data: we used 2011 temperature data from http://www.ncdc.noaa.gov/temp-and-precip/climatological-rankings/, produced by the the National Centers for Environmental Information. The data is at the county level, which we merged onto the county and state codes of the highest overlapping tract for each district.

### Data Mapping¶

The image below shows how we combined the various datasets.

# Part II: Exploring the dependent variable: The graduation rate¶

## Major problems with the graduation rate data¶

Before we can explore the relationships between the data we collected and graduation rates, we need to understand and address any problems with our graduation rate data.

#### Problem 1 with the graduation rate data: Ranges were not cleaned properly¶

At first, we were going to use the graduation rates from the merged dataset that the organizers had provided us with because these graduation rates were supposedly cleaned. The original data had ranges and non-numeric characters in the dataset. To clean up the graduation rates, the organizers had removed all non-numeric characters and selected the median value for all ranges

However, we discovered an issue with how the graduation rates had been cleaned. We looked at the documentation of the graduation rates dataset and found that the some of the graduation rates had been masked because of privacy concerns. If the school had too few students, the graduation rate would either not be given, or it would be given as ranges of varying sizes depending on the size of the school. The different tiers of ranges are listed below:

Ranges Used for Reporting the Graduation Rate for that Group

<5 students in cohort: "PS" (i.e. the graduation rate is not given)

6-15 students in cohort: <50%, ≥50%

16-30 students in cohort: ≤20%, 21-39%, 40-59%, 60-79% ≥80%

31-60 students in cohort: ≤10%, 11-19%, 20-29%, 30-39%, 40-49%, 50-59%, 60-69%, 70-79%, 80-89%, ≥90%

61-200 students in cohort: ≤5%, 6-9%, 10-14%, 15-19%, 20-24%, 24-29%, 30-34%, 35-39%, 40-44%, 45-49%, 50-54%, 55-59%, 60-64%, 65-69%, 70-74%, 75-79%, 80-84%, 85-89%, 90-94%, ≥95%

More than 200 students in cohort: ≤1%, 2%, 3%, . . . , 98%, ≥99%

(Note: the documentation says that the last tier of the ranges (>200 students) is actually for >300 students. But if we look at the data, it turns out that any district with >200 students has only a single number given for the graduation rate, and any district with 61-200 students has a 4- or 5-number range (ex. 11-19%))

Now, the organizers claimed that when there was a range, they calculated the median value. The issue is that not all of the ranges were given as explicit ranges. For example, 6-9% or 11-19% would be given as 6-9 or 11-19. The organizers calculated the medians of these ranges correctly. In this example, they would have returned 7.5 and 15, respectively, as the values.

But the database also had ranges such as less than 20% or greater than or equal to 50%, which were given as LT20 ("less than" 20) or GE50 ("greater than or equal to" 50). In these cases, the organizers merely removed the non-numeric characters and returned 20 and 50 as the graduation rates (respectively). If they had applied the same methodology to ranges consistently, they should have calculated the median of 0-20 and 50-100, which would have returned 10 and 75.

The issue is demonstrated in the tables below.

In [713]:
# read in the merged and unmerged dataset
districts_merged.shape

desired_cols = districts_merged.columns[0:27]
districts_merged = districts_merged[desired_cols]
districts_merged.rename(columns = {"ALL_RATE_1112": "clean_grad_rate"}, inplace = True)

districts_unmerged.columns

print "Table 1: Grad Rates for Schools with >200 Cohort Size (Organizer's Data)"
print

print "Table 2: Grad Rates for Schools with 61-200 Cohort Size (Organizer's Data)"
print

print "Table 3: Grad Rates for Schools with 31-60 Cohort Size (Organizer's Data)"
print

print "Table 4: Grad Rates for Schools with 16-30 Cohort Size (Organizer's Data)"
print

print "Table 5: Grad Rates for Schools with 6-15 Cohort Size (Organizer's Data)"
print

print "Table 6: Grad Rates for Schools with 1-5 Cohort Size (Organizer's Data)"

Table 1: Grad Rates for Schools with >200 Cohort Size (Organizer's Data)
0  100005          268                 83            83
1  100006          424                 79            79
2  100007         1042                 91            91

Table 2: Grad Rates for Schools with 61-200 Cohort Size (Organizer's Data)
4  100011          117                 72         70-74
5  100012          155                 92         90-94
8  100060          109                 77         75-79

Table 3: Grad Rates for Schools with 31-60 Cohort Size (Organizer's Data)
76   101980           49               74.5         70-79
81   102130           44               84.5         80-89
134  200004           31               34.5         30-39

Table 4: Grad Rates for Schools with 16-30 Cohort Size (Organizer's Data)
135  200007           20               69.5         60-79
138  200030           16               69.5         60-79
140  200060           26               80.0          GE80

Table 5: Grad Rates for Schools with 6-15 Cohort Size (Organizer's Data)
151  200300            9                 50          GE50
154  200450           15                 50          GE50
167  200680           11                NaN          LT50

Table 6: Grad Rates for Schools with 1-5 Cohort Size (Organizer's Data)
136  200010            3                NaN            PS
152  200360            1                NaN            PS
166  200670            5                NaN            PS


In the above tables, we can see how the organizers handled the graduation rates. The "Raw Grad Rate" column indicates the original value in the dataset. The "Cleaned Grad Rate" shows how the organizers cleaned the raw graduation rate.

For ranges indicated with hyphens (such in Tables 2, 3, and 4), the organizers took the median of the two numbers. But some of the ranges are indicated with inequality operators, such as GE80 (greater than or equal to 80) or LT50 (less than 50) (see tables 4 and 5). The median value should have been calculated for these ranges as well, but they were not. Instead, the non-numeric characters were removed without the median calculation.

So for example, GE80 became just 80, even though the median value between 80 and 100 is 90. Similarly, LT50 became just 50, even though the median value between 0 and 50 is 25.

After realizing this, we decided to clean up the grad rate ourselves. The following charts show how our cleaned graduation rate compares.

In [714]:
grad_rates = pd.read_csv("final_district_dataset.csv")

compare_grad_rates.rename(columns = {"leaid11": "LEA ID", "ALL_COHORT_1112": "Cohort Size",

print "Table 7: Grad Rates for Schools with >200 Cohort Size (Our Data)"
print

print "Table 8: Grad Rates for Schools with 61-200 Cohort Size (Our Data)"
print

print "Table 9: Grad Rates for Schools with 31-60 Cohort Size (Our Data)"
print

print "Table 10: Grad Rates for Schools with 16-30 Cohort Size (Our Data)"
print

print "Table 11: Grad Rates for Schools with 6-15 Cohort Size (Our Data)"
print

Table 7: Grad Rates for Schools with >200 Cohort Size (Our Data)
0  100005          268                 83            83
1  100006          424                 79            79
2  100007         1042                 91            91

Table 8: Grad Rates for Schools with 61-200 Cohort Size (Our Data)
4  100011          117                 72         70-74
5  100012          155                 92         90-94
8  100060          109                 77         75-79

Table 9: Grad Rates for Schools with 31-60 Cohort Size (Our Data)
76   101980           49               74.5         70-79
81   102130           44               84.5         80-89
134  200004           31               34.5         30-39

Table 10: Grad Rates for Schools with 16-30 Cohort Size (Our Data)
135  200007           20               69.5         60-79
137  200030           16               69.5         60-79
139  200060           26               90.0          GE80

Table 11: Grad Rates for Schools with 6-15 Cohort Size (Our Data)
150  200300            9                 75          GE50
152  200450           15                 75          GE50
164  200680           11                 25          LT50



The above charts show how we cleaned the data. For ranges indicated by hyphens (such as tables 2, 3, and 4), we calculated the median values. Similarly, we also found the median value for ranges indicated by inequality operators We think this is a better representation of the graduation rates.

Note that we don't have a table for schools with cohort sizes of 1-5 since graduation rates were not provided for these schools.

#### Problem 2 with the graduation rate data: Low Variance¶

The second problem with the graduation rate data is that our variance is very low. The following code looks at the summary statistics of the dependent variable.

In [715]:
grad_rates.clean_grad_rate.describe()

Out[715]:
count    9802.000000
mean       85.571261
std        10.357597
min         2.500000
25%        80.250000
50%        88.000000
75%        92.000000
max        99.500000
Name: clean_grad_rate, dtype: float64

These statistics show that the 1st quartile of the graduation rates is 80%. This means that 75% of the grad rates are between 80% and 100%, indicating low variance in the dependent variable. In other words, the dependent variable is concentrated in a narrow range of values

So this means that even a difference of 1 percentage point will make a big difference. But since many of our dependent variables are given in ranges, we lose precision in the dependent variable, making our predictive models weaker.

But note that for schools with cohorts greater than 200 students, the graduation rate is given as a single, precise number, not as a range. If we we could limit our data to just schools with over 200 students in their cohort, we would get much more precision and be able to measure effects of our key variables with greater accuracy.

In order to see if we can legitimately limit our data to schools with cohorts >200 students, we need to look at three things:

1. What portion of schools do schools with >200 cohort size represent?
2. What portion of total students are in schools with >200 students in their cohort?
3. Are the graduation rates for schools with >200 cohort size systematically different from the typical graduation rates across schools?

We examine each of these questions in more detail.

##### What portion of schools have over 200 students in their cohort?¶

The following code show a graph which breaks down the distribution of schools by cohort sizes. The cohort size buckets that we use are the tiers that determine how precise the graduation rate is (>200, 61-200, 31-60, 16-30, and 6-15).

In [716]:
import matplotlib.pyplot as plt

variable_names = ["1-15",
"16-30",
"31-60",
"61-200",
">200"]

variable_values = [percent_schools_less_than_16_cohort, percent_schools_16_to_30_cohort,
percent_schools_31_to_60_cohort, percent_schools_61_to_200_cohort,
percent_schools_greater_than_200_cohort]

f, ax = plt.subplots(figsize=(12, 6))
sb.barplot(x = variable_names , y = variable_values)

ax.set(ylabel="Percent of Total Schools",
xlabel="Schools Cohort Size",
title="Percentage of Total Schools by Cohort Sizes")

rects = ax.patches
labels = ["{}%".format(round(value, 1)) for value in variable_values]

for rect, label in zip(rects, labels):
height = rect.get_height()
x = rect.get_x() + rect.get_width()/2
ax.annotate(label, xy = (rect.get_x(), rect.get_y()), xytext = (x, height + .5), ha = "center", va="bottom")



As we can see, schools with >200 cohort sizes represent 35% of all schools, second only to schools with cohort size ranging between 61-200, which is slightly higher. It's not a lot of schools, but it's not a little either.

##### How many students (as a proportion of the total number of students) are in schools with >200 students in their cohort?¶

Next, we look at the distribution of students across schools of different cohort sizes. Basically, we want to see how many students as a proportion of total students are in schoools with >200 cohort size, 61-200 cohort size, etc. The following graph shows this breakdown.

In [717]:
import matplotlib.pyplot as plt

variable_names = ["1-15",
"16-30",
"31-60",
"61-200",
">200"]

variable_values = [percent_of_students_in_schools_less_than_16_cohort, percent_of_students_in_schools_with_16_to_30_cohort,
percent_of_students_in_schools_with_31_to_60_cohort, percent_of_students_in_schools_with_61_to_200_cohort,
percent_of_students_in_schools_greater_than_200_cohort]

f, ax = plt.subplots(figsize=(12, 6))
sb.barplot(x = variable_names , y = variable_values)

ax.set(ylabel="Percent of Total Students",
xlabel="Schools of Cohort Size",
title = "Percentage of Total Students by Cohort Sizes")

rects = ax.patches
labels = ["{}%".format(round(value, 1)) for value in variable_values]

for rect, label in zip(rects, labels):
height = rect.get_height()
x = rect.get_x() + rect.get_width()/2
ax.annotate(label, xy = (rect.get_x(), rect.get_y()), xytext = (x, height + 1), ha = "center")



This graph is telling us good news. It's saying that 85% of all students are in schools with >200 cohort size. This strengthens our case to use only data for schools with >200 students in their cohort. Now, for the last question...

##### Are graduation rates systematically different for schools with cohorts >200?¶

Basically, we want to know if there is something about larger schools that makes them different from other schools. If so, we would introduce bias into our analysis if we limited our data to just schools with >200 students in their cohort.

To check this, we compared baseline statistics for graduation rates for a) all schools, b) schools with less than or equal to 200 students in the cohort, and c) schools with greater than 200 students in their cohort.

In [718]:
print "Table 12: Summary statistics for graduation rates for all schools"

print

print "Table 13: Summary statistics for graduation rates for schools with <=200 cohort size"

print

print "Table 14: Summary statistics for graduation rates for schools with >200 cohort size"

Table 12: Summary statistics for graduation rates for all schools
count    9802.000000
mean       85.571261
std        10.357597
min         2.500000
25%        80.250000
50%        88.000000
75%        92.000000
max        99.500000

Table 13: Summary statistics for graduation rates for schools with <=200 cohort size
count    6362.000000
mean       86.206303
std        10.310099
min         2.500000
25%        82.000000
50%        90.000000
75%        92.000000
max        97.500000

Table 14: Summary statistics for graduation rates for schools with >200 cohort size

Out[718]:
count    3440.000000
mean       84.396802
std        10.344128
min        18.000000
25%        79.000000
50%        86.000000
75%        92.000000
max        99.500000
Name: clean_grad_rate, dtype: float64

As we can see, the graduation rates for all schools and graduation rates for schools with cohort size <=200 are very similar to the graduation rates for schools with >200 cohort sizes. The means and standard deviations are about the same. Again, this strengthens the case for limiting our data to schools with >200 cohort sizes.

##### So what should we do?¶

Schools with >60 cohort sizes represent 70% of all schools and 96% of all students (these numbers come from the graphs we made). But for all schools with 61-200 cohort sizes, the graduation rate data is given in ranges, which reduces our precision and thus the accuracy of effects we estimate for our independent variables.

On the other hand, schools with >200 cohort sizes represent 35% of all schools and 84% of all students. They don't represent as much data as the >60 subset, but all the graduation rates are precise, single numbers, which improves our precision and accuracy of estimates.

So, there is a trade-off between how much data we use in our analysis, and how precise that data is. If we use schools >200, we have less data but more precise data (and thus more accurate estimates). If we use schools >60, we have more data, but less precise data.

Since >200 cohort schools are not systematically different from other schools, we chose to build a model only for schools with cohorts >200. However, to make sure that this model is generalizable to all schools, we compared it to a model built on >60 cohort schools to make sure they seem generally similar. As we will see, we found that they are generally the same.

# Part III: Exploring the Relationships between independent and dependent variables¶

Next, we move on to exploring the relationship between our key indepdentent variables and graduation rates. The following dynamic chart shows the correlations between our variables of interest and graduation rates. Feel free to play around with it!

Note that we also offer static visualizations below this chart that also show a line of best fit. Feel free to skip ahead to those graphs.

In [719]:
import charts

series = [{
"name": "Married Households (% of total HHs)",
}, {
"name": "Female Householder Only (no husband) Households (% of total HHs)",
}, {
"name": "Renter Occupied Households  (% of total households)",
},{
"name": "Households where Householder Moved in During Past Year  (% of total households)",
}, {
"name": "People who had a different household 1 year ago  (% of people >1 year old)",
}, {
}, {
"name": "People Below Poverty level  (% of total people in district)",
}, {
"name": "Weighted Median Neighborhood Income ($)", "data": zip(grad_rates_greater_than_200_cohort.weighted_Med_HHD_Inc_ACS_08_12, grad_rates_greater_than_200_cohort.clean_grad_rate) }, { "name": "Low Food Access (% of total overlapping tracts that have low food access))", "data": zip(grad_rates_greater_than_200_cohort.LAhalfand10, grad_rates_greater_than_200_cohort.clean_grad_rate) }, { "name": "Average Daily Maximum Temperature (Farenheit)", "data": zip(grad_rates_greater_than_200_cohort.avg_of_daily_max, grad_rates_greater_than_200_cohort.clean_grad_rate) }, { "name": "Average Daily Minimum Temperature (Farenheit)", "data": zip(grad_rates_greater_than_200_cohort.avg_of_daily_min, grad_rates_greater_than_200_cohort.clean_grad_rate) }, { "name": "Instruction Spending on Salaries and Wages, Per Pupil ($)",
}, {
"name": "Instruction Spending on Benefits, Per Pupil ($)", "data": zip(grad_rates_greater_than_200_cohort.instruction_benefits_per_pupil, grad_rates_greater_than_200_cohort.clean_grad_rate) }, { "name": "Support Spending on Pupil Support, Per Pupil ($)",
}, {
"name": "Support Spending on Instructional Staff Support, Per Pupil ($)", "data": zip(grad_rates_greater_than_200_cohort.support_instructional_staff_per_pupil, grad_rates_greater_than_200_cohort.clean_grad_rate) }, { "name": "Support Spending on General Administration, Per Pupil ($)",
}, {
"name": "Support Spending on School Administration, Per Pupil ($)", "data": zip(grad_rates_greater_than_200_cohort.support_school_admin_per_pupil, grad_rates_greater_than_200_cohort.clean_grad_rate) }, { "name": "Support Spending on School Administration, Per Pupil ($)",
}
]

options = dict(height=600,