Guest post by Kristin E. Porter, Tejomay Gadgil, Sara Schell, Megan McCormick and Richard Hendra, MDRC.

## Predictive analytics at MDRC

For more than 40 years, MDRC, a nonprofit, nonpartisan education and social policy research organization, has been a leader in pioneering rigorous methods in social science research and in sharing what we have learned with the field. In this blog post, we describe how MDRC’s rigorous approach to methodology and data processing is reflected in our approach to predictive analytics, which we believe led to our first-place performance in the two Fragile Families Challenge domains where we submitted models.

MDRC works with a wide variety of government agencies, nonprofits, and other social service providers to help them harness their data to better understand patterns of behavior, figure out what works, better manage caseload dynamics, and better target individuals for interventions. In particular, we are using predictive analytics to identify individuals’ likelihoods of achieving key outcomes, such as reaching a program participation milestone, finding employment, or reading at a proficient level.

MDRC researchers have developed a comprehensive predictive analytics framework that allows for rapid and iterative estimation of likelihoods (probabilities between 0 and 1) of adverse or positive outcomes. The framework includes analytic steps focused on (1) identifying the best samples for training statistical models and computing predictions; (2) processing and cleaning data; (3) creating and curating measures to include in modeling; (4) identifying the best modeling methods with an emphasis on ensembling; (5) estimating uncertainty in predictions; and (6) summarizing and interpreting results.

## MDRC’s approach to the Fragile Families Challenge

MDRC applied several analytic steps in our predictive analytics framework to the Fragile Families Challenge (FFC) — those focused on data processing, creating and curating measures, and modeling methods. (The other steps simply did not apply given the nature of the challenge.) The following describes the underlying premises that guided our analyses:

**1. Invest deeply in measure creation, combining both substantive knowledge and automated approaches.**

At MDRC, about 90 percent of the effort in any predictive analysis is dedicated to creating measures that extract as much predictive information as possible from the raw data. Doing so requires both subject matter expertise and familiarity with the data collection processes and context. It also involves recognizing opportunities for encoding information that may seem irrelevant or ancillary.

Extracting information can involve creating new, aggregate measures that summarize across multiple raw measures. Luckily, the FFC data already includes many valuable “constructed variables” that summarize raw survey responses (for instance, a measure of whether a mother meets depression criteria was constructed from multiple individual questions). There were other opportunities to create more aggregate measures as well. However, doing so can be very time-consuming when the number of raw measures is large. Relying on subject matter knowledge to prioritize which aggregate measures will likely be most predictive is key.

Extracting information can also involve collapsing the categories of a single measure. For example, measures from survey questions asking “Household member’s relationship to you” have 18 possible non-missing values (spouse, partner, respondent’s mother, etc.). These 18 values can be grouped into types of relationships that are meaningful when it comes to predicting a particular outcome. Subject matter knowledge about the population and the outcome of interest can be helpful in determining the best groupings (for instance, does it matter whether the household member is an adult, or does the particular kind of relationship matter?). However, automated algorithms are also an essential tool. Such algorithms can mine text in the responses, do clustering, and/or check the distributions of response choices to inform grouping selections. We have developed functions that process hundreds of variables with similar structures and transform them with just a few lines of code. Combining these approaches with subject area judgement can produce powerful results.
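As a rough illustration of the two routes to collapsing categories, the sketch below pairs a hand-written grouping (the subject matter route) with a simple frequency rule (one possible automated route). The relationship labels, the group names, and both helper functions are hypothetical; the post does not give the actual FFC codes or the groupings MDRC chose.

```python
from collections import Counter

# Hypothetical grouping of relationship labels, for illustration only.
RELATIONSHIP_GROUPS = {
    "spouse": "partner",
    "partner": "partner",
    "respondent's mother": "parent",
    "respondent's father": "parent",
    "child": "child",
    "sibling": "other_relative",
}


def collapse_category(value, groups, default="other"):
    """Map a raw category to its group; unlisted values fall into `default`."""
    return groups.get(value, default)


def collapse_rare(values, min_count):
    """Keep categories seen at least `min_count` times; pool the rest as 'other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "other" for v in values]


raw = ["spouse", "partner", "respondent's mother", "roommate"]
grouped = [collapse_category(v, RELATIONSHIP_GROUPS) for v in raw]
```

Because both helpers take the grouping or threshold as an argument, the same few lines could in principle be applied to many similarly structured variables, which is the spirit of the automation described above.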

**2. “Missingness” is informative and should not be “imputed away.”**

In the FFC, we did no imputation of missing values, and we did not delete observations with missing values. In the case of predictive analytics, MDRC views missing values as containing predictive information. That is, the missingness may be for unmeasured reasons that correlate with the outcome of interest. Imputation would overwrite this information, often with inaccurate information, as even the most sophisticated techniques rest on unverifiable assumptions.

Therefore, we coded all measures in the FFC data into a series of dummy variables. Each dummy corresponds to a response or grouping of responses, including those related to missingness. For example, for one measure in the mother questionnaire (“have a legal agreement or child support order”), we created three dummies that capture underlying reasons for missingness, as well as a dummy that captured the nonmissing response. We note here that by combining missingness codes, we are assuming that the combined types of missingness have similar predictive value.
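A minimal sketch of this kind of dummy coding follows. It assumes that missingness is recorded with negative integer codes; the specific codes, the three reason groups, and the function names are illustrative assumptions, not the coding MDRC actually used.

```python
# Hypothetical mapping from negative missingness codes to three reasons.
MISSING_GROUPS = {
    -9: "not_in_wave",
    -7: "skipped",
    -6: "skipped",
    -3: "nonresponse",
    -2: "nonresponse",
    -1: "nonresponse",
}


def level_for(value, missing_groups):
    """Return the level name for one coded response."""
    if value in missing_groups:
        return missing_groups[value]
    if value < 0:
        # Any negative code not listed above is pooled (an assumption).
        return "missing_other"
    return f"value_{value}"


def to_dummies(name, values, missing_groups):
    """Expand one coded measure into 0/1 dummies, one per observed level.

    No value is imputed and no row is dropped: every observation gets
    exactly one dummy set to 1, including rows with missing responses.
    """
    levels = [level_for(v, missing_groups) for v in values]
    return {
        f"{name}__{level}": [int(lv == level) for lv in levels]
        for level in sorted(set(levels))
    }


dummies = to_dummies("child_support_order", [1, -9, -2, 1, -6], MISSING_GROUPS)
```

Pooling several codes into one reason, as `MISSING_GROUPS` does, is exactly where the assumption about similar predictive value enters.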

**3. Eliminate unhelpful measures.**

Because the number of measures available in the FFC data is large and because the coding of the survey responses was consistent across the measures, it made a lot of sense to automate the dummy creation described above. This multiplied the already large number of provided measures manifold. Not all of the resulting dummies held useful information. Therefore, we approached measure reduction as follows:

- We only used measures from the mother, father and primary caregiver questionnaires, as these seemed to contain information relevant to the outcomes on which we were focusing (job training and eviction). When the same question was asked of all three, we only used the response from the questionnaire that corresponded to the primary caregiver at the age 9 follow-up (based on pcg5idstat). In doing this, we assumed the primary caregiver at the age 15 follow-up would be the same as the primary caregiver at the age 9 follow-up. If pcg5idstat was missing, we assumed the mother was the primary caregiver at the age 9 follow-up (as this was the case for 91 percent of the nonmissing responses). We included measures for the same primary caregiver in all previous waves.
- Due to automation of dummy creation, we often ended up with dummies with only a very small number of 1’s or a very small number of 0’s. These measures held little useful information and we dropped them based on a custom filter.
- We also ended up with many highly correlated dummy variables. We dropped all but one of each set of measures with a correlation greater than 0.9.
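The last two filters in the list above can be sketched as follows. The post does not describe the custom filter, so the `min_count` threshold is a placeholder. Using the absolute value of the correlation and keeping the first dummy encountered in each correlated set are also assumptions made for this example.

```python
import math


def is_near_constant(column, min_count):
    """True if the dummy has fewer than `min_count` 1's or 0's."""
    ones = sum(column)
    return min(ones, len(column) - ones) < min_count


def pearson(x, y):
    """Pearson correlation of two equal-length columns (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def filter_dummies(columns, min_count=5, max_corr=0.9):
    """Drop near-constant dummies, then greedily drop correlated ones.

    A dummy is kept only if its absolute correlation with every dummy
    already kept is at most `max_corr`.
    """
    kept = {}
    for name, col in columns.items():
        if is_near_constant(col, min_count):
            continue
        if any(abs(pearson(col, other)) > max_corr for other in kept.values()):
            continue
        kept[name] = col
    return kept
```

With thousands of dummies, the pairwise comparison would be slow in plain Python; a matrix library would be the practical choice, but the logic would be the same.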

**4. Evaluate ‘learners’ based on out-of-sample performance.**

In MDRC’s predictive analytics framework, we define a “learner” as some combination of (1) a set of predictors, (2) a modeling method or machine learning algorithm, and (3) any tuning parameters for the corresponding machine learning algorithm. For example, one learner might be defined as the Random Forest algorithm using all of our dummy variables, with the tuning parameter for the number of measures to select at each split set to 2.

We want to evaluate the performance of each learner based on how well it does when making predictions in new data — data not used for training or fitting the algorithm or model. Therefore, we use v-fold cross validation to mimic repeatedly fitting a model for a particular learner in one sample and then evaluating it in a different sample. For the FFC, we used 5-fold cross-validation. That is, we partitioned the training data into 5 folds (subsamples). We fit all learners in all but one of the folds. In the left-out “validation” fold, we computed predictions with each trained learner and computed the performance of each learner based on those predictions. The performance measure in the case of the FFC was Brier loss. We repeated the whole process 5 times such that each fold took a turn as the validation fold. The averages of the Brier loss scores were computed across all validation folds. (The entire process can be repeated multiple times in order to reduce the variance of the cross-validated estimates.)
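The procedure described above can be sketched in a few lines of dependency-free Python. The Brier loss here is the mean squared difference between predicted probabilities and the observed 0/1 outcomes. The function names, the `fit` interface, and the base-rate toy learner are assumptions made for this example, not MDRC’s code.

```python
import random


def brier_loss(y_true, p_pred):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def make_folds(n, v, seed=0):
    """Randomly partition row indices 0..n-1 into v folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::v] for i in range(v)]


def cross_validate(fit, X, y, v=5, seed=0):
    """Average validation-fold Brier loss for one learner.

    `fit(X_train, y_train)` must return a function that maps a list of
    rows to a list of predicted probabilities.
    """
    losses = []
    for fold in make_folds(len(y), v, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict([X[i] for i in fold])
        losses.append(brier_loss([y[i] for i in fold], preds))
    return sum(losses) / len(losses)


def fit_base_rate(X_train, y_train):
    """Toy learner (hypothetical): always predict the training base rate."""
    rate = sum(y_train) / len(y_train)
    return lambda rows: [rate] * len(rows)


X = [[0]] * 20
y = [0, 1] * 10
score = cross_validate(fit_base_rate, X, y, v=5)
```

Calling `cross_validate` with several different seeds and averaging the scores corresponds to repeating the entire process to reduce the variance of the estimate.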

For any given prediction problem, we cannot know which learner will perform best. Therefore, we define many learners. For the FFC, we ultimately defined only one set of predictors (which was all dummies we created), but we tried many machine learning algorithms designed for binary outcomes, and for many of the machine learning algorithms, we specified many combinations of tuning parameters.
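One way to picture “many learners” is as the cross product of predictor sets, algorithms, and tuning parameter values. The sketch below enumerates such a list; the algorithm names, parameter names, and grid values are hypothetical placeholders, since the post does not list the ones actually tried.

```python
from itertools import product

# Hypothetical algorithms and tuning grids, for illustration only.
TUNING_GRIDS = {
    "random_forest": {"mtry": [2, 4, 8], "n_trees": [500]},
    "gbm": {"learning_rate": [0.01, 0.1], "depth": [1, 3]},
    "logistic_lasso": {"penalty": [0.001, 0.01, 0.1]},
}


def enumerate_learners(predictor_sets, tuning_grids):
    """List every (predictor set, algorithm, tuning parameters) combination."""
    learners = []
    for pset, (algorithm, grid) in product(predictor_sets, tuning_grids.items()):
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            learners.append(
                {
                    "predictors": pset,
                    "algorithm": algorithm,
                    "params": dict(zip(names, values)),
                }
            )
    return learners


# A single predictor set, as in the FFC analysis described above.
learners = enumerate_learners(["all_dummies"], TUNING_GRIDS)
```

Each entry in `learners` would then be fit and scored with cross-validation, so that all candidates are compared on the same out-of-sample footing.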

**5. Combine results from different learners with ensemble learning.**

For our final model, we can select the learner with the best out-of-sample performance, that is, the one with the lowest cross-validated Brier loss. Alternatively, we can combine multiple learners to achieve better performance than any single learner could. This is referred to as ensemble learning. Many of the algorithms commonly used in predictive analytics, such as Random Forest and Gradient Boosting Machine algorithms, are examples of ensemble learning. However, we can also ensemble across these and other algorithms or learners (in our case, combinations of algorithms and tuning parameter specifications). There are multiple approaches to ensemble learning. Perhaps the most common approach is stacking, or Super Learning (van der Laan, Polley and Hubbard, 2007).^{1} Our implementation of stacking produced an error at the last minute, so our FFC submission relied on predictions from the best-performing learner. However, ensemble learning has the potential to further improve our results.
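To give a feel for stacking, the sketch below blends learners’ cross-validated predictions with convex weights chosen to minimize Brier loss. It is a deliberately simplified stand-in: a coarse grid search over weights replaces the meta-learner fit used in Super Learning, and it is not the implementation referred to above. The function names and the toy predictions are made up for this example.

```python
from itertools import product


def brier_loss(y_true, p_pred):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def convex_weight_grid(k, steps):
    """Yield weight vectors of length k with entries j/steps that sum to 1."""
    for combo in product(range(steps + 1), repeat=k):
        if sum(combo) == steps:
            yield [c / steps for c in combo]


def stack(cv_preds, y, steps=10):
    """Choose convex weights over learners that minimize blended Brier loss.

    `cv_preds` maps learner name -> cross-validated predictions, i.e. each
    prediction comes from a model that did not see that row in training.
    """
    names = sorted(cv_preds)
    best_weights, best_loss = None, float("inf")
    for weights in convex_weight_grid(len(names), steps):
        blended = [
            sum(w * cv_preds[name][i] for w, name in zip(weights, names))
            for i in range(len(y))
        ]
        loss = brier_loss(y, blended)
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return dict(zip(names, best_weights)), best_loss


# Toy cross-validated predictions from two hypothetical learners.
y = [1, 0, 1, 0]
cv_preds = {"A": [0.9, 0.4, 0.6, 0.1], "B": [0.6, 0.1, 0.9, 0.4]}
weights, stacked_loss = stack(cv_preds, y)
```

Because the grid includes the weight vectors that put all weight on one learner, the blended loss on these predictions can never be worse than that of the best single learner. Selecting the best learner is the special case where one weight is 1.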

## More about MDRC

MDRC is committed to finding solutions to some of the most difficult problems facing the nation, from reducing poverty and bolstering economic self-sufficiency to improving public education and college graduation rates. We design promising new interventions, evaluate existing programs using the highest research standards, and provide technical assistance to build better programs and deliver effective interventions at scale. We work as an intermediary, bringing together public and private funders to test new policy-relevant ideas, and communicate what we learn to policymakers and practitioners, all with the goal of improving the lives of low-income individuals, families, and children.

For more about predictive analytics at MDRC, check out:

- MDRC’s Approach To Using Predictive Analytics To Improve and Target Social Services Based on Risk
- Pairing Predictive Analytics with Implementation Research
- What Can Schools, Colleges and Youth Programs Do with Predictive Analytics
- Predictive Modeling of K-12 Academic Outcomes

^{1}van der Laan, M., Polley, E. & Hubbard, A. (2007). Super Learner. *Statistical Applications in Genetics and Molecular Biology*, 6(1). Retrieved 8 Nov. 2017, from doi:10.2202/1544-6115.1309