Our Blog

MDRC’s Approach to the Fragile Families Challenge

Guest post by Kristin E. Porter, Tejomay Gadgil, Sara Schell, Megan McCormick and Richard Hendra, MDRC.

Predictive analytics at MDRC

For more than 40 years, MDRC, a nonprofit, nonpartisan education and social policy research organization, has been a leader in pioneering the most rigorous research methods in social science research and in sharing what we have learned with the field. In this blog post, we describe how MDRC’s rigorous approach to methodology and data processing is reflected in our approach to predictive analytics, which we believe led to our first place performance in the two Fragile Families Challenge domains where we submitted models.

MDRC works with a wide variety of government agencies, nonprofits, and other social service providers to help them harness their data to better understand patterns of behavior, figure out what works, better manage caseload dynamics, and better target individuals for interventions. In particular, we are using predictive analytics to identify individuals’ likelihoods of achieving key outcomes, such as reaching a program participation milestone, finding employment, or reading at a proficient level.

MDRC researchers have developed a comprehensive predictive analytics framework that allows for rapid and iterative estimation of likelihoods (probabilities between 0 and 1) of adverse or positive outcomes. The framework includes analytic steps focused on (1) identifying the best samples for training statistical models and computing predictions; (2) processing and cleaning data; (3) creating and curating measures to include in modeling; (4) identifying the best modeling methods with an emphasis on ensembling; (5) estimating uncertainty in predictions; and (6) summarizing and interpreting results.

MDRC’s approach to the Fragile Families Challenge

MDRC applied several analytic steps in our predictive analytics framework to the Fragile Families Challenge (FFC) — those focused on data processing, creating and curating measures, and modeling methods. (The other steps simply did not apply given the nature of the challenge.) The following describes the underlying premises that guided our analyses:

1. Invest deeply in measure creation — combining both substantive knowledge and automated approaches.

At MDRC, about 90 percent of the effort in any predictive analysis is dedicated to creating measures that extract as much predictive information as possible from the raw data. Doing so requires both subject matter expertise and familiarity with the data collection processes and context. It also involves recognizing opportunities for encoding information that may seem irrelevant or ancillary.

Extracting information can involve creating new, aggregate measures that summarize across multiple raw measures. Luckily, the FFC data already includes many valuable “constructed variables” that summarize raw survey responses (for instance, a measure of whether a mother meets depression criteria was constructed from multiple individual questions). There were other opportunities to create more aggregate measures as well. However, doing so can be very time-consuming when the number of raw measures is large. Relying on subject matter knowledge to prioritize which aggregate measures will likely be most predictive is key.

Extracting information can also involve collapsing the categories of a single measure. For example, the measure from the survey question asking “Household member’s relationship to you” has 18 possible non-missing values (spouse, partner, respondent’s mother, etc.). These 18 values can be grouped into types of relationships that are meaningful when it comes to predicting a particular outcome. Subject matter knowledge about the population and the outcome of interest can be helpful in determining the best groupings (for instance, does it matter whether the household member is an adult, or does the particular kind of relationship matter?). However, automated algorithms are also an essential tool. Such algorithms can mine text in the responses, do clustering, and/or check the distributions of response choices to inform grouping selections. We have developed functions that process hundreds of variables with similar structures and transform them with just a few lines of code. Combining these approaches with subject area judgment can produce powerful results.
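
As an illustration of this kind of automated recoding, the sketch below collapses raw relationship codes into coarser groups across many similarly structured columns at once. The codes, groupings, and column names here are hypothetical, not the actual FFC variables:

```python
import pandas as pd

# Hypothetical grouping: raw response code -> coarse category
RELATIONSHIP_GROUPS = {
    1: "partner",      # spouse
    2: "partner",      # unmarried partner
    3: "parent",       # respondent's mother
    4: "parent",       # respondent's father
    5: "other_adult",
    6: "child",
}

def collapse_categories(df, columns, mapping, default="other"):
    """Map raw response codes to coarser groups for every column in `columns`."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].map(mapping).fillna(default)
    return out

df = pd.DataFrame({"hh_rel_1": [1, 3, 6, 99], "hh_rel_2": [2, 5, 4, 1]})
collapsed = collapse_categories(df, ["hh_rel_1", "hh_rel_2"], RELATIONSHIP_GROUPS)
```

One helper like this can recode hundreds of structurally similar variables, with the mapping dictionary carrying the subject matter judgment.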

2. “Missingness” is informative and should not be “imputed away.”

In the FFC, we did no imputation of missing values, and we did not delete observations with missing values. In the case of predictive analytics, MDRC views missing values as containing predictive information. That is, the missingness may be for unmeasured reasons that correlate with the outcome of interest. Imputation would overwrite this information, often with inaccurate information, as even the most sophisticated techniques rest on unverifiable assumptions.

Therefore, we coded all measures in the FFC data into a series of dummy variables. Each dummy corresponds to a response or grouping of responses, including those related to missingness. For example, for one measure in the mother questionnaire – “have a legal agreement or child support order” – we created three dummies that capture underlying reasons for missingness, as well as a dummy that captured the nonmissing response. We note here that by combining missingness codes, we are making assumptions that different types of missingness have similar predictive value.
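
A minimal sketch of this dummy-coding approach might look like the following; the response codes and labels here are hypothetical, not the actual FFC coding:

```python
import pandas as pd

# Hypothetical codes: negative values mark reasons for missingness.
raw = pd.Series([1, -1, -2, 1, -9, 0], name="m_child_support")

# Label every distinct code, missing or not, before dummy creation,
# so that missingness becomes its own set of indicators rather than
# being imputed away.
grouped = raw.replace({-1: "refused", -2: "dont_know", -9: "not_in_wave",
                       1: "yes", 0: "no"})
dummies = pd.get_dummies(grouped, prefix="m_child_support")
```

Each row ends up with exactly one indicator set to 1, and reasons for missingness carry their own predictive signal into the model.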

3. Eliminate unhelpful measures.

Because the number of measures available in the FFC data is large and because the coding of the survey responses was consistent across the measures, it made a lot of sense to automate the dummy creation described above. This multiplied the already large number of provided measures many times over. Not all of the resulting dummies held useful information. Therefore, we approached measure reduction as follows:

  • We only used measures from the mother, father, and primary caregiver questionnaires, as these seemed to contain information relevant to the outcomes on which we were focusing (job training and eviction). When the same question was asked of all three, we only used the response from the questionnaire that corresponded to the primary caregiver at the age 9 follow-up (based on pcg5idstat). In doing this, we assumed the primary caregiver at the age 15 follow-up would be the same as the primary caregiver at the age 9 follow-up. If pcg5idstat was missing, we assumed the mother was the primary caregiver at the age 9 follow-up (as this was the case for 91 percent of the nonmissing responses). We included measures for the same primary caregiver in all previous waves.
  • Because dummy creation was automated, we often ended up with dummies with only a very small number of 1s or a very small number of 0s. These measures held little useful information, and we dropped them based on a custom filter.
  • We also ended up with many highly correlated dummy variables. We dropped all but one of any set of measures with pairwise correlations greater than 0.9.
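
The two filters above could be sketched as follows; the minimum count and the 0.9 threshold mirror the description, but the helper function itself is illustrative, not MDRC’s actual code:

```python
import numpy as np
import pandas as pd

def drop_rare_and_correlated(X, min_count=10, max_corr=0.9):
    """Drop dummies with too few 1s or 0s, then drop all but one
    of each highly correlated set."""
    counts = X.sum()
    n = len(X)
    # Filter 1: near-constant dummies carry almost no information.
    keep = X.loc[:, (counts >= min_count) & (counts <= n - min_count)]
    # Filter 2: for each highly correlated pair, keep only the first.
    corr = keep.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > max_corr).any()]
    return keep.drop(columns=to_drop)

X = pd.DataFrame({
    "a": [1, 0, 1, 0, 1, 0, 1, 0],
    "b": [1, 0, 1, 0, 1, 0, 1, 0],   # duplicate of "a" -> dropped
    "c": [0, 0, 0, 0, 0, 0, 0, 1],   # only one 1 -> dropped as rare
    "d": [1, 1, 0, 0, 1, 1, 0, 0],
})
filtered = drop_rare_and_correlated(X, min_count=2, max_corr=0.9)
```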

4. Evaluate ‘learners’ based on out-of-sample performance.

In MDRC’s predictive analytics framework, we define a “learner” as some combination of (1) a set of predictors, (2) a modeling method or machine learning algorithm, and (3) any tuning parameters for the corresponding machine learning algorithm. For example, one learner might be defined as the Random Forest algorithm using all of our dummy variables, with the tuning parameter for the number of measures to consider at each split set to 2.

We want to evaluate the performance of each learner based on how well it does when making predictions in new data — data not used for training or fitting the algorithm or model. Therefore, we use v-fold cross-validation to mimic repeatedly fitting a model for a particular learner in one sample and then evaluating it in a different sample. For the FFC, we used 5-fold cross-validation. That is, we partitioned the training data into 5 folds (subsamples). We fit all learners in all but one of the folds. In the left-out “validation” fold, we computed predictions with each trained learner and computed the performance of each learner based on those predictions. The performance measure in the case of the FFC was Brier loss. We repeated the whole process 5 times such that each fold took a turn as the validation fold. The averages of the Brier loss scores were then computed across all validation folds. (The entire process can be repeated multiple times in order to reduce the variance of the cross-validated estimates.)
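
A compact version of this evaluation loop, using scikit-learn with two illustrative learners (not MDRC’s actual learner set), might look like the sketch below. For brevity it computes one Brier loss over all out-of-fold predictions rather than averaging per-fold losses, which is close in spirit:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import cross_val_predict

# Stand-in data; the FFC predictors would be the dummy matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Each "learner" = predictors + algorithm + tuning parameters.
learners = {
    "rf_depth3": RandomForestClassifier(max_depth=3, random_state=0),
    "logit": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in learners.items():
    # Out-of-fold predicted probabilities from 5-fold cross-validation.
    pred = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    scores[name] = brier_score_loss(y, pred)
```

The learner with the smallest cross-validated Brier loss would be the candidate final model.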

For any given prediction problem, we cannot know which learner will perform best. Therefore, we define many learners. For the FFC, we ultimately defined only one set of predictors (which was all dummies we created), but we tried many machine learning algorithms designed for binary outcomes, and for many of the machine learning algorithms, we specified many combinations of tuning parameters.

5. Combine results from different learners with ensemble learning.

For our final model, we can select the learner with the best out-of-sample performance – the one with the lowest cross-validated Brier loss. Alternatively, we can combine multiple learners in order to achieve better performance than any single learner could achieve on its own. This is referred to as ensemble learning. Many of the algorithms commonly used in predictive analytics, such as the Random Forest and Gradient Boosting Machine algorithms, are themselves examples of ensemble learning. However, we can also ensemble across these and other algorithms or learners (in our case, combinations of algorithms and tuning parameter specifications). There are multiple approaches to ensemble learning. Perhaps the most common approach is stacking, or Super Learning (van der Laan, Polley and Hubbard, 2007).1 Our implementation of stacking produced an error at the last minute, so our FFC submission relied on predictions from the best-performing learner. However, ensemble learning has the potential to further improve our results.
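
For readers curious what stacking looks like in code, here is a hedged sketch using scikit-learn’s StackingClassifier with illustrative base learners; this is not the implementation described in the post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Stacking: each base learner's cross-validated predictions become
# inputs to a meta-learner that learns how to weight them.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gbm", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
proba = stack.predict_proba(X)[:, 1]
```

The meta-learner here is a logistic regression, so the ensemble’s predictions are a learned combination of the base learners’ out-of-fold probabilities.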

More about MDRC

MDRC is committed to finding solutions to some of the most difficult problems facing the nation — from reducing poverty and bolstering economic self-sufficiency to improving public education and college graduation rates. We design promising new interventions, evaluate existing programs using the highest research standards, and provide technical assistance to build better programs and deliver effective interventions at scale. We work as an intermediary, bringing together public and private funders to test new policy-relevant ideas, and communicate what we learn to policymakers and practitioners — all with the goal of improving the lives of low-income individuals, families, and children.

For more about predictive analytics at MDRC, check out:

1van der Laan, M., Polley, E. & Hubbard, A. (2007). Super Learner. Statistical Applications in Genetics and Molecular Biology, 6(1). Retrieved 8 Nov. 2017, from doi:10.2202/1544-6115.1309

Submission Description by Brian J. Goode – Imputing Values and Feature Reasoning

This guest blog post is written by Brian J. Goode, Discovery Analytics Center, Virginia Tech. The author was a winner of an Innovation Award.

Overview

One of the primary challenges of the Fragile Families Challenge (FFC) was to create a robust submission that is able to handle missing data. Of the nearly 44 million data points in the feature set, 55% of the values were either null, missing, or otherwise marked as incomplete. Discarding these data would mean a substantial loss of information and can potentially skew the results if there is any systematic reason why the nulls appear in the rows that they do. Imputing missing values preserves the information that was present, but introduces specific assumptions about the imputed values that may not always be verifiable. To the degree possible, the submission titled ‘bjgoode’ made use of the survey questionnaire to establish imputation rules based on the survey structure and familial proximity. As a result, the number of missing values decreased to 38% of the data set. The remaining missing data were filled in with the most frequent values. The implementation is straightforward, but tedious. The procedure is described below and resources are given at the bottom of this article. Results are given by the Fragile Families Challenge, but much work still needs to be done to evaluate the efficacy of the approach.

Procedure

There are four different approaches taken to impute values as part of this submission:

Figure 1. The various pathways for filling in missing data are shown in this diagram. The order of imputing values begins within each survey. Then Cross M-F imputing is completed. Finally, Cross year substitutions are made. The procedure reduced the number of missing values from 55% to 38% of the entire dataset.

1. Within Survey.

Some surveys, such as the mother/father baseline survey, have multiple pathways that an interview can follow depending on specific circumstances.
For example, there are whole blocks of questions that will be answered or not depending on whether the parents are romantically involved, partially romantically involved, or married. This means that, by survey design, we can deduce that some questions are meant to have null values due to the pathway that was taken. For questions that are specific to the circumstance, there is little we can do. However, there were a number of repeated questions within these surveys that can be cross-linked. An example of this from the mother baseline survey (B5, B11, B22) is:

I’m going to read you some things that couples often do together. Tell me which ones you and [BABY’S FATHER] did during the last month you were together.

These questions were identified by text matching. When a value appeared in one path, it was transferred to the same question in the missing pathway. This reduced the number of missing values in the feature data from 55% to 51% missing.

2. Cross M-F.

One other reason for having missing values is that only one parent is actively involved in the study. For these cases, there is likely to be only one survey out of the mother, father, or primary caregiver surveys for a given survey wave.
In this case, we can impute data by finding related questions in each of these surveys within each wave. The value was transferred to each matching question in the other surveys. The result is that some survey questions, instead of being answered strictly by the mother, father, or primary caregiver, are answered from the more general prototype of a ‘supporting adult’ when viewed in isolation. If the data were used to model a specific parent, the non-trivial assumption is that the parents would have answered similarly. During each wave, the format and structure of the mother/father surveys were very similar within each section. Because of this, it was faster and more accurate to do the mapping by hand by specifying question ranges. The mappings are provided in the Github repository listed in the Resources section. This procedure was performed twice. The first iteration reduced the output of the within-survey mapping from 51% to 45% missing values. The second iteration reduced the output of the cross-year mapping from 39% to 38% missing values.

3. Cross Year (wave).

Missing values also appeared to be more common during the later waves of the survey. This is not surprising given that it is a longitudinal survey and there are expected to be dropouts. To fill in these values, an assumption was made that it is more likely that a given mother/father survey response will remain persistent across time than not. As a cautionary note, this is a very strong assumption, especially for questions that have a smaller time scale. To avoid probing into specific questions, and assessing whether or not to use the latest known value or some other method for imputing, the mean value across available years was taken (note: all answers to survey questions were encoded with ordinal values).
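
The cross-year mean substitution could be sketched like this; the column names and ordinal values are hypothetical:

```python
import pandas as pd

# One question observed across three waves (ordinal encoding).
df = pd.DataFrame({
    "q1_wave1": [2, 3, 1],
    "q1_wave2": [2, None, 3],
    "q1_wave3": [None, None, 3],
})
wave_cols = ["q1_wave1", "q1_wave2", "q1_wave3"]

# Fill each missing wave with the mean of that row's observed waves,
# reflecting the persistence assumption described above.
row_mean = df[wave_cols].mean(axis=1)
filled = df[wave_cols].apply(lambda col: col.fillna(row_mean))
```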

The challenge here was to identify related questions across years, because the same question was noted to be worded in multiple different ways over multiple waves. This type of matching problem is too cumbersome as a one-off instance to train an algorithm on, yet also too tedious and error-prone for a human to match by hand. The solution was to create a simple algorithm based on NLTK (the Natural Language Toolkit) in Python to identify similar questions by text. Having too few samples to properly train, a simple threshold was used to cluster questions into groups of related categories. However, thresholds can err in both directions, being too conservative or too liberal depending on the text and the type of changes that were made. Therefore, humans were included in the loop by having the script “propose” both correct and incorrect survey items within each cluster. A sample output is given here:

Figure 2. Example of output code for sets of related questions.

This process was simple and required little manual effort. However, without a gold standard for comparison, the exact accuracy of the algorithm cannot be stated with confidence. The code is available in the Github repository in the Resources section. Of the missing values remaining after the first cross M-F matching, the cross-year matching reduced the number of missing values from 45% to 39%.
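
The NLTK-based matcher itself is in the linked repository; as a dependency-free illustration of the thresholding idea, a simple token-overlap (Jaccard) version might look like:

```python
def jaccard(a, b):
    """Token-overlap similarity between two question texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def propose_matches(questions, threshold=0.6):
    """Greedily cluster question texts whose similarity exceeds a
    threshold; clusters are then reviewed by a human, as in the post."""
    clusters = []
    for q in questions:
        for cluster in clusters:
            if jaccard(q, cluster[0]) >= threshold:
                cluster.append(q)
                break
        else:
            clusters.append([q])
    return clusters

questions = [
    "How often did you read books to your child?",
    "How often do you read books to your child?",
    "Did you receive job training last year?",
]
clusters = propose_matches(questions)
```

A too-high threshold splits genuinely related questions; a too-low one merges unrelated ones, which is why the human review step matters.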

4. Output Specific.

The last major addition to the imputing strategy was to mimic the output measure being investigated as best as possible. All of the model outputs (outcomes) were derived from survey features and made public on the Fragile Families Blog (e.g., Material Hardship). For most of the outputs, except for GPA, there was a history of previous responses from surveys in previous waves. Therefore, for each of the outputs, a feature was made to correspond to the output in each survey where applicable. This was particularly helpful for features like Material Hardship that were formed from multiple survey questions and had the added effect of acting like an “OR”. Consequently, this is where the biggest performance gain was seen, but had little effect on the number of missing values.

After the steps above were applied, the remaining 38% of missing values were imputed with the most frequent value of each feature. All features exhibiting no entropy (all the same value) were removed.
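
These two final steps, mode imputation and dropping zero-entropy features, can be sketched as follows (toy data, not the FFC feature set):

```python
import pandas as pd

df = pd.DataFrame({
    "f1": [1, 1, None, 2],
    "f2": [5, 5, 5, 5],       # constant (zero entropy) -> dropped
    "f3": [0, None, 0, 1],
})

# Fill remaining missing values with each feature's most frequent value.
imputed = df.fillna(df.mode().iloc[0])

# Drop features that take only a single value.
nonconstant = imputed.loc[:, imputed.nunique() > 1]
```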

Results

The training and validation phase of modeling showed that linear regression models were best for the ordinal outputs: GPA, Grit, and Material Hardship. The remaining model outputs were best fit by logistic regressions. Although L1-regularization was implemented, for many of the outputs the features were reduced to include only subjectively relevant features. For the cases of Grit and Material Hardship, the features corresponding to the definition of the measure were picked. The feature combinations are too many to list here, but they are shown in the code linked in the Resources section. Admittedly, this is not a fully automated procedure nor one grounded in theory, and it is very likely to vary between researchers. However, I contend that this is evidence that we need to consider the larger model-system that includes both model design and resource constraints such as time. This will help us better understand how model development decisions impact the result and final implementation.

To fully understand the cost/benefit of the above imputation strategy, one would need to conduct an ablation study and include other methods of imputation. Due to time constraints, that was not possible. But, from the design, matching the outputs appeared to show the greatest performance increase during the validation phase. As an approximate indicator of performance, the mean squared error (MSE) and rank of each model using this data set is provided relative to the baseline here: FFC Results. Of note, the model ranked 5th and 9th in the Material Hardship and Layoff outcomes, respectively, but there were many better-performing models. So, there is still an open question about the utility of this strategy in terms of overall performance, interpretability of imputing, and similarity of individual sample outputs.

What Next?

The work described above focuses on how data was imputed and selected to fill in missing values for the Fragile Families Challenge. However, more detailed analysis needs to be completed in order to reason about the strategy (or any strategy) with respect to the data, the challenge results, and the models themselves. This is currently ongoing and anticipated to be discussed in the forthcoming Socius submission as well as during the talk at the FFC Workshop on November 16th, 2017.

Author Details

Brian J. Goode, Discovery Analytics Center, Virginia Tech

I would like to thank Dichelle Dyson and Samantha Dorn for their help.

Resources

Github Repository: https://github.com/bjgoode/ffc-public

Upcoming event: The Future of Big Data and Survey Methods

We are excited to have a chance to discuss the Fragile Families Challenge as part of a panel at the University of Michigan, Institute for Social Research. The title of the panel is: The Future of Big Data and Survey Methods. Please join us at the event. More information is below.

Description:
New Data Science methods and mass collaborations pose both exciting opportunities and important challenges for social science research. This panel will explore the relationship between these new approaches and traditional survey methodology. Can they coexist, or even enrich one another? Matthew Salganik is one of the lead organizers of the Fragile Families Challenge, which uses data science approaches such as predictive modeling, mass collaboration, and ensemble techniques. Jeremy Freese is co-PI of the General Social Survey and of a project on collaborative research in the social sciences. Colter Mitchell has conducted innovative work combining biological data and methods with Fragile Families and other survey data sets.

Sponsored by the Computational Social Science Rackham Interdisciplinary Workshop and the Population Studies Center’s Freedman Fund.

Friday, 10/6/2017, 3:10 PM
Location: 1430 ISR-Thompson

Correction to prize winners

When newspapers have to correct a published article, they issue a correction that notes the errors in the prior version and how they have been corrected. Following this logic, this blog post explains a correction we have made to the prize winners blog.

At the close of the Challenge, one team (MDRC) mistakenly believed that the submission deadline, listed as 6pm UTC on Codalab, was 6pm Eastern Time. After the close of the Challenge at 2pm, they were unable to upload their submission. They emailed us very soon after the 2pm deadline indicating that they had misunderstood. Our Board of Advisors reviewed the case carefully and decided to accept this submission. We made this decision before we opened the holdout data.

When we actually evaluated the submissions with the holdout data, we downloaded all final submissions from Codalab and neglected to add the e-mailed MDRC submission to the set. The team noticed they were not on the final scores page and emailed us to ask. A week after opening the holdout set, we added their submission to the set, re-evaluated all scores, and discovered that this team had achieved the best score in eviction and job training, two prizes we had already awarded to other teams.

In consultation with our Board of Advisors, we decided to do three things.

First, we updated the final prize winners to recognize MDRC.

Second, we recognized that this was an unusual situation. Other teams had rushed to the 2pm deadline and might have scored better with a few extra hours of work. For this reason, we decided to create a new category: special honorary prizes. If MDRC won for an outcome, the second-place team (i.e. the team that was in first place at the close of the Challenge at 2pm) would be awarded a special honorary prize.

Third, we updated the prize winners figure and score ranks to include MDRC along with all submissions previously included.

All prize winners (final, progress, innovation, foundational, and special honorary) are invited to an all-expense-paid trip to Princeton University to present their findings at the scientific workshop.

Prize winners

The Fragile Families Challenge received over 3,000 submissions from more than 150 teams between the pilot launch on March 3, 2017, and the close on August 1, 2017. Each team’s final submission score on the holdout set is provided at this link. In this blog post, we are excited to announce the prize winners!

Final prizes

We are awarding prizes to the top-scoring submissions for each outcome, as measured by mean-squared error. The winners are:

  • GPA: sy (MIT Media Lab, Human Dynamics Group: Abdullah Almaatouq, Eaman Jahani, Daniel Rigobon, Yoshihiko Suhara, Khaled Al-Ghoneim, Abdulla Alhajri, Abdulaziz Alghunaim, Alfredo Morales-Guzman)
  • Grit: sy (MIT Media Lab, Human Dynamics Group: Abdullah Almaatouq, Eaman Jahani, Daniel Rigobon, Yoshihiko Suhara, Khaled Al-Ghoneim, Abdulla Alhajri, Abdulaziz Alghunaim, Alfredo Morales-Guzman)
  • Material hardship: haixiaow (Diana Stanescu, Erik H. Wang, and Soichiro Yamauchi; Ph.D. students, Department of Politics, Princeton University)
  • Eviction: MDRC (Kristin Porter, Richard Hendra, Tejomay Gadgil, Sarah Schell, and Meghan McCormick)
  • Layoff: Pentlandians (MIT Media Lab, Human Dynamics Group: Abdullah Almaatouq, Eaman Jahani, Daniel Rigobon, Yoshihiko Suhara, Khaled Al-Ghoneim, Abdulla Alhajri, Abdulaziz Alghunaim, Alfredo Morales-Guzman)
  • Job training: MDRC (Kristin Porter, Richard Hendra, Tejomay Gadgil, Sarah Schell, and Meghan McCormick)

Progress prizes

As promised, we are also awarding progress prizes to the top-scoring submissions for each outcome among submissions made by May 10, 2017, at 2pm Eastern Time. The following teams had the best submissions as of this deadline:

  • GPA: ovarol (Onur Varol, postdoctoral researcher at the Center for Complex Network Research, Northeastern University Networks Science Institute)
  • Grit: rap (Derek Aguiar, Postdoctoral Researcher, and Ji-Sung Kim, Undergraduate Student, Department of Computer Science, Princeton, NJ)
  • Material hardship: ADSgrp5
  • Eviction: kouyang (Karen Ouyang and Julia Wang, Princeton Class of 2017)
  • Layoff: the_Brit (Professor Stephen McKay, School of Social & Political Sciences, University of Lincoln, UK)
  • Job training: nmandell (Noah Mandell, Ph.D. candidate in plasma physics at Princeton University)

Foundational award

Greg Gunderson (ggunderson) produced machine-readable metadata that turned out to be very helpful for many participants. You can read more about the machine-readable metadata in our blog post on the topic. In addition to being useful to participants, this contribution was also inspirational for the Fragile Families team. They saw what Greg did and wanted to build on it. A team of about 8 people is now working to standardize aspects of the dataset and make more metadata available. Because Greg provided a useful tool for other participants, open-sourced all aspects of the tool, and inspired important changes that will make the larger Fragile Families project better, we are awarding him the foundational award.

Innovation awards

The Board of Advisors of the Fragile Families Challenge would also like to recognize several teams for particularly innovative contributions to the Challenge. For these prizes, we only considered teams that were not already recognized for one of the awards above. Originally, we planned to offer two prizes: “most novel approach using ideas from social science” and “most novel approach using ideas from data science.” Unfortunately, this proved very hard to judge because many of the best submissions combined data science and social science.

Therefore, after much deliberation and debate, we have decided to award two prizes for innovation. These submissions each involved teams of people working collaboratively. Each team thought carefully about the raw data and cleaned variables manually to provide useful inputs to the algorithm, much as a social scientist typically would. Each team then implemented well-developed machine learning approaches to yield predictive models.

We are recognizing the following teams:

  • bjgoode (Brian J. Goode, Virginia Tech, acknowledging Dichelle Dyson and Samantha Dorn)
  • carnegien (Nicole Carnegie, Montana State University, and Jennifer Hill and James Wu, New York University)

We are encouraging these teams to prepare blog posts and manuscripts to explain their approaches more fully. To be clear, however, there were many, many innovative submissions, and we think that a lot of creative ideas were embedded in code and hard to extract from the short narrative explanations. We hope that all of you will get to read about these contributions and more in the special issue of Socius.

Special honorary prizes

As explained in our correction blog post, our Board of Advisors decided to accept a submission that arrived shortly after the deadline, because of confusing statements on our websites about the hour at which the Challenge closed. This team (mdrc) had the best score for two outcomes (eviction and job training) and was awarded the final prize for each of these outcomes. Because we recognize that this was an unusual situation, we are awarding special honorary prizes to the second-place teams for each of these outcomes.

  • Eviction: kouyang (Karen Ouyang and Julia Wang, Princeton Class of 2017)
  • Job training: malte (Malte Moeser, Ph.D. student, Department of Computer Science, Princeton University)

Conclusion

Thank you again to everyone who participated. We look forward to more exciting results to come in the next steps of the Fragile Families Challenge, and we hope you will join us for the scientific workshop (register here) at Princeton University on November 16-17!

Understanding your score on the holdout set

We were excited to release the holdout scores and announce prize winners for the Fragile Families Challenge. Our guess is that some people were pleasantly surprised by their scores and that some people were disappointed. In this post, we provide more information about how we constructed the training, leaderboard, and holdout sets, and some advice for thinking about your score. Also, if you plan to submit to the special issue of Socius—and you should—you can request scores for more than just your final submission.

Constructing the training, leaderboard, and holdout set

In order to understand your score, it is helpful to know a bit more about how we constructed the training, leaderboard, and holdout sets. We split the data into three sets: 4/8 training, 1/8 leaderboard, and 3/8 holdout.

In order to make each dataset as similar as possible, we selected them using a form of systematic sampling. We sorted observations by city of birth, mother’s relationship with the father at birth (cm1relf), mother’s race (cm1ethrace), whether at least 1 outcome is available at age 15, and the outcomes at age 15 (in this order): eviction, layoff of a caregiver, job training of a caregiver, GPA, grit, and material hardship. Once observations were sorted, we moved down the list in groups of 8 observations at a time and, for each group, randomly selected 4 observations to be in the training set, 1 to be in the leaderboard set, and 3 to be in the holdout set. This systematic sampling helped reduce the chance of us getting a “bad draw” whereby the datasets would differ substantially due to random chance.
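
A sketch of this group-of-8 systematic split is below, assuming the rows have already been sorted on the stratification variables; the implementation details are illustrative, not the Challenge's actual code:

```python
import random

def systematic_split(sorted_ids, seed=0):
    """Assign each consecutive block of 8 sorted rows randomly
    into 4 training, 1 leaderboard, and 3 holdout observations."""
    rng = random.Random(seed)
    assignment = {}
    for start in range(0, len(sorted_ids), 8):
        block = sorted_ids[start:start + 8]
        labels = ["train"] * 4 + ["leaderboard"] * 1 + ["holdout"] * 3
        labels = labels[:len(block)]   # last block may be short
        rng.shuffle(labels)
        for row_id, label in zip(block, labels):
            assignment[row_id] = label
    return assignment

ids = list(range(16))   # stand-in for rows already sorted by strata
split = systematic_split(ids)
```

Because every sorted block of 8 contributes to all three sets, the sets stay balanced on the sorting variables even in a small sample.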

All three datasets—training, leaderboard, holdout—include cases for which no age 15 outcomes have been collected yet. We decided to include these cases because data might be collected from them in the future and for some methodological research it might be interesting to compare predictions even if the truth is not known.

For the cases with no outcome data in the leaderboard set—but not the training and holdout sets—we added randomly imputed outcome data. We did this by randomly sampling outcomes with replacement from the observed outcomes in the leaderboard set. For example, the leaderboard included 304 observed cases for GPA and 226 missing cases imputed by random sampling with replacement from the observed cases.
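In code, this imputation amounts to sampling with replacement from the observed values; the GPA values below are made up for illustration.

```python
import random

def impute_leaderboard_outcomes(observed, n_missing, seed=0):
    """Fill missing leaderboard outcomes by sampling with replacement
    from the observed outcomes for the same variable."""
    rng = random.Random(seed)
    return [rng.choice(observed) for _ in range(n_missing)]

# For GPA: 304 observed cases, 226 missing cases to fill in
observed_gpa = [2.5, 3.0, 3.25, 2.75]  # illustrative values only
imputed = impute_leaderboard_outcomes(observed_gpa, 226)
```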

Randomly imputing outcome data is a bit unusual. Our main reason for setting up the leaderboard this way was to develop a method for assessing model overfitting without opening the holdout set. In scientific challenges like the Fragile Families Challenge, participants can continuously improve their leaderboard scores over time, providing the appearance of real progress in constructing a good model. But, when assessed with the holdout set, that progress turns out to be an illusion: the final score is much worse than expected. This scenario is what happens when participants overfit their models to the leaderboard data. Because of this property, the leaderboard is a bad measure of progress: it misleads participants about the quality of their models. So, when calculating leaderboard score we used both real outcome data and randomly imputed outcome data. The imputed subset is randomly drawn, which means that score improvement on those observations over time is a clear indicator of overfitting to the leaderboard set. By disaggregating leaderboard scores into a real data score and an imputed data score behind the scenes, we were able to model how well participant submissions would generalize without looking at the holdout set.

If you would like to learn more about the problems with leaderboard scores, Moritz Hardt, a member of our Board of Advisors, has a paper on this problem: http://proceedings.mlr.press/v37/blum15.html.

Interpreting your score on the holdout set

You might be pleasantly surprised that your score on the holdout set was better than your score on the leaderboard set, or you might be disappointed that it was worse. Here are a few reasons that they might be different:

Overfitting the leaderboard: One reason that your performance might be worse on the holdout set is overfitting to the leaderboard. If you made many submissions and your submissions seemed to be improving, this might not actually be real progress. We had an earlier post on how the leaderboard can be misleading. Now that the first stage of the Challenge is complete, when you request scores on the holdout set, we will send you a graph of your performance over time on the real outcome data in the leaderboard as well as your performance on the imputed outcome data in the leaderboard. If your performance seems to be improving over time on the imputed outcome data, that is a sign that you have been overfitting to the leaderboard.

For example, consider these two cases, where the red line shows performance on the imputed leaderboard data and the blue line shows performance on the real leaderboard data.

In the first case, the performance on the real data improved (remember lower is better) and performance on the imputed data did not improve. In the second case, however, performance on the imputed data improves quite a bit, while performance on the real data remains relatively static. In this case, we suspect overfitting to the leaderboard, and we suspect that this person will perform worse on the holdout set than the leaderboard set.

Noisy leaderboard signal: One reason that your score might be better on the holdout set is that the leaderboard set included the randomly imputed outcome data. Your predictions for these cases were probably not very good, and there are no randomly imputed outcome cases in the holdout set (cases with no outcome data in the holdout set are ignored).

Random variation: One reason that your score on the holdout set could be higher or lower is that there are not a huge number of cases in the leaderboard set (530 people) or the holdout set (1,591 people). Also, roughly one third of these cases are missing outcome data. With this few cases, you should expect your score to vary somewhat from dataset to dataset.

Conclusion

We hope that this background information about the construction of the training, leaderboard, and holdout sets helps you understand your score. If you have any more questions, please let us know.

Fragile Families Challenge, next steps


Stage one of the Fragile Families Challenge, the predictive modeling stage, ended today at 2pm ET. We are grateful to everyone who participated. This is not, however, the end of the Fragile Families Challenge. In fact, there are many important and exciting things to come. We will be:

We are looking forward to all of the next steps in the Fragile Families Challenge.

Fragile Families Challenge Scientific Workshop, Nov 16 & 17


We are happy to announce the Fragile Families Challenge Scientific Workshop will take place November 16th and 17th (Thursday and Friday) at Princeton University. The workshop is open to everyone interested in the Challenge, and we will be livestreaming it for people who are not able to travel to Princeton (livestream link, no registration required).

On Thursday, we will meet in Palmer House (map). On Friday, we will meet in Wallace Hall 300 (map).

The schedule will be:

  • Thursday: Workshop of submissions to the special issue of Socius and breakout projects. All are welcome regardless of whether you have written a paper.
  • Friday: Presentations by prize winners. All are welcome to attend, regardless of whether you have won a prize.

If you plan to join us, please complete the registration form.

Thursday, November 16

The first day of the Fragile Families Challenge Scientific Workshop will be devoted to: 1) workshopping papers submitted to the Special Issue of Socius on the Fragile Families Challenge and 2) working on breakout projects.

Workshopping papers

Before the workshop, each participant will be sent 3 papers to read. Participants are expected to read these papers before arriving at the workshop. This will ensure that we have a lively and focused discussion. If you are unable to read the papers ahead of time, please let us know and plan to arrive at lunch time.

Then at the workshop, each paper will be discussed for 45 minutes in a series of parallel roundtable sessions. There will be no presentations by the authors because everyone will have read the papers ahead of time. Instead, each session will begin with very brief comments by a pre-assigned moderator, followed by a group discussion facilitated by the moderator. We expect these to be lively, fascinating, and generative discussions.

  • “Black Box Models and Sociological Explanations: Predicting GPA Using Neural Networks” by Thomas Davidson
  • “Humans in the Loop: Priors and Missingness on the Road to Prediction” by Connor Gilroy, Anna Filippova, Ridhi Kashyap, Antje Kirchner, Allison Morgan, Kivan Polimis, Adaner Usmani, and Tong Wang
  • “Privacy, ethics, and high-dimensional social science data: A case study of the Fragile Families Challenge” by Ian Lundberg, Arvind Narayanan, Karen E.C. Levy, and Matthew J. Salganik
  • “Making the analysis of complex survey data more efficient, reliable, and enjoyable: A case study from the Fragile Families Challenge” by Alexander T. Kindel, Kristin Catena, Tom Hartshorne, Kate Jaeger, Dawn Koffman, Sara S. McLanahan, Maya Phillips, Shiva Rouhani, and Matthew J. Salganik
  • “Modeling and Decision Making with Social Systems: Lessons Learned from the Fragile Families Challenge” by Brian Goode, Debanjan Datta, and Naren Ramakrishnan
  • “The Pentlandians ensemble: Winning models for GPA, grit, and layoff in the Fragile Families Challenge” by Daniel Rigobon, Eaman Jahani, Yoshihiko Suhara, Khaled AlGhoneim, Abdulaziz Alghunaim, Alex Pentland, and Abdullah Almaatouq
  • “Predicting material hardship using machine learning” by Erik H. Wang, Diana Stanescu, and Soichiro Yamauchi
  • “The challenges of data science from social science: Using social science knowledge in the Fragile Families Challenge” by Stephen McKay
  • “Predictive features of children GPA in Fragile Families” by Naijia Liu, Hamidreza Omidvar, and Jinjin Zhao
  • “Variable selection and parameter tuning for BART modeling in the Fragile Families Challenge” by James Wu and Nicole Carnegie

Breakout activities

With so many amazing people all in one place, we also wanted to leave time for ideas that you propose, either ahead of time or as a result of the workshopping of the papers. Any participant can propose an idea, and then people can choose which one they want to work on. We’re also going to propose the following projects:

  • Code walkthrough for the new Fragile Families metadata API (led by Maya Phillips)
  • Demo and testing for the new metadata website (led by Alex Kindel)
  • Testing and improving the Docker container that ensures reproducibility of the special issue (led by David Liu)
  • Assessing test-retest reliability of concept tags (led by Kristin Catena)
  • Help us digitize question and answer texts (led by Tom Hartshorne)

Schedule for Thursday

  • 08:30 – 09:00 Breakfast
  • 09:00 – 09:15 Intro
  • 09:15 – 10:00 Round 1 of papers
  • 10:00 – 10:15 Break
  • 10:15 – 11:00 Round 2 of papers
  • 11:00 – 11:15 Break
  • 11:15 – 12:00 Round 3 of papers
  • 12:00 – 1:00 Lunch
  • 1:00 – 1:30 Discussion of project ideas
  • 1:30 – 5:00 Breakout activities
  • 5:00 – 6:00 Break
  • 6:00 – ??? Dinner

Friday, November 17

The second day of the Fragile Families Challenge Scientific Workshop will be devoted to presentations from the organizers and prize winners.

  • 8:30 – 9:00. Breakfast
  • 9:00 – 9:45. Welcome, Overview of the Fragile Families Challenge
    • Matthew J. Salganik, Professor of Sociology, Princeton University
    • Sara S. McLanahan, William S. Tod Professor of Sociology and Public Affairs at Princeton University and Principal Investigator of the Fragile Families and Child Wellbeing Study
  • 9:45 – 10:00. Break
  • 10:00 – 11:15. Presentations from progress prize winners and discussion
    • Onur Varol, Postdoctoral Researcher, Center for Complex Network Research, Networks Science Institute, Northeastern University
    • Julia Wang, Princeton University
    • Stephen McKay, School of Social & Political Sciences, University of Lincoln, UK
  • 11:15 – 11:30. Break
  • 11:30 – 12:00. Presentation from foundational prize winner and discussion
    • Gregory Gundersen, PhD Student in Computer Science, Princeton University
  • 12:00 – 1:00. Lunch
  • 1:00 – 2:00. Presentations from innovation prize winners and discussion
    • Nicole Carnegie, Assistant Professor of Statistics, Montana State University
    • Brian J. Goode, Research Scientist, Discovery Analytics Center, Virginia Tech
  • 2:00 – 2:30. Break
  • 2:30 – 4:00 Presentations from final prize winners and discussion
    • Kristin Porter, Senior Associate, MDRC
    • Diana Stanescu, Erik H. Wang, and Soichiro Yamauchi, Ph.D. students, Department of Politics, Princeton University
    • Abdullah Almaatouq (MIT), Eaman Jahani (MIT), Daniel E. Rigobon (MIT), Yoshihiko Suhara (Recruit Institute of Technology and MIT)
  • 4:00 – 4:30. Break
  • 4:30 – 5:00. What’s next
    • Matthew J. Salganik, Professor of Sociology, Princeton University
    • Sara S. McLanahan, William S. Tod Professor of Sociology and Public Affairs at Princeton University and Principal Investigator of the Fragile Families and Child Wellbeing Study

We will update this page as we have more information. If you have any questions about the Scientific Workshop, please email us.

Getting scores on holdout data


As described in an earlier blog post, there will be a special issue of Socius devoted to the Fragile Families Challenge. We think that the articles in this special issue would benefit from reporting their scores on both the leaderboard data and the holdout data. However, we don’t want to release the holdout data on August 1 because that could lead to non-transparent reporting of results. Therefore, beginning on August 1, we will do a controlled release of the scores on the holdout data. Here’s how it will work:

  • All models for the special issue must be submitted by August 1.
  • Between August 1 and October 16 (extended from the original October 1 deadline), you can complete a web form requesting scores on the holdout data for a list of the models. We will send you those scores.
  • You must report all the scores you requested in your manuscript or the supporting online material. We are requiring you to report all the scores that you request in order to prevent selective reporting of especially good results.

We realize that this procedure is a bit cumbersome, but we think that this extra step is worthwhile in order to ensure the most transparent reporting possible of results.

Submit your request for scores here.

Event at the American Sociological Association Meeting


We are happy to announce that there will be a Fragile Families Challenge event Sunday, August 13 at 2pm at the American Sociological Association Annual Meeting in Montreal. We will gather at the Fragile Families and Child Wellbeing Study table in the Exhibit Hall (220c). We are the booth in the back right (booth 925). This will be a great chance to meet other participants, share experiences, and learn more about the next stages of the mass collaboration and the Fragile Families study more generally. See you in Montreal!

A Data Pipeline for the Fragile Families Challenge


Guest blog post by Anna Filippova, Connor Gilroy, and Antje Kirchner

In this post, we discuss the challenges of preparing the Fragile Families data for modeling, as well as the rationales for the methods we chose to address them. Our code is open source, and we hope other Challenge participants find it a helpful starting point.

If you want to dive straight into the code, start with the vignette here.

Data processing

The people who collect and maintain the Fragile Families data have years of expertise in understanding the data set. As participants in the Fragile Families Challenge, we had to use simplifying heuristics to get a grasp on the data quickly, and to transform as much of it as possible into a form suitable for modeling.

A critical step is to identify different variable types, or levels of measurement. This matters because most statistical modeling software transforms categorical covariates into a series of k – 1 binary variables, while leaving continuous variables untransformed. Because categorical variables are stored as integers, with associated strings as labels, a researcher could just use those integers directly in a model instead—but there is no guarantee that they would be substantively meaningful. For interpretation, and potentially for predictive performance, accounting for variable type is important.
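To make the k – 1 expansion concrete, here is what most statistical software does behind the scenes, shown explicitly with pandas (the toy values are illustrative; only the variable names come from the Fragile Families data):

```python
import pandas as pd

df = pd.DataFrame({
    "cm1ethrace": ["white", "black", "hispanic", "black"],  # categorical
    "cm1hhinc": [32000.0, 18000.0, 25000.0, 40000.0],       # continuous
})

# drop_first=True yields k - 1 binary columns per categorical variable
# (here k = 3 levels -> 2 dummy columns); continuous columns pass through
encoded = pd.get_dummies(df, columns=["cm1ethrace"], drop_first=True)
```

Treating the stored integer codes as numeric instead would impose an arbitrary ordering on the categories, which is exactly the risk the paragraph above describes.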

This seems like a straightforward problem. After all, it is typically clear from the codebook description whether a given variable is categorical or continuous. With a handful of variables, classifying them manually is trivial, but it is infeasible with over 12,000 variables. An automated solution that works well for the majority of variables is to leverage properties of the Stata labels, using the haven package, to convert each variable into the appropriate R class—factor for categorical variables, numeric for continuous. We previously released the results of this work as metadata, and here we put it to use.

A second problem similarly arises from the large number of variables in the Fragile Families data. While some machine learning models can deal with many more parameters than observations (p >> n), or with high collinearity among covariates, most imputation and modeling methods run faster and more successfully with fewer covariates. Particularly when demonstrating or experimenting with different modeling approaches, it's best to start with a smaller set of variables. If the constructed variables represent expert researchers' best attempts to summarize, consolidate, and standardize survey responses across waves, then those variables make a logical starting point. Fortunately, most of these variables can be identified with a simple regular expression.
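A sketch of that filtering step: constructed Fragile Families variables are prefixed with "c" before the respondent/wave code (e.g., cm1ethrace versus the raw item m1a3). The exact pattern below is our guess at such a regular expression; verify it against the official codebook before relying on it.

```python
import re

# Assumed pattern: "c" + respondent letter + wave digit, e.g. cm1..., cf2...
CONSTRUCTED = re.compile(r"^c[mfhpkotn]\d")

variables = ["cm1ethrace", "cf1hhinc", "m1a3", "f2b10", "cm2hhinc"]
constructed = [v for v in variables if CONSTRUCTED.match(v)]
```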

Finally, to prepare for imputation, Stata-style missing values (labelled negative numbers) need to be converted to R-style NAs.
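The authors work in R, where the target is an NA value; an equivalent recode in pandas (with a real variable name but made-up values) would look like this:

```python
import pandas as pd

def recode_missing(df):
    """Replace Stata-style negative missing codes (e.g., -9 ... -1)
    with NaN so downstream imputation treats them as missing."""
    return df.mask(df < 0)

df = pd.DataFrame({"cm1hhinc": [32000.0, -9.0, 25000.0, -3.0]})
clean = recode_missing(df)
```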

Missing data

Data may be missing in a (panel) study for many reasons, including respondent’s unwillingness to answer a question, a don’t know response, skip logic (for questions that do not apply to a given respondent), and panel attrition (for example, due to locating difficulties for families). Additional missing data might be due to data entry errors and—particularly relevant for the challenge—anonymization to protect sensitive information of members of a particularly vulnerable population.

What makes missing data such a challenge for computational approaches? Many statistical algorithms operate on complete data, often obtained through listwise deletion of cases. This effectively assumes that instances are missing completely at random. The Fragile Families data are not missing completely at random; moreover, the sheer amount of missingness would leave few cases remaining after listwise deletion. We would expect a naive approach to missingness to significantly reduce the predictive power of any statistical model.

Therefore, a better approach is to impute the missing data, that is, make a reasonable guess about what the missing values could have been. However, current approaches to data imputation have some limitations in the context of the Fragile Families data:

  • Standard packages like Amelia perform multiple imputation from a multivariate normal distribution, so they are unable to work on the full set of 12,000 covariates with only 4,000 observations. The process is also computationally intensive, taking several hours to run even when using a regularizing prior, subsetting the variables, and running individual imputations in parallel.
  • Another promising approach would be to use Full Information Maximum Likelihood estimation. FIML estimation models sparse data without the need for imputation, thus offering better performance. However, no open-source implementation for predictive modeling with FIML exists at present.
  • We could also use the existing structure of the data to make logical edits. For instance, if we know a mother’s age in one wave, we can extrapolate this to subsequent waves if those values are missing. Carrying this idea a step further, we can make simple model-based inferences; if, for example, a father’s age is missing entirely, we can impute this from the distribution of differences between mother’s and father’s ages. This process, however, requires treating each variable individually.
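The logical edits in the last bullet can be sketched like this. The authors' pipeline is in R, and the column names here are hypothetical stand-ins for the actual age variables:

```python
import pandas as pd

def logical_impute_ages(df):
    """Fill missing age variables using the panel structure of the data.

    Hypothetical columns: m_age_w1/m_age_w2 are mother's age at waves 1
    and 2; f_age_w1 is father's age at wave 1.
    """
    out = df.copy()
    # carry a known wave-1 age forward, adding the years between waves
    years_between = 1
    out["m_age_w2"] = out["m_age_w2"].fillna(out["m_age_w1"] + years_between)
    # impute a missing father's age from the mean mother-father age gap
    gap = (out["f_age_w1"] - out["m_age_w1"]).mean()
    out["f_age_w1"] = out["f_age_w1"].fillna(out["m_age_w1"] + gap)
    return out
```

As noted above, this treats each variable individually, which is why it does not scale to the whole dataset on its own.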

To address some of these issues, our approach to missing data considers each variable in the dataset in isolation (for example cm1hhinc, mother's reported household income at wave 1), and attempts to automatically identify other variables in the dataset that may be strongly associated with it (such as cm2hhinc, mother's reported household income at wave 2, and cf1hhinc, father's reported household income at wave 1). Assembling a set of 3 to 5 such associations per variable allows us to construct a simple multiple-regression model to predict the missing values in each column (variable) of interest.

Our approach draws on two forms of multiple-regression model: a simple ordinary least squares linear regression, and a linear regression with lasso penalization. To evaluate their performance, we compare our approach to two alternative forms of imputation: naive mean-based imputation, and imputation using the Amelia package. Holding constant the prediction method and the variables used, our regression-based approach outperforms mean imputation on the three categorical outcome variables: eviction, layoff, and job training. The lasso imputation also outperforms Amelia on these variables, but the unpenalized regression imputation has mixed results. Interestingly, mean imputation performs best for GPA and grit, and performance on material hardship was similar across mean imputation, Amelia, and linear regression, while lasso was significantly worse. Overall, even simple mean imputation performed better than Amelia on this dataset.

The approach we used comes with a number of assumptions:

  1. We assume that the best predictors of any given variable already exist in the Fragile Families dataset and do not need significant processing. This is not unreasonable: many variables in the dataset are collected across waves, so there may be predictable relationships from wave to wave.
  2. Our tests above assume a linear relationship between the predictor variables and the variable we impute, although our code has an option to also account for polynomial effects (the ‘degree’ option available when using method=’lasso’).
  3. To get complete predictions for all 4,000 cases using the regression models, we first needed to impute the means of the covariates used for the imputation. In other words, in order to fill in missing data, we paradoxically needed to first fill in missing data. FIML is one solution to this challenge, and we hope to see it make its way into predictive modeling approaches in languages like R and Python.
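Putting the pieces together, here is a simplified Python sketch of the regression-based imputation described above. The authors' released R code differs in detail; function and argument names here are ours, and the mean-imputation of predictors makes the circularity in assumption 3 explicit.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_impute(df, target, n_predictors=3):
    """Impute missing values in `target` from its most correlated columns.

    Predictors are first mean-imputed (the circularity noted above), then
    an OLS model fit on the complete rows predicts the missing entries.
    """
    # rank candidate predictors by absolute pairwise correlation
    corr = df.corr()[target].drop(target).abs()
    predictors = corr.nlargest(n_predictors).index.tolist()

    X = df[predictors].fillna(df[predictors].mean())  # mean-impute predictors
    y = df[target]
    observed = y.notna()

    model = LinearRegression().fit(X[observed], y[observed])
    out = y.copy()
    out[~observed] = model.predict(X[~observed])
    return out
```

Swapping `LinearRegression` for a lasso estimator gives the penalized variant evaluated above.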

Our pipeline

We modularized our work into two separate repositories, following the division of labor described above.

For general data processing, ffc-data-processing, which

  1. Works from the background.dta Stata file to extract covariate information.
  2. Provides helper functions for relatively fast data transformation.

For missing data imputation, FFCRegressionImputation, which

  1. Prepares the raw background.csv data and performs a logical imputation of age-related variables as we describe above.
  2. Constructs a (correlation) matrix of strengths of relationships between a set of variables.
  3. Uses the matrix to perform a regression-based prediction to impute the likely value of a missing entry.

For a technical overview of how these two bodies of code integrate with each other, check out the integration vignette. The vignette is an RMarkdown file which can be run as-is or freely modified.

The code in the vignette subsets to constructed variables, identifies those variables as either categorical or continuous, and then only imputes missing values for the continuous variables, using regression-based imputation. We chose to restrict the variables imputed for illustrative purposes, and to improve the runtime of the vignette. Users of the code can and should employ some sort of imputation strategy—regression-based or otherwise—for the categorical variables before incorporating the covariates into a predictive model.

Reflections

What seemed at the beginning to be a straightforward precursor to building predictive models turned out to have complexities and challenges of its own!

From our collaboration with others, it emerged that researchers from different fields perceive data problems very differently. A problem that might not seem important to a machine-learning researcher might strike a survey methodologist as critical to address. This kind of cross-disciplinary communication about expectations and challenges was productive and eye-opening.

In addition, the three of us came into this project with very different skillsets. We settled on R as a lingua franca, but drew on a much broader set of tools and techniques to tackle the problems posed by the Fragile Families Challenge. We would encourage researchers to explore all the programming tools at their disposal, from Stata to Python and beyond.

Finally, linking everyone’s efforts together into a single working pipeline that can be run end-to-end was a significant step by itself. Even with close communication, it took a great deal of creativity as well as clarity about desired inputs and outputs.

We hope that other participants in the Fragile Families Challenge find our tools and recommendations useful. We look forward to seeing how you can build on them!

Helpful idea: Compare to the baseline


Participants often ask us if their scores on the leaderboard are “good”. One way to answer that question is with a comparison to the baseline model.

In the course of discussing how a very simple model could beat a more complex model, this post will also discuss the concept of overfitting to the training data and how this could harm predictive performance.

What is the baseline model?

We have introduced a baseline model to the leaderboard, with the username “baseline.” Our baseline prediction file simply takes the mean of each outcome in the training data, and predicts that mean value for all observations. We provided this file as “prediction.csv” in the original data folder sent to all participants.
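Concretely, the baseline prediction can be computed in a few lines (the training values below are toy data; the real file covers all six outcomes):

```python
import pandas as pd

# the baseline simply takes the training mean of each outcome...
train = pd.DataFrame({"gpa": [2.0, 3.0, 4.0], "grit": [3.0, 3.5, 4.0]})
baseline = train.mean()

# ...and predicts that same value for every observation
n_test = 5
predictions = pd.DataFrame({col: [baseline[col]] * n_test
                            for col in train.columns})
```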

How is the baseline model performing?

As of the writing of this post (12:30pm EDT on 15 July 2017), the baseline model ranks as follows, with 1 being the best score:

  • 70 / 170 unique scores for GPA
  • 37 / 128 for grit
  • 60 / 99 for material hardship
  • 37 / 96 for eviction
  • 32 / 85 for layoff
  • 30 / 87 for job training

In all cases except for material hardship, the baseline model is in the top half of scores!

A quick way to evaluate the performance of your model is to see the extent to which it improves over the baseline score.

How can the baseline do so well?

How can a model with no predictors outperform a model with predictors? One source of this conundrum is the problem of overfitting.

As the complexity of a model increases, the model becomes more able to fit the idiosyncrasies of the training data. If these idiosyncrasies represent something true about the world, then the more complex fit might also produce better predictions in the test data.

However, at some point, a complex model will begin to pick up random noise in the training data. This will reduce prediction error in the training sample, but can make predictions worse in the test sample!


Note: Figure inspired by Figure 7.1 in The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, which provides a more thorough overview of the problem of overfitting and the bias-variance tradeoff.

How can this be? A classical result in statistics shows that mean squared prediction error can be decomposed into squared bias plus variance (plus an irreducible error term). Thus, even if additional predictors reduce the bias of the predictions, they can harm predictive performance if they substantially increase the variance of the predictions by incorporating random noise.
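In symbols (standard notation, not specific to the Challenge): for a prediction f̂(x) of an outcome y = f(x) + ε with noise variance σ², the decomposition is

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\big[\hat{f}(x)\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The baseline model has high bias but zero variance; a complex model trades bias for variance, and past some point the trade stops paying off.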

What can be done?

We have been surprised at how a seemingly small number of variables can yield problems of overfitting in the Fragile Families Challenge. A few ways to combat this problem are:

  • Choose a small number of predictors carefully based on theory
  • Use a penalized regression approach such as LASSO or ridge regression.
    • For an intuitive introduction to these approaches, see Efron and Hastie’s Computer Age Statistical Inference [book site], sections 7.3 and 16.2.
    • The glmnet package in R [link] is an easy-to-use implementation of these methods. Many other software options are also available.
  • Use cross-validation to estimate your model’s generalization error within the training set. For an introduction, see chapter 12 of Efron and Hastie [book site].

But at minimum, compare yourself to the baseline to make sure you are doing better than a naive prediction of the mean!

Metadata about variables


We are happy to announce that Challenge participant Connor Gilroy, a Ph.D. student in Sociology at the University of Washington, has created a new resource that should make working with the Challenge data more efficient. More specifically, he created a csv file that identifies each variable in the Challenge data file as either categorical, continuous, or unknown. Connor has also open sourced the code that he used to create the csv file. We’ve had many requests for such a file, and Connor is happy to share his work with everyone! If you want to check and improve Connor’s work, please consult the official Fragile Families and Child Wellbeing Study documentation.

Connor’s resource is part of a tradition during the Challenge whereby people have open sourced resources to make the Challenge easier for others. Other resources include:

If you have something that you’d like to open source, please let us know.

Finally, Connor’s work was part of a larger team project at the Summer Institute in Computational Social Science to build a full data-processing pipeline for the Fragile Families Challenge. Stay tuned for that blog post on Tuesday, July 18!

Call for papers, special issue of Socius about the Fragile Families Challenge


Socius Call for Papers
Special issue on the Fragile Families Challenge
Guest editors: Matthew J. Salganik and Sara McLanahan

Socius, an open access journal published by the American Sociological Association, will publish a special issue on the predictive modeling phase of the Fragile Families Challenge. All participants in the Fragile Families Challenge are invited to submit a manuscript to this special issue.

A strong manuscript for the special issue will describe the process of creating a submission to the Challenge and will describe what was learned during that process. For example, a strong manuscript will describe the different approaches that were considered for data preprocessing, variable selection, missing data, model selection, and any other steps involved in creating the final submission to the Challenge. Further, a strong manuscript will also describe how the authors decided among the many possible approaches. Finally, some manuscripts may seek to draw more general lessons about social inequality, families, the common task method, social science, data science, or computational social science. Manuscripts should be written in a style that is accessible to a general scientific audience.

The editors of the special issue may also consider other types of manuscripts that are consistent with the scientific goals of the Fragile Families Challenge. If you are considering submitting a manuscript different from what is described above, please contact the editors of the special issue at fragilefamilieschallenge@gmail.com before submitting your manuscript.

All papers will be peer reviewed, and publication is not guaranteed. However, there is no limit on the number of articles that will be accepted in the special issue. All published papers must abide by the terms and conditions of the Fragile Families Challenge, and must be accompanied by open source code and a data file containing predictions.

Submissions for the special issue must be received through the Socius online submission platform by Monday, October 16, 2017 at 11:59pm ET (extended from the original deadline of Sunday, October 1). If you have any questions about the special issue, please email fragilefamilieschallenge@gmail.com.

FAQ:

  • Do I need to describe an approach to predicting all six outcome variables in order to submit to the special issue?
  • No. We will happily consider papers that focus on one specific outcome variable.

  • Do I need to have a low mean-squared error in order for my paper to be published?
  • No. Predictive performance in the held-out dataset is only part of what we will consider. For example, a paper that clearly shows that many common strategies were not very effective would be considered a valuable contribution.

  • What if I can’t afford the Article Processing Charge?
  • Socius, like most open access journals, has an Article Processing Charge. This charge is required to keep Socius running, and it is in line with the charges at other open access journals. However, we strongly believe that the Article Processing Charge should not be a barrier to scientific participation. Therefore, the Fragile Families Challenge project will pay the Article Processing Charge for all accepted articles submitted by everyone except for tenure-track (or equivalent) faculty working in universities in OECD countries. In other words, we will cover the Article Processing Charge for undergraduates, graduate students, post-docs, and people working outside of universities. Further, we will pay the Article Processing Charge for all tenure-track (or equivalent) faculty working in universities outside the OECD.

    If for any reason you think that the Article Processing Charge may be a barrier to your participation, please send us an email and we will try to find a solution: fragilefamilieschallenge@gmail.com.

  • How will you decide what manuscripts to accept for publication?
  • Articles in Socius are judged by four criteria: Accuracy, Novelty, Interest, and Presentation. In the case of this special issue, these criteria will be judged by the editors of the special issue, with feedback from reviewers and the editors of Socius. For the purposes of this special issue, here is how these criteria will be interpreted:

    • Accuracy: The key question is whether this analysis was conducted appropriately and accurately. Were the techniques used in the manuscript performed and interpreted correctly? Do the claims in the manuscript match the evidence provided?
    • Novelty: The key question is whether the manuscript will be novel to some social scientists or some data scientists. Because projects like the Fragile Families Challenge are not yet common, we expect that most submitted manuscripts will be somewhat novel.
    • Interest: The key question for the editors is whether the manuscript will be interesting to some social scientists or some data scientists. Will some people want to read this paper? Does it advance understanding of the Fragile Families Challenge and related intellectual domains?
    • Presentation: The key question is whether this manuscript communicates effectively to a diverse audience of social scientists and data scientists. We will also assess whether the figures and tables are presented clearly and whether the manuscript makes appropriate use of the opportunity for supporting online information. Because these manuscripts will be short, we expect that the supporting online information will play a key role.

  • Who is the audience for these papers?
  • All papers should be written for a general scientific audience that will include both social scientists and data scientists (broadly defined). In other words, when writing your paper you should imagine an audience similar to the audience at journals such as Science and the Proceedings of the National Academy of Sciences (PNAS). We would recommend reading some articles from these journals to get a sense of this style. Manuscripts that use excessive jargon from a specific field will be asked to make revisions.

    Manuscripts should follow the length guidelines of a Report published in Science: 2,500 words, with up to 4 figures or tables. Additional materials should be included in supporting online materials. We will consider articles that deviate from these guidelines in some situations. Other aspects of the manuscript format will follow standard Socius rules.

  • Should we describe the Fragile Families Challenge in our paper?
  • No. There is no need to describe the Challenge in your paper. The special issue will have an introductory article describing the Challenge and data. You should assume that your readers will already have this background information.

  • Will the articles go through peer review?
  • Absolutely. All manuscripts will be reviewed by at least two people. Possible reviewers include: members of the board of the Fragile Families Challenge, qualified participants in the Challenge, members of the general reviewer pool at Socius, and other qualified researchers.

  • What are the requirements for the open source code?
  • The code must take the Fragile Families Challenge data files as an input and produce (1) all the figures and tables in your manuscript and supporting online materials and (2) your final predictions. The code can be written in any language (e.g., R, Stata, Python). The code should be released under the MIT license, but we will consider other permissive licenses in special situations.

  • How long will the review process take?
  • We don’t know exactly, but we are excited about having these results in the scientific literature as quickly as possible. Therefore, we will work as quickly as possible while maintaining the quality standards of the Fragile Families Challenge and Socius.

  • Will I have access to the holdout data when writing my paper? (added July 20, 2017)
  • No, but we will allow you to request scores for your models on the holdout as described in this blog post.

  • Will I have access to the Challenge data when writing my paper? (added July 27, 2017)
  • Yes. If you will submit to the Special Issue you can continue to use the Challenge data until the Special Issue is published. If you are not submitting to the Special Issue, then you should delete the Challenge data file on August 1. Finally, participants who want to continue to do non-Challenge related research with the Fragile Families and Child Wellbeing Study can, at any time, apply for access to the core Fragile Families data by following the instructions here: http://www.fragilefamilies.princeton.edu/documentation.

  • What if I want to submit to the special issue but I can’t exactly reproduce my submission to the Challenge? (added September 23, 2017)
  • Everyone in the Challenge was supposed to upload their code. But there are several reasons why participants might not be able to use that code to reproduce their submission, such as forgetting to set a random seed or changes to the packages that were used in the submission. (If you are interested, here are some general tips for promoting reproducibility: Sandve et al. (2013), “Ten Simple Rules for Reproducible Computational Research.”)

    If there is a tension between making your paper reproducible and making it match the submission to the Challenge exactly, you should opt to make your paper reproducible. If the code and predictions that you submit with your paper don’t exactly match what you submitted to the Challenge, you should include a note in your supporting online material explaining these differences and why they occurred. If this note will require additional information from us (such as the score of your reproducible results in the leaderboard data), we will provide it to you. We are happy to help you with these issues on a case-by-case basis.

  • I have another question, how can I ask it?
  • Send us an email: fragilefamilieschallenge@gmail.com.

Helpful idea: Read prior research

Uncategorized No comments
featured image

Not an expert in child development, poverty, or family sociology? Participants often wonder how they can contribute if they have no prior knowledge of these fields. Luckily, there are a few resources to bring you up to speed quickly!

Fact sheet

The Fragile Families and Child Wellbeing Study (FFCWS) Fact Sheet can quickly introduce the key findings from the broader FFCWS. For instance, the study discovered that “single” parenthood is a bit of a misnomer; about half of the unmarried parents in the sample were actually living together when the child was born! Yet many of these couples subsequently separated.

Research briefs

Looking for more detailed information on a particular subfield? The Fragile Families Research Briefs provide accessible summaries of cutting-edge research using the data.

Publication collection

Want to know how social scientists are using the data right now? The Fragile Families publication collection lists hundreds of published articles and working papers using the Fragile Families and Child Wellbeing Study. If you want to see how social scientists have used the data and get ideas for variables you may want to include in your models, the publication collection is a good place to start.

Other publications

A more exhaustive list of published resources is available here.

Helpful ideas series

This is the first in a series of blog posts with helpful ideas to help you build better models – look for more to come soon! For email notifications when we make new posts, subscribe in the box at the top right of this page.

Getting started quickly in the Fragile Families Challenge

Uncategorized No comments
featured image

Want to build your first submission to the Fragile Families Challenge in an hour? In this post, we’ll tell you the trick to getting started quickly: the constructed variables.

If you’ve never worked with the Fragile Families data before it can seem daunting. The background file contains 12,943 variables (columns) for 4,242 children (rows), but 56% of the cells in this matrix are missing! Participants often begin by trying to read all the documentation, clean all of the variables, and impute reasonable values for the missing cells. This quickly becomes demoralizing. What else can you do?

Our overall recommendation is to begin with the constructed variables. These 600 variables were “constructed” by the Fragile Families research staff in order to help future researchers, and they were constructed based on multiple reports in order to reduce missing data. For example, the variable cm1relf consolidates the key information from 5 questions asked of the mother about her relationship with the father at the birth of the child. The constructed variables are a great place to start because they:

  • represent constructs social scientists believe to be important
  • have very little missing data
  • are easy to identify because they begin with the letter c (e.g., cm1ethrace is the constructed wave 1 measure of the mother’s ethnicity and race)
    • There are a small number of exceptions to this convention. For instance, the variable t5tint is a constructed variable indicating whether the teacher was interviewed in wave 5. However, the vast majority of constructed variables begin with c.
    • When we say that constructed variables have little missing data, this statement is restricted to constructed variables that have some data at all. In other words, some constructed variables are all NA in the Challenge file (e.g., cm1tdiff).

These constructed variables are more fully documented on pp. 13-20 of the general study documentation. They are also summarized in this participant-generated open-source dictionary.

A good strategy to get started quickly is to pick some constructed variables, build a very simple model, and get yourself on the leaderboard! You can always build up from there. Participants often begin with cm1ethrace, cf1ethrace, cm1edu, cf1edu, and cm1relf.
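To make the naming convention concrete, here is a small Python sketch; the list of column names is a toy stand-in for the real background file, not the actual data:

```python
# Toy stand-in for the background file's 12,943 column names.
columns = ["challengeID", "cm1ethrace", "cf1ethrace", "cm1edu", "cf1edu",
           "cm1relf", "m1a3", "f1b2", "t5tint"]

# Constructed variables (with rare exceptions like t5tint) begin with "c".
# challengeID also starts with "c", so exclude the identifier explicitly.
constructed = [c for c in columns if c.startswith("c") and c != "challengeID"]
print(constructed)  # ['cm1ethrace', 'cf1ethrace', 'cm1edu', 'cf1edu', 'cm1relf']
```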

Even if you start with the constructed variables, you will be frustrated by missing data. As summarized in our blog post, there is no perfect solution to this problem. We recommend the following workflow:

  1. Start with a small fraction of the total variables. Focus on imputing the missing values for this subset, rather than for all variables in the entire file.
  2. Decide how to address informative missing values (e.g., -6, a valid skip). For categorical variables, you might treat valid skips as their own category.
  3. Impute remaining missing values with mean or median imputation. We know that mean or median imputation aren’t great, but they are a reasonable starting point, and you can move to model-based imputation later.
  4. Fit models on your imputed dataset.
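Steps 2 and 3 of this workflow can be sketched in pandas; the column names and missing-data codes below are purely illustrative, not taken from the actual Challenge file:

```python
import pandas as pd

# Toy data: a categorical variable with valid-skip codes (-6) and a numeric
# variable with missing values.
df = pd.DataFrame({
    "relationship": [1, 2, -6, 1, -6],            # categorical; -6 = valid skip
    "education":    [3.0, None, 2.0, 4.0, None],  # numeric
})

# Step 2: keep valid skips as their own category rather than treating them
# as missing.
df["relationship"] = df["relationship"].astype("category")

# Step 3: impute the remaining numeric missing values with the median.
df["education"] = df["education"].fillna(df["education"].median())
```

From here, step 4 is just fitting whatever model you like on the imputed columns.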

Constructed variables – data dictionary

Uncategorized No comments
featured image

We are happy to announce that Challenge participants Aarshay Jain, Bindia Kalra, and Keerti Agrawal at Columbia University have created a new resource that should make working with the Challenge data more efficient. More specifically, they created an alternative data dictionary for the constructed variables (FFC_Data_Dictionary.xlsx). They have made it available open-source here.

Their dictionary:

  • Summarizes constructed variable prefixes and suffixes
  • Categorizes questions by the respondent to and subject of the question
  • Provides examples of questions from a variety of substantive categories

As discussed in our blog post on getting started quickly, the constructed variables are a good place to start when choosing variables to include in your model. These variables are summarized on pp. 13-20 of the general study documentation.

The official Fragile Families and Child Wellbeing Study site is still the authoritative source of documentation, but we hope this open source contribution helps you more quickly understand the variables available and how to find them.

The open-source movement is exciting because it unlocks what we can accomplish through collaboration. Much like a Wikipedia page benefits when hundreds of people view it and think about improvements they could make, so too will the open-source resources for the Fragile Families Challenge shine if others get involved when they think of possible improvements. If you think you can make this data dictionary better, please jump in, open-source your new version, and let us know so we can publicize it! In fact, Aarshay, Bindia, and Keerti would love to see these kinds of improvements. Likewise, we welcome any other open-source contributions that you think might make the Challenge better.

Many thanks to Aarshay, Bindia, and Keerti for making it easier for others to use the data!

Getting started workshop: Princeton and livestream

Uncategorized No comments

We will be hosting a getting started workshop at Princeton on Friday, June 23rd from 10:30am to 4pm. This workshop will also be livestreamed at this link so even if you can’t make it to Princeton you can still participate.

During the workshop we will

  • Provide a 45 minute introduction to the Challenge and the data (slides)
  • Provide food and a friendly collaborative environment
  • Work together to produce your first submission

In addition to people just getting started, we think the workshop will be helpful for people who have already been working on the Challenge and who want to improve their submission. We will be there to answer questions both in person and through Google Hangouts during the entire event.

Logistics:

  • When: Friday, June 23rd from 10:30 to 4pm ET
  • Where: Julis Romo Rabinowitz Building, Room 399 and streaming here
  • RSVP: If you have not already applied to the Challenge, please mention the getting started workshop in your application. If you have already applied, please let us know that you plan to attend (fragilefamilieschallenge@gmail.com). We are going to provide lunch for all participants, and we need to know how much food to order.
  • This getting started workshop will be a part of the Summer Institute for Computational Social Science.

Getting started with Stata

Uncategorized No comments
featured image

This post summarizes how to work on the Fragile Families Challenge data in Stata.

We only cover the basics here. For more detailed example code, see our open-source repository, thanks to Jeremy Freese.

How do I import the data?

Before loading the data, you may need to increase the number of variables Stata will hold.
set maxvar 13000

Then, change your working directory to the place where the file is located, using
cd your_directory.

Load the training outcomes
import delimited train.csv, clear case(preserve) numericcols(_all)
Two options here are critical:

  • The case(preserve) option ensures that the case of variable names is preserved. Omitting this option will produce errors in your submission since variable-name capitalization matters (e.g., challengeID), but Stata’s default makes all variable names lower case.
  • The numericcols(_all) option ensures that the outcomes are read as numeric,
    rather than as character strings.

Merge the background variables to that file using the challengeID identifier.
merge 1:1 challengeID using background.dta

  • You will see that 2,121 observations were in both datasets. These are the training observations for which we are providing the age 15 outcomes.
  • You will also see that 2,121 observations were only in the using file, since the background variables but not the outcomes are available for these cases. These are the test cases on which your predictions will be evaluated.

If you have an older version of Stata, you may not be able to open the .dta file with metadata. You can still load the background file from the .csv format. To do that, you should first load the .csv file and save it in a .dta format you can use. Then, follow the instructions above.
import delimited background.csv, clear case(preserve)
save background.dta, replace

Again, note the important case(preserve) option!

How do I make predictions?

If your model is linear or logistic regression, then you can use the predict function.
regress gpa your_predictors
predict pred_gpa

Then the variable pred_gpa has your predictions for GPA. You can do this for all 6 outcomes.

How do I export my submission?

This section assumes your predicted values are named pred_gpa, pred_grit, etc. First, select only the identifier and the predictions.
keep challengeID pred_*
Then, rename all your predictions to not have the prefix pred_
local outcomes gpa grit materialHardship eviction layoff jobTraining
foreach outcome of local outcomes {
    rename pred_`outcome' `outcome'
}

Finally, export the prediction file as a .csv.
export delimited using prediction.csv, replace
Then, bundle this with your code and narrative description as described in the blog post on uploading your contribution!

Stata .dta file with metadata

Uncategorized No comments
featured image

In response to many requests from Challenge participants, we are now able to provide a .dta file in Stata 14 format. This file contains metadata which we hope will help participants to find variables of interest more easily.

Contents of the .dta file

If you have been working with our background.csv file and the codebooks available at fragilefamilies.princeton.edu, then this .dta file provides the same information you already had, but in a new format.

  • Each variable has an associated label which contains a truncated version of the survey question text.
  • For each categorical variable, the text meaning of each numeric level of that variable is recorded with a value label.

You are welcome to build models from the .csv file or from the .dta file.

Distribution of the .dta file

All new applicants to the Challenge will receive a zipped folder containing both background.csv and background.dta.

Anyone who received the data on or before May 24, 2017 may send an email to fragilefamilieschallenge@gmail.com to request a new version of the data file.

Using the .dta file

Stata users can easily load the .dta file, which is in Stata format.

We have prepared blog posts about using the .dta file in R and about using the .dta file in Python to facilitate use of the file in these other software packages.

We hope the metadata in this file enables everyone to build better models more easily!

Using .dta files in R

Uncategorized No comments
featured image

We’ve just released the Fragile Families Challenge data in .dta format, which means the files now include metadata that was not available in the .csv files that we initially released. The .dta format is native to Stata, and you might prefer to use R. So, in this post, I’ll give some pointers to getting up and running with the .dta file in R. If you have questions—and suggestions—please feel free to post them at the bottom of this post.

There are many ways to read .dta files into R. In this post I’ll use haven because it is part of the tidyverse.

Here’s how you can read in the .dta files (and I’ll read in the .csv file too so that we can compare them):

library(tidyverse)
library(haven)
ffc.stata <- read_dta(file = "background.dta")
ffc.csv <- read_csv(file = "background.csv")

Once you start working with ffc.stata, one thing you will notice is that many columns are of type labelled, which is not common in R. To convert labelled columns to factors, use as_factor (not as.factor). For more on labelled and as_factor, see the documentation of haven.

Another thing you will notice is that some of the missing data codes from the Stata file don’t get converted to NA. For example, consider the variable "m1b9b11" for the person with challengeID 1104. This is a missing value that should be NA. This gets parsed correctly in the csv files but not the Stata file.

is.na(ffc.stata[(ffc.stata$challengeid==1104), "m1b9b11"])
is.na(ffc.csv[(ffc.csv$challengeID==1104), "m1b9b11"])

If you have questions---and suggestions---about working with .dta files in R, please feel free to post them below.

Notes:

  • The read_dta function in haven is a wrapper around the ReadStat C library.
  • The read.dta function in the foreign library was popular in the past, but that function is now frozen and will not support anything after Stata 12.
  • Another way to read .dta files into R is the readstata13 package, which, despite what the name suggests, can read Stata 13 and Stata 14 files.

Using .dta files in Python

Uncategorized No comments
featured image

To make data cleaning easier, we’ve released a version of the background variables file in .dta format, generated by Stata. In addition to the table of background data, this file contains metadata on the types of each column, as well as a short label describing the survey questions that correspond to each column. Our hope is that this version of the data file will make it easier for participants to select and interpret variables in their predictive models. If you have any questions or suggestions, please let us know in the comments below!

Working with .dta files in Python

The primary way to work with a .dta file in Python is to use the read_stata() function in pandas, as follows:

import pandas as pd

df_path = "/Users/user/FFC/data/background.dta"

# pandas can read from a path or an open file; a file object must be binary
with open(df_path, "rb") as f:
    df = pd.read_stata(f)

print(df.head())

This creates a pandas.DataFrame object that contains the background variables. By default, pandas will automatically retain the data types as defined in the .dta file.
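Since the .dta file also carries variable labels, you can pull those labels out in pandas as well. The sketch below writes a tiny toy .dta file (standing in for background.dta, whose labels come from the survey questions) and reads the labels back:

```python
import pandas as pd

# Write a toy Stata file with variable labels, standing in for background.dta.
toy = pd.DataFrame({"cm1edu": [3, 4], "cm1relf": [1, 2]})
toy.to_stata(
    "toy.dta",
    write_index=False,
    variable_labels={"cm1edu": "Mother education (constructed)",
                     "cm1relf": "Mother relationship status (constructed)"},
)

# Opening with iterator=True returns a StataReader, which exposes metadata
# in addition to the data table.
with pd.read_stata("toy.dta", iterator=True) as reader:
    labels = reader.variable_labels()

print(labels["cm1edu"])
```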

Notes

  • Documentation for pandas.read_stata() is available here.
  • The read_stata() function accepts either a file path or a read buffer (as above).
  • The .dta file is generated by Stata 14. There are some compatibility issues with pandas and Stata 14 .dta files due to changes to field size limits from earlier versions of Stata. In particular, any UTF-8 decoding errors you run into are likely due to this issue. Please let us know if you run into any trouble working with the file in pandas!

Machine-Readable Fragile Families Codebook

Uncategorized No comments

The Fragile Families and Child Wellbeing study has been running for more than 15 years. As such, it has produced an incredibly rich and complex set of documentation and codebooks. Much of this documentation was designed to be “human readable,” but, over the course of the Fragile Families Challenge, we have had several requests for a more “machine-readable” version of the documentation. Therefore, we are happy to announce that Greg Gundersen, a student in Princeton’s COS 424 (Barbara Engelhardt’s undergraduate machine learning class), has created a machine-readable version of the Fragile Families codebook in the form of a web API. We believe that this new form of documentation will make it possible for researchers to work with the data in unexpected and exciting ways.

There are three ways that you can interact with the documentation through this API.

First, you can search for words inside the question description field. For example, imagine that you are looking for all the questions that include the word “evicted”. You can find them by visiting this URL:
https://codalab.fragilefamilieschallenge.org/f/api/codebook/?q=evicted

Just put your search term after the “q” in the above URL.

The second main way that you can interact with the new documentation is by looking up the question data associated with a variable name. For example, want to know what “cm2relf” is? Just visit:
https://codalab.fragilefamilieschallenge.org/f/api/codebook/cm2relf

Finally, if you just want all of the questionnaire data, visit this URL:
https://codalab.fragilefamilieschallenge.org/f/api/codebook/
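If you plan to hit these endpoints programmatically, the three URL forms can be wrapped in a small helper. This function is purely a hypothetical convenience for your own scripts, not part of the API itself:

```python
from urllib.parse import quote

BASE = "https://codalab.fragilefamilieschallenge.org/f/api/codebook/"

def codebook_url(query=None, variable=None):
    """Build a codebook API URL: keyword search, single-variable lookup,
    or (with no arguments) the full questionnaire dump."""
    if variable is not None:
        return BASE + quote(variable)
    if query is not None:
        return BASE + "?q=" + quote(query)
    return BASE

print(codebook_url(query="evicted"))
print(codebook_url(variable="cm2relf"))
```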

A main benefit of a web API is that researchers can now interact with the codebooks programmatically through URLs. For example, here is a snippet of Python 2 code that fetches the data for question “cm2mint”:

>>> import urllib2
>>> import json
>>> response = urllib2.urlopen('https://codalab.fragilefamilieschallenge.org/f/api/codebook/cm2mint')
>>> data = json.load(response)
>>> data
[{u'source file': u'http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_mom_cb1.txt', u'code': u'cm2mint', u'description': u'Constructed - Was mother interviewed at 1-year follow-up?', u'missing': u'0/4898', u'label': u'YESNO8_mw2', u'range': [0, 1], u'unique values': 2, u'units': u'1', u'type': u'numeric (byte)'}]

We are very grateful to Greg for creating this new documentation and sharing it with everyone.

Notes:

  • Greg has open sourced all his code, so you can help us improve the codebook. For example, someone could write a nice front-end so that you can do more than just interact via the URL.
  • The machine-readable documentation should include the following fields: description, source file, type, label, range, units, unique values, missing. If you develop code that can parse some of the missing fields, please let us know, and we can integrate your work into the API.
  • The machine-readable documentation includes all the documentation that was in text files (e.g., http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_dad_cb5.txt). It does not include documentation that was in pdf format (e.g., http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_hv_cb5.pdf).
  • When you visit these urls, what gets returned is in JSON format, and different browsers render this JSON differently.
  • If there is a discrepancy between the machine-readable codebook and the traditional codebook, please let us know.
  • To deploy this service we used Flask, which is an open source project. Thank you to the Flask community.

Final submission deadline

Uncategorized No comments
featured image

The final submission deadline for the Fragile Families Challenge will be
2pm Eastern Daylight Time on Tuesday, August 1, 2017.

While it is tempting to stay open indefinitely to continue collecting high-quality submissions, closing is important so that we can conduct the targeted interviews within a reasonable timespan after the original interview, and so that the Fragile Families and Child Wellbeing Study can make the full data available to researchers.

How much should I trust the leaderboard?

Uncategorized No comments
featured image

The leaderboard on the Fragile Families Challenge submission site is often the first thing participants focus on. It is therefore important to understand!

Why do we like the leaderboard?

The leaderboard:

  • shows rankings in real-time, motivating better submissions
  • demonstrates that models that predict well in the training data do not necessarily perform well in an out-of-sample test
  • makes the Challenge more fun!

Understanding the data split

However, the leaderboard is only a small portion of the overall data. In fact, the observations (rows) in the data are split into:

  • 4/8 training data
  • 1/8 leaderboard data
  • 3/8 test data

As discussed in our blog post on evaluating submissions, final evaluation will be done on a separate set of held-out test data – the 3/8 portion referenced above. This means all awards (including the progress prizes) will be determined using the test data, not the leaderboard. Likewise, our follow-up interviews will focus on the test set observations that were not used for training. Separation between the leaderboard and test sets is important; the leaderboard set isn’t truly held out since everyone receives repeated feedback from this set throughout the Challenge!

Implications for strategy

What does this mean for your ideal strategy? How can you best make use of the leaderboard?

  • The leaderboard gives an instant snapshot of your out-of-sample performance. This can be useful in evaluating your model, much as splitting your own training set can be helpful.
  • However, over-fitting to the leaderboard will only hurt your score in the final test set evaluation.
  • Leaderboard scores are noisy measures of generalization error because they are based on a small sample. So, even as a measure of generalization error, the leaderboard should be interpreted cautiously!

In summary, we expect some models to perform better in the final evaluation than the leaderboard suggests, due to random noise. Likewise, some models will look good on the leaderboard but perform poorly in the final evaluation because they got lucky in the leaderboard. Some submissions may even under-perform in the final evaluation because they made too many modeling adjustments to fit closely to idiosyncrasies of the leaderboard!

Your final evaluation will not be based on the leaderboard, so you are best advised to use it cautiously as one (noisy) bit of information about your generalization error.

Progress report: COS424 @ Princeton

Uncategorized 1 comment
featured image

As we near the midpoint of the Challenge, we are excited to report on the progress of our first cluster of participants: student teams in COS424, the machine learning fundamentals course at Princeton. You can find some schematic analyses of their performance over time, modeling strategies, and more here. Some of the students have open-sourced their code for all participants to use and learn from; you can find that code here.

Thanks to all the COS424 students for their awesome contributions!

Progress prizes

Uncategorized 11 comments
featured image

We were glad to receive many submissions in time for the progress prizes! As described below, we have downloaded these submissions and look forward to evaluating them and determining the best submissions at the end of the Challenge.

We are excited to announce that progress prizes will be given based on the best-performing submissions as of Wednesday, May 10, 2017 at 2pm Eastern time. We will not announce the winners, however, until after the Challenge is complete.

Here’s how it will work. On May 10, 2017 at 2pm Eastern time, we will download all the submissions on the leaderboard. However, we will not calculate which submission has the lowest error on the held-out test data until after the Challenge is complete. The reason for this delay is that we don’t want to reveal any information at all about the held-out test data until after the Challenge is over.

From the submissions that we have received by May 10, 2017 at 2pm Eastern Time, we will pick the ones that have the lowest mean-squared error on the held-out test data for each of the six outcome variables. In other words, there will be one prize for the submission that performs best for grit, and there will be another prize for the submission that performs best for grade point average, and so on.

All prize winners will be invited to participate in the post-Challenge scientific workshop at Princeton University, and we will cover all travel expenses for invited participants. If the prize-winning submission is created by a team, we will cover all travel expenses for one representative from that team.

We look forward to seeing the submissions.

Getting started workshop at PAA

Uncategorized No comments

The Fragile Families Challenge is excited to host a getting started workshop at the Annual Meeting of the Population Association of America in Chicago!

We will

  • Present a few slides introducing the Challenge (SLIDES HERE)
  • Provide food and a friendly collaborative environment
  • Work together to produce your first submission!

When: 10am – 2pm, Thursday, April 27
Where: Hilton Chicago, Conference Room 4G (DIRECTIONS: Come to the 4th floor and we’re the room way down at the end.)
Who: You! Anyone involved in social science and/or data science can make an important contribution.
RSVP: Mention you’re coming to our PAA workshop when you apply to participate!

We hope to see you there!

Code Validation

Uncategorized No comments

As part of the challenge, we’re interested in understanding and learning from the strategies participants are using to predict outcomes in the Fragile Families data. One major goal of the challenge is to learn how these strategies evolve and develop over time. We think that a more systematic understanding of how social scientists and data scientists think with data has the potential to better inform how statistical analysis is done. To do this analysis, we use the code and narrative analysis included with each submission.

Recently, we updated the code that evaluates predictions to ensure that groups don’t forget to include their code in their submissions.

What does this mean for me?

  • Make sure that your directory contains all of the code you used to generate your predictions.
  • It’s not a problem if the code is in multiple/un-executable scripts. When we look over code submissions, we don’t execute the code.
  • If you run into an error when you submit your predictions that says you’ve forgotten your code, but your submission does actually contain the code you’ve been using, let us know as soon as possible!

Reading survey documentation

Uncategorized No comments

The Fragile Families survey documentation can be confusing. We’ve put together this blog post so you can find out what the variables in the Challenge data file mean.

Using the Fragile Families website

The first place to go to find out what a given variable represents is the Fragile Families and Child Wellbeing Study website: http://www.fragilefamilies.princeton.edu/

Once there, click the “Data and Documentation” tab.

This brings you to the main documentation for the full study. On the left, you will see a set of links that will take you to the documentation for particular waves of the data.

Clicking on the link for Year 9 (Wave 5) as an example, we see the following page of documentation for this survey.

Let’s look at the mother questionnaire and codebook. On page 5 of the questionnaire, you will see the following question:

In the corresponding codebook, we see the count of respondents who gave each answer:

Two things are worth noting here.

  1. The question referred to in the questionnaire as A3B is called m5a3b in the codebook. This is because the prefix “m5” indicates that this question comes from the mother wave 5 interview.
  2. Lots of people were coded -6 for “Skip.” Looking back at the questionnaire, we can see why they were skipped over this question: it was only asked of those for whom “PCG = NONPARENT AND RELATIONSHIP = FOSTER CARE.” For children not in foster care, this question would not be meaningful, so it wasn’t asked.

In general, the questionnaires are the best source for information about why certain respondents get skipped over questions. For more information on all the ways data can be missing, see our blog post on missing data.

Structure of the variable names

The general structure of the variable names is [prefix for questionnaire type][wave number][question number].
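As a rough illustration, this naming scheme can be unpacked with a short script. The helper below is hypothetical, covers only the common single-letter prefixes plus hv and kind_ (not the ffcc_ family, which follows its own pattern), and makes no attempt to validate question numbers:

```python
import re

def parse_variable(name):
    """Split a Fragile Families variable name into its parts.

    Assumes the structure [optional c][prefix][wave number][question number],
    e.g. "m5a3b" or the constructed variable "cm1ethrace".
    """
    match = re.match(r"^(c?)(kind_|hv|m|f|h|p|k|t)(\d)([a-z].*)$", name)
    if match is None:
        return None
    constructed, prefix, wave, question = match.groups()
    return {
        "constructed": constructed == "c",  # built from several questions
        "prefix": prefix,                   # who answered (m = mother, ...)
        "wave": int(wave),                  # wave number, not child age
        "question": question,               # question number within the wave
    }

print(parse_variable("m5a3b"))
# {'constructed': False, 'prefix': 'm', 'wave': 5, 'question': 'a3b'}
```

Note that the wave digit is a wave number, not a child age; the mapping between the two is given below.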

What are all the variable prefixes?

The most common prefixes are:

  • m: Mother
  • f: Father
  • h or hv: Home visit
  • p: Primary caregiver
  • k: Kid (interview with the child)
  • kind_: Kindergarten teacher
  • t: Teacher
  • ffcc_[something]: Child care surveys. For a full list of the [something], see this documentation.

Constructed variables: An additional prefix

Some variables have been constructed based on responses to several questions. These are often variables that are particularly relevant to the models many researchers want to estimate. Constructed variables add an additional prefix, c, to the front of the variable name. For instance, cm1ethrace indicates the constructed measure of the mother’s race/ethnicity from wave 1.

What are the wave numbers?

It’s easy to talk about the questionnaires by the rough child ages at which they were conducted. This is how the documentation website is organized. However, the variable names always refer to wave numbers, not child ages. It’s important not to get confused on this point. The table below summarizes the mapping between wave numbers and approximate child ages.

  • Wave 1: age 0, often called “baseline”
  • Wave 2: age 1
  • Wave 3: age 3
  • Wave 4: age 5
  • Wave 5: age 9

What are the question numbers?

Question numbers typically begin with a letter and a number, e.g., a3.

  • In questionnaires, questions are referred to by question number alone.
  • In codebooks, questions are referred to by a prefix and then a question number.

How do I find a question I care about?

You might want to find a particular question. For instance, when modeling eviction or material hardship at age 15, you might want to include the same measures collected at age 9. If you ctrl+F or cmd+F for “evicted” in the mother or father codebook or questionnaire at age 9, you will find these variables. In this case, they are m5f23d and f5f23d.

GPA

Uncategorized No comments

GPA measures academic achievement.

We want to know:

  • What helps disadvantaged children to beat the odds and succeed academically?
  • What derails children so that they perform unexpectedly poorly?

Survey question

How we cleaned the data

Our measure of GPA is self-reported by the child at approximately age 15. We marked GPA as NA for children who were not interviewed, as well as for children who, in any of the four subjects, reported no grade, refused to answer, did not know, or were homeschooled. For children with valid answers, we averaged the responses for all four subjects, then subtracted this number from 5 to produce an estimate of child GPA ranging from 1 to 4. In our re-coded variable, a GPA of 4.0 indicates that the child reported straight As, while a GPA of 1.0 indicates that the child reported getting all grades of D or lower.
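The recoding rule can be sketched in a few lines. This is an illustration, not our cleaning script: it assumes raw answer codes of 1 (mostly A’s) through 4 (mostly D’s or lower), with anything outside that range (missing codes, homeschooling, etc.) invalidating the measure.

```python
# Minimal sketch of the GPA recoding described above. The raw answer
# codes (1 = mostly A's ... 4 = mostly D's or lower) are assumptions
# for illustration; real missing-data codes are negative numbers.
def recode_gpa(english, math, history, science):
    """Average four self-reported subject grades and reverse the scale."""
    answers = [english, math, history, science]
    if any(a not in (1, 2, 3, 4) for a in answers):
        return None  # NA: no grade, refused, don't know, homeschooled, or no interview
    return 5 - sum(answers) / len(answers)

# A child reporting straight A's (code 1 in every subject) gets GPA 4.0:
print(recode_gpa(1, 1, 1, 1))  # 4.0
```

Subtracting from 5 reverses the scale so that higher values mean better grades, matching the familiar 4.0 GPA convention.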

Distribution in the training set

Scientific motivation

Helping kids “beat the odds” academically is a fundamental goal of education research; academic success can be the key to breaking the cycle of poverty. Free public education is often referred to as a great equalizer, yet children who grow up in disadvantaged families consistently underperform their more affluent peers on average.

However, the average is not the whole story. Some kids do well despite being expected to do poorly. In fact, the amount of unexplained variation in educational achievement is enormous: social science models typically have R-squared values of 0.2 or less. The poor predictive performance of social science models of educational attainment has long been known. In the now-classic 1972 book Inequality: A Reassessment of the Effect of Family and Schooling in America, Harvard social scientist Christopher Jencks argued that random chance played a larger role than measured family background characteristics in determining socioeconomic outcomes.

While social scientists have learned some about what helps children succeed academically in the decades since 1972, a huge proportion of the variance remains buried in the error term of regression models. Is this term truly random chance, or is there “dark matter” out there in the form of unmeasured but important variables that help some kids to beat the odds?

By submitting a model for GPA at age 15, you help us in our quest to find this dark matter. Based on our collaborative model combining all of the individual submissions, we will identify our best guess as a scientific community about how children are expected to perform at age 15. Then, we will identify a subset of children performing much better and worse than expected. We will interview these children to answer the question: what unmeasured variables are common to the kids who are beating the odds, which we do not observe among the children who are struggling unexpectedly?

When you participate, you help us target interviews at the children whose outcomes are least well explained by our measured variables. These children are best positioned for exploratory qualitative research to uncover unmeasured but important factors. Interviews may help us learn how some kids beat the odds; these results may drive future deductive research to evaluate the causal effects of these unmeasured variables; and ultimately we hope that policymakers can intervene on the “dark matter” we find in order to improve the lives of other disadvantaged children in the future.

Grit

Uncategorized No comments

Grit is a measure of passion and perseverance. It predicts success in many domains. The causes of grit remain unknown.

We want to know: What makes some kids unexpectedly grittier than others in adolescence?

Survey questions

The survey questions are adapted from the grit scale proposed by Duckworth, Peterson, Matthews, and Kelly (2007).

How we cleaned the data

Our measure of grit is based on the four questions above, as answered by the child at approximately age 15. These items were part of a longer battery of questions capturing a wider range of attitudes, emotions, and outlooks. Children who refused any of the four questions or didn’t know how to answer were coded as NA, as were children who did not complete the age 15 interview. For children with four valid answers, we averaged the answers and subtracted the result from 5. This created a continuous scale ranging from 1 to 4. The way we have recoded it, a high score on our variable indicates more grit.

Distribution in the training set

Scientific motivation

Do you keep working when the going gets tough? If so, you probably have a lot of grit.

University of Pennsylvania psychologist and MacArthur “Genius” award winner Angela Duckworth has found that grit predicts all kinds of measures of success: persistence through a military training program at West Point, advancement through the Scripps National Spelling Bee, and educational attainment, to name a few. Duckworth’s work has reached the general public through her TED talk and NY Times bestseller Grit: The Power of Passion and Perseverance.

While it is clear that grit predicts success, it is less clear what causes some people to be grittier than others. How can we help more disadvantaged children to exhibit grit?

A few researchers have begun to examine this question. In their book Coming of Age in the Other America, social scientists Stefanie DeLuca (Johns Hopkins University), Susan Clampet-Lundquist (St. Joseph’s University), and Kathryn Edin (Johns Hopkins University) argue that kids growing up in impoverished urban neighborhoods are often inspired to have grit when they develop passion for an “identity project”: a personal passion that gives them something to aspire toward beyond the challenges of the present day. This ethnographic work exemplifies how qualitative social science research may be able to uncover previously unmeasured sources of grit.

How much more could we learn if qualitative interviews were targeted at the kids best positioned to be informative about unmeasured sources of grit? By participating, you can help us build a community model for grit measured in adolescence. The combined submissions of all who participate will identify our common agreement about the amount of grit we expect to see in the Fragile Families respondents, given all of their childhood experiences from birth to age 9. By interviewing children who have much more or much less grit than we all expect, we will uncover unmeasured factors that predict grit. It is our hope that these unmeasured factors can inform future deductive evaluations and ultimately policy interventions to help kids break the cycle of poverty by developing grit.

Grit is an important predictor of success, but the causes of grit are largely unknown. Be part of the solution and help us target interviews toward those best positioned to show us these unmeasured sources of grit. Apply to participate, build a model, and upload your contribution.

Material hardship

Uncategorized No comments

Material hardship is a measure of extreme poverty.

We want to know:

  • What helps families to unexpectedly escape extreme poverty?
  • What leads families to fall into extreme poverty unexpectedly?

Survey questions

How we cleaned the data

These questions were asked of the child’s primary caregiver when the child was approximately age 15. We marked material hardship as NA for children whose caregivers did not participate in the survey, didn’t know the answer to one or more questions, or refused one or more questions. Our material hardship measure is the proportion of these 11 questions to which the child’s caregiver answered “Yes.” Material hardship ranges from 0 to 1, with higher values indicating more material hardship.
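The proportion-of-yes-answers rule can be sketched as follows. The answer codes (1 = yes, 2 = no, negatives = missing) are assumptions for illustration:

```python
# Minimal sketch of the material hardship score described above:
# the share of the 11 yes/no hardship questions answered "Yes".
# Codes assumed: 1 = yes, 2 = no, anything else = missing.
def material_hardship(answers):
    """Return the proportion of 'Yes' answers, or None if any item is missing."""
    if len(answers) != 11:
        raise ValueError("expected answers to all 11 hardship questions")
    if any(a not in (1, 2) for a in answers):
        return None  # NA: refusal, don't know, or caregiver not interviewed
    return sum(1 for a in answers if a == 1) / len(answers)

# A caregiver reporting one hardship out of 11 scores 1/11:
print(material_hardship([1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]))
```

Because the score is a proportion, it is naturally bounded between 0 (no hardships reported) and 1 (all 11 hardships reported).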

Distribution in the training set

Scientific motivation

In his 1964 State of the Union Address, President Lyndon B. Johnson declared an “all-out war on human poverty and unemployment in these United States.” In the decades since, America has taken great strides toward this goal. However, severe deprivation remains a problem today. In $2 a Day: Living on Almost Nothing in America, Johns Hopkins sociologist Kathryn Edin and University of Michigan social work professor H. Luke Schaefer bring us into the lives of American families living in the nightmare of extreme poverty.

What can be done to reduce extreme poverty? By identifying families who unexpectedly escape extreme poverty, as well as those who unexpectedly fall into it, we hope to uncover unmeasured but important factors that affect severe deprivation.

Measuring extreme poverty is hard. The material hardship scale was originally proposed in a 1989 paper by Susan Mayer and Christopher Jencks, then social scientists at Northwestern University. Rather than focusing solely on respondent’s incomes, Mayer and Jencks asked respondents about particular needs that they were unable to meet. This scale proved fruitful and captured a dimension of poverty above and beyond what was captured by income alone. With minor modifications, the material hardship scale became a standard measure in the federal Survey of Income and Program Participation (SIPP), and it has been included in several waves of the Fragile Families Study.

By participating, you help us to identify the level of material hardship that is expected at age 15 for each of the families in the Fragile Families Study. By combining all of the submissions in one collaborative model, we will produce the best guess by the scientific community of the experiences we expect for families at age 15. Undoubtedly, some families will report much more or much less material hardship than we expect. By interviewing these families, we hope to discover unmeasured but important factors that are associated with sudden dives into material hardship or unexpected recoveries.

The results of these exploratory interviews can then inform future deductive social science research and help us propose policies that could help families to escape severe deprivation. You can help us to target these interviews at the families best positioned to help. Be a part of the solution: apply to participate, build a model, and upload your contribution.

Eviction

Uncategorized No comments

Eviction is a traumatic experience in which families are forced from their homes for not paying the rent or mortgage.

We want to know: As children transition into adulthood, does eviction cause negative outcomes?

Survey question

When children were about 15 years old, each child’s primary caregiver was asked the following question:

How we cleaned the data

Those who did not participate in the age 15 interview, as well as those who refused (-1) or didn’t know (-2), were coded as NA. Those who responded “Yes” were coded 1, and those who responded “No” were coded 0. We additionally coded as 1 a small group of respondents who answered in a previous question that they were evicted in the past year, and thus were skipped over this question.
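The recode, including the special handling of respondents skipped because of an earlier eviction report, can be sketched like this. The refuse (-1) and don’t know (-2) codes come from the post; the skip code (-6) and yes/no codes (1/2) are assumptions for illustration:

```python
# Minimal sketch of the eviction recoding described above.
def recode_eviction(answer, evicted_in_prior_question=False):
    """Return 1 for an eviction, 0 for none, None for NA."""
    if answer == 1:
        return 1   # "Yes"
    if answer == 2:
        return 0   # "No"
    if answer == -6 and evicted_in_prior_question:
        return 1   # skipped here, but a prior question already recorded an eviction
    return None    # refused (-1), don't know (-2), or not interviewed
```

For example, `recode_eviction(-6, evicted_in_prior_question=True)` returns 1, capturing the small group who reported an eviction in an earlier question and were therefore skipped over this one.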

Distribution in the training set

Scientific motivation

In the New York Times bestseller Evicted: Poverty and Profit in the American City, Harvard sociologist and MacArthur “Genius” award winner Matthew Desmond describes fieldwork in which he spent several years living alongside tenants being evicted in low-income Milwaukee neighborhoods. Desmond helped tenants move their things into trucks, followed landlords into eviction court, and watched as children moved from school to school while their families searched for housing. Eviction literally uproots families from their homes, and it is most prevalent among the most disadvantaged urban families. Given Desmond’s qualitative account, it is plausible that eviction may have substantial negative effects on child outcomes in early adulthood.

Emerging evidence further suggests that eviction is sufficiently prevalent to warrant policy attention. Researchers at the Federal Reserve Bank of Atlanta have examined administrative records to find that 12.2 percent of rental households were evicted and forcibly displaced in 2015 in Fulton County, GA (Raymond et al. 2016). Likewise, the Milwaukee Area Renters Study found that 13 percent of private renters experienced a forced move during the 2 years referenced in a survey questionnaire (Desmond and Schollenberger 2015). If eviction creates disadvantage for children, it is sufficiently prevalent to have wide-reaching impacts.

However, untangling cause from selection is no simple task (see our blog post on causal inference and this interview with Matthew Desmond on the topic). It is easy to show that children who experience an eviction have worse outcomes later in life; it is hard to show that these outcomes are not caused by other factors that are correlated with eviction. In a quantitative study using propensity score matching methods on earlier waves of the Fragile Families and Child Wellbeing Study, Desmond and Kimbro (2015) find that eviction is associated with negative outcomes, net of obvious sources of selection bias.

We applaud the work of all the individual research teams that have placed eviction on the table as a scientific concept of interest. However, any individual research team can only adjust for a selected group of observed covariates, and results can be sensitive to the set chosen. We ask you to contribute a model for the probability that a child experiences an eviction between the age 9 and age 15 interviews of the Fragile Families and Child Wellbeing Study, given any set of the birth to age 9 characteristics you choose to include, and any statistical model you choose to employ. Together, we will produce a collaborative propensity score model that the entire scientific community can agree upon, which is not sensitive to researcher decisions. We will then interview a subset of children who are matched on the propensity score, to assess the plausibility of the conditional ignorability assumption required for causal inference (see our blog post on causal inference). If the interviews suggest that causal inference may be warranted, we will use these collaborative propensity scores to estimate the causal effect of eviction on child outcomes to be measured several years from now, when children are approximately 22 years old.

In summary, this research agenda will produce estimates of the effect of adolescent eviction on attainment during the transition to adulthood. These collaborative estimates will be robust to the decisions of individual researchers. The assumptions needed for causal inference will be validated in qualitative interviews. These steps will maximize the validity of causal inference in the absence of a randomized experiment.

To achieve these goals, we need your help. Apply to participate, build a model, and upload your contribution!

Layoff

Uncategorized No comments

Being laid off is a sudden and often unexpected experience with potentially detrimental consequences for one’s family.

We want to know: When a caregiver is laid off, do adolescent children suffer collateral damage?

Survey question

When children were about 15 years old, each child’s primary caregiver was asked the following question:

How we cleaned the data

Those who did not participate in the age 15 interview, as well as those who refused (-1) or didn’t know (-2), were coded as NA. Those who have never worked or have not worked since the age 9 interview (in approximately the prior 6 years) were coded as NA; these respondents are not at risk for a layoff. Those who responded “Yes” were coded 1, and those who responded “No” were coded 0.

Distribution in the training set

Scientific motivation

A steady job can provide financial security to a family. However, this security can be upset by plant closures, downsizing, and other economic shifts that lead caregivers to lose their jobs. In addition, some caregivers may be fired but report in a survey that they were laid off. In any case, the layoff of a caregiver could create dramatic disadvantages for adolescents nearing the transition to adulthood.

Social scientists worry about layoffs because precarious work is on the rise. In Good Jobs, Bad Jobs, University of North Carolina sociologist Arne L. Kalleberg outlines economic shifts that have made steady employment harder to come by in the United States over the past several decades. Gone are the days when workers could count on a single job to carry them throughout their careers – job changes and unexpected unemployment are now commonplace.

Social scientists also worry about layoffs because they may negatively influence child achievement. Sociologists Jennie E. Brand (UCLA) and Juli Simon Thomas (Harvard) have shown in an article published in the American Journal of Sociology that maternal job displacement reduces a child’s chances of high school and college completion by 3 – 5 percentage points, with even larger effects among those unlikely to experience job displacement and those whose mothers experienced job displacement while the child was an adolescent. When caregivers lose their jobs, children suffer collateral damage.

However, causal conclusions always depend on modeling assumptions. The propensity score matching methods used in the paper cited above assume that the model for the probability of job displacement is correctly specified, and that there are no unmeasured variables that affect job displacement and also directly affect child outcomes. To learn more on these assumptions, see our blog post on causal inference.

The Fragile Families Study follows a particularly disadvantaged sample of urban children, for whom we would especially like to know the effect of maternal layoff on adult outcomes. By participating, you help us to produce a collaborative propensity score model that combines the best of all the individual submissions into a single metric that is robust to the modeling decisions of individual researchers. This model will also help us target interviews at the children best positioned to lend suggestive evidence about the plausibility of the untestable conditional ignorability assumption required for causal inference. If this assumption seems credible after interviews, we will use our collaborative propensity scores to estimate the causal effect of caregiver layoff on child outcomes in early adulthood, once those outcomes are measured several years from now.

By participating, you can be part of extending our body of knowledge, providing maximally robust causal evidence from observational data about the effect of caregiver layoffs on child outcomes in a disadvantaged urban sample. Results will inform policy decisions about whether support for steady caregiver employment could help disadvantaged children.

Be a part of the solution. Apply to participate, build a model, and upload your contribution.

Job training

Uncategorized No comments

Policymakers often propose programs to retrain the workforce to be able to contribute in a 21st century economy.

We want to know: Do job skills programs utilized by caregivers yield collateral benefits for disadvantaged children?

Survey question

When children were about 15 years old, each child’s primary caregiver was asked the following question:

How we cleaned the data

Those who did not participate in the age 15 interview, as well as those who refused (-1) or didn’t know (-2), were coded as NA. Those who responded “Yes” were coded 1, and those who responded “No” were coded 0.

Distribution in the training set

Scientific motivation

One way to raise people’s standard of living is to raise their human capital: the skills that promote productive participation in the labor force. Human capital investments are perhaps more important now than ever before given rapid globalization and computerization of the economy. Does participation in job training programs designed to build computer, language, or other skills improve the well-being of families? When caregivers participate in these programs, do children benefit indirectly?

Social scientists have long been interested in policy interventions to promote employment. This research has also been closely tied to the development of statistical methods for causal inference with observational data. In the 1970s, the National Supported Work Demonstration (NSW) randomly assigned some disadvantaged, non-employed workers to a job training program that included guaranteed employment for a short period of time. Others were randomly assigned to a control condition. The treatment led to measurable increases in earnings in subsequent years, suggesting that job training might be useful.

University of Chicago economist Robert LaLonde saw a new use for these data. Given that experimental results provided the “true” causal effect of job training on earnings, LaLonde wanted to know whether econometric techniques that statistically adjust for selection bias could recover this “true” effect in a non-experimental setting. In general, these statistical adjustments failed to recapture the “true” effect, and LaLonde’s 1986 paper became highly cited as evidence of the extreme difficulty of drawing causal inferences from observational data.

However, the story did not end there. About the same time, a pair of statisticians developed a new method for identifying causal effects: propensity score matching. In an enormously influential 1983 paper, Paul R. Rosenbaum (then of the University of Wisconsin) and Donald B. Rubin (then of the University of Chicago) showed that the average causal effect of a binary treatment on an outcome could be identified by matching treated units with untreated units who had similar probabilities of treatment given observed pre-treatment characteristics. The Rosenbaum and Rubin theorem held only in a sufficiently large sample and only when one estimated the propensity score correctly without omitting any important variables that might affect the treatment and directly affect the outcome. Despite these limitations, the key idea stuck: under certain assumptions, one can use observational data to try to re-create the type of data one would get in a randomized experiment where background characteristics no longer determine treatment assignment.
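The matching step at the heart of this idea can be sketched in a few lines. This toy example assumes propensity scores have already been estimated (e.g., by a logistic regression of treatment on pre-treatment covariates); it pairs each treated unit with the nearest available control, without replacement:

```python
# Toy illustration of nearest-neighbor propensity score matching.
# Scores are assumed to come from a separately fitted model; the unit
# IDs and score values below are made up.
def nearest_neighbor_match(treated, control):
    """treated/control: lists of (unit_id, propensity_score) pairs.

    Returns a list of (treated_id, matched_control_id) pairs, matching
    without replacement in the order treated units are given.
    """
    available = list(control)
    pairs = []
    for t_id, t_score in treated:
        if not available:
            break
        # pick the control unit with the closest propensity score
        best = min(available, key=lambda c: abs(c[1] - t_score))
        available.remove(best)
        pairs.append((t_id, best[0]))
    return pairs

print(nearest_neighbor_match([("t1", 0.8), ("t2", 0.3)],
                             [("c1", 0.25), ("c2", 0.75), ("c3", 0.5)]))
# [('t1', 'c2'), ('t2', 'c1')]
```

Comparing outcomes within matched pairs then approximates a treated-versus-control contrast among units with similar treatment probabilities, which is the sense in which matching tries to re-create an experiment from observational data.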

Empowered with propensity scores, two other statisticians reassessed LaLonde’s findings: could propensity score methods recover the experimental benchmark in the job training example? Raheev H. Dehejia (then of Columbia University) and Sadek Wahba (then of Morgan Stanley) found that they could. In two highly-cited papers (paper 1 and paper 2), they demonstrated that propensity score methods came much closer to recovering the experimental truth than the econometric approaches used by LaLonde.

The saga of job training and causal inference has continued to the present day. For instance, a 2002 paper by economists Jeffrey Smith (then of the University of Maryland) and Petra Todd (University of Pennsylvania) demonstrated that propensity score methods can be highly sensitive to researcher decisions. Since then, numerous statisticians and social scientists have used the job training example to demonstrate the usefulness of new matching methods: entropy balancing (Hainmueller 2012), genetic matching (Diamond and Sekhon 2013), and the covariate balancing propensity score (Imai and Ratkovic 2014), to name a few.

Be part of the next step

Clearly there is a lot of interest in human capital formation through job training. There is also interest in methods to infer causal effects from observational data. How does the Fragile Families Challenge fit in?

A slightly different treatment

The LaLonde (1986) paper and subsequent studies focused on an intensive job training program that connected non-employed individuals with jobs. The “treatment” variable which you will predict is much milder: participation in any classes to improve job skills, such as computer training or literacy classes. Respondents who enroll in these classes are not necessarily non-employed.

A robust propensity score model

One piece of conventional wisdom about propensity score methods is that one should be careful about selecting the pretreatment variables to include in the model, and one must model their relationship to the treatment variable appropriately. This is where you can help! Together we will build a highly robust community model for the probability of job training. This community model will take all of our best ideas and create one product on which we can all agree.

Specifying models before outcomes occur

A second piece of conventional wisdom of propensity score modeling is that it allows one to conduct all modeling and matching before even looking at the outcome variable. In our case, the ultimate outcome variables are not yet measured: we will examine the effect of caregiver job training on child outcomes in early adulthood. These outcomes will be measured several years from now, long after we lock in our community propensity score model.

Evaluating assumptions

All covariate-adjustment methods to draw causal inferences from observational data rely on the assumption of conditional ignorability (for more about this assumption, see our blog post about causal inference). Through targeted interviews with caregivers, we can provide suggestive evidence as to whether the conditional ignorability assumption holds.

You can help

Be a part of the next step in observational causal inference to evaluate the effect of job training programs. Apply to participate, build a model, and upload your contribution.

Blog posts

Uncategorized 3 comments

In addition to the general Fragile Families documentation, the following blog posts provide more details about the data and the scientific goals of the project.

Weekly office hours

Uncategorized No comments

From 3:30-4:30pm Eastern Daylight Time every Wednesday, one of us will be at the computer to answer your questions. At those times, please video call us via Google Hangout at fragilefamilieschallenge@gmail.com.

For more immediate feedback from the full community of users, post on our discussion forum for the Fragile Families Challenge.

For concerns you do not wish to share with the entire community, you can also contact us privately.

Discovering unmeasured factors


Beating the odds

Despite coming from disadvantaged backgrounds, some kids manage to “beat the odds” and achieve unexpectedly positive outcomes. Meanwhile, other kids who seem on track sometimes struggle unexpectedly. Policymakers would like to know what variables are associated with “beating the odds” since this could generate new theories about how to help future generations of disadvantaged children.

Once we combine all of the submissions to the Fragile Families Challenge into one collaborative guess for how children will be doing on each outcome at age 15, we will identify a small number of children doing much better than expected (“beating the odds”), and another set who are doing much worse than expected (“struggling unexpectedly”). By interviewing these sets of children, we will be well-positioned to learn what factors were associated with who ended up in each group.
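The selection step described above can be sketched in a few lines. The numbers here are simulated, but the logic is the one in the paragraph: compare observed outcomes with the collaborative predictions and flag the largest positive and negative residuals.

```python
import numpy as np

# Hypothetical ensemble GPA predictions and observed outcomes for 100 children.
rng = np.random.default_rng(1)
predicted = rng.uniform(2.0, 4.0, size=100)
observed = predicted + rng.normal(0, 0.5, size=100)

# Positive residual = doing better than the collaborative model expected.
residual = observed - predicted
order = np.argsort(residual)
beating_the_odds = order[-5:]   # doing much better than expected
struggling = order[:5]          # doing much worse than expected
```

In the actual Challenge, the predictions would come from the combined community model and the flagged children would be invited for interviews.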

What we learn in these interviews will affect the questions asked in future waves of the Fragile Families Study, and possibly other studies like it. By combining quantitative models with inductive interviews, the Fragile Families Challenge offers a new way to improve surveys in the future and expand the range of social science theories. In the remainder of this blog, we discuss current approaches to survey design and the potential contribution of the Fragile Families Challenge.

Deductive survey design: Evaluating theories

Social scientists often design surveys using deductive approaches based on theoretical perspectives. For instance, economists theorize about how one’s employment depends on the hypothetical wage offer (often called a “reservation wage”) one would have to be given before one would leave other unpaid options behind and opt into paid labor. Motivated by this theoretical perspective, Fragile Families and other surveys have incorporated questions like: “What would the hourly wage have to be in order for you to take a job?”

However, even the best theoretically-informed social science measures perform poorly at the task of predicting outcomes. R-squared, a measure of a model’s predictive validity, often ranges from 0.1 to 0.3 in published social science papers. Simply put, a huge portion of the variance in outcomes we care about is unexplained by the predictors social scientists have invented and put their faith in.

Inductive interviews: A source of new hypotheses

How can we be missing so much? Part of the problem might be that academics who propose these theoretical perspectives often spend their lives far from the context in which the data are actually collected. An alternative, inductive approach is to conduct open-ended interviews with interesting cases and allow the theory to emerge from the data. This approach is often used in ethnographic and other qualitative work, and points researchers toward alternative perspectives they never would have considered on their own.

Inductive approaches have their drawbacks: researchers might develop a theory that works well for some children, but does not generalize to other cases. Likewise, the unmeasured factors we discover will not necessarily be causal. However, inductive interviews will generate hypotheses that can be later evaluated using deductive approaches in new datasets, and finally evaluated with randomized controlled trials.

An ideal combination: Cycling between the two

To our knowledge, the Fragile Families Challenge is the first attempt to cycle between these two approaches. The study was designed with deductive approaches: researchers asked questions based on social science theories about the reproduction of disadvantage. However, we can use qualitative interviews to inductively learn new variables that ought to be collected. Finally, we will incorporate these variables in future waves of data collection to deductively evaluate theories generated in the interviews, using out-of-sample data.

By participating in the Fragile Families Challenge, you are part of a scientific endeavor to create the surveys of the future.

Missing data


This blog post

  1. discusses how missing data is coded in the Fragile Families study
  2. offers a brief theoretical introduction to the statistical challenges of missing data
  3. links to software that implements one solution: multiple imputation

Of course, you can use any strategy you want to deal with missing values: multiple imputation is just one strategy among many.

Missing data in the Fragile Families study

Missing data is a challenge in almost all social science research. It generally comes in two forms:

  1. Item non-response: Respondents simply refuse to answer a survey question.
  2. Survey non-response: Respondents cannot be located or refuse to answer any questions in an entire wave of the survey.

While the first problem is common in any dataset, the second is especially prevalent in panel studies like Fragile Families, in which the survey is composed of interviews conducted at various child ages over the course of 15 years.

While the survey documentation details the codes for each variable, a few global rules summarize the way missing values are coded in the data:

  • -9 Not in wave – Did not participate in survey/data collection component.
  • -8 Out of range – Response not possible; rarely used.
  • -7 Not applicable (also -10/-14) – Rarely used for survey questions.
  • -6 Valid skip – Question intentionally not asked; it does not apply to the respondent, or the response is known from prior information.
  • -5 Not asked (invalid skip) – Respondent was not asked the question in the version of the survey they received.
  • -3 Missing – Data are missing for some other reason; rarely used.
  • -2 Don’t know – Respondent was asked the question and responded “Don’t know.”
  • -1 Refuse – Respondent was asked the question and refused to answer.

When responses are coded -6, you should look at the survey questionnaire to determine the skip pattern. What did these respondents tell us in prior questions that caused the interviewer to skip this question? You can then decide the correct way to code these values given your modeling approach.

When responses are coded -9, you should be aware that many questions will be missing for this respondent because they missed an entire wave of the survey.

For most other categories, an algorithmic solution as described below may be reasonable.
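One simple algorithmic starting point is a blanket recode of the negative codes to missing values. This is only a starting point: as noted above, -6 ("valid skip") often deserves case-by-case handling based on the skip pattern. The variable names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# The negative codes documented above (including the rare -10/-14 variants).
MISSING_CODES = [-9, -8, -7, -6, -5, -3, -2, -1, -10, -14]

# A toy data frame with invented values for two Fragile Families variables.
df = pd.DataFrame({"m1a4": [2, -9, -6, 1],
                   "cm4b_age": [0.9, 1.1, -2.0, 1.0]})

# Replace every missing code with NaN, leaving valid responses untouched.
recoded = df.replace(MISSING_CODES, np.nan)
```

After this recode, standard missing-data tools such as the imputation packages discussed below can operate on the NaN values.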

Theoretical issues with missing data

Before analyzing data with missing values, researchers must make assumptions about how some data came to be missing. The most common assumption is that data are missing at random. For this assumption to hold, the pattern of missingness must be a function of other variables in the dataset, and not of any unobserved variables once the observed ones are taken into account.

For instance, suppose children born to unmarried parents are less likely to be interviewed at age 9 than those born to married parents. Since the parents’ marital status at birth is a variable observed in the dataset, it is possible to adjust statistically for this problem. Suppose, on the other hand, that some children miss the age 9 interview because they suddenly had to leave town to attend the funeral of their second cousin once removed. This variable is not in the dataset, so no statistical adjustment can fully account for this kind of missingness.

For a full theoretical treatment, we recommend

One solution: Imputation

Once we assume that data are missing at random, a valid approach to dealing with the missing data is imputation. This is a procedure whereby the researcher estimates the association between all of the variables in the model, then fills in (“imputes”) reasonable guesses for the values of the missing variables.

The simplest version is single imputation: an algorithm fills in one best guess for each missing value, producing one complete dataset that can be analyzed like any other. However, single imputation fails to account for our uncertainty about the true values of the missing cases.

Multiple imputation is a procedure that produces several datasets (often between 5 and 30), with slightly different imputed values for the missing observations in each. Differences across the datasets capture our uncertainty about the missing values. One can then estimate a model on each imputed dataset and combine the estimates across datasets using a procedure known as Rubin’s rules.
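Rubin's rules themselves are short enough to sketch directly. Suppose a coefficient has been estimated on each of m imputed datasets; the pooled estimate is the mean of the point estimates, and the total variance adds the between-imputation spread to the average within-imputation variance. The numbers below are hypothetical.

```python
import numpy as np

# Point estimates and squared standard errors from m = 5 imputed datasets.
estimates = np.array([1.02, 0.98, 1.05, 0.95, 1.00])
variances = np.array([0.040, 0.042, 0.039, 0.041, 0.040])
m = len(estimates)

pooled = estimates.mean()                   # combined point estimate
within = variances.mean()                   # average within-imputation variance
between = estimates.var(ddof=1)             # between-imputation variance
total_var = within + (1 + 1 / m) * between  # Rubin's total variance
pooled_se = np.sqrt(total_var)
```

Note that the pooled standard error is always at least as large as the average within-imputation standard error: the extra term reflects uncertainty about the missing values themselves.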

Ideally, one would conduct multiple imputation on a dataset with all of the observed variables. In practice, this can become computationally intractable in a dataset like Fragile Families with thousands of variables. Instead, researchers often select the variables to be included in their model, restrict the data to only those variables, and then multiply impute missing values in this subset.

Implementing multiple imputation

There are many software packages to implement multiple imputation. A few are listed below.

In R, we recommend Amelia (package home, video introduction, vignette, documentation) or MICE (package home, introductory paper, documentation). Depending on your implementation, you may also need mitools (package home,vignette, documentation) or Zelig (website) to combine estimates from several imputed datasets.

In Stata, we recommend the mi set of functions as described in this tutorial.

In SPSS, we recommend this tutorial.

In SAS, we recommend this tutorial.

This set is by no means exhaustive. One curated list of software implementations is available here.

Evaluating submissions


We will evaluate submissions based on predictive validity, measured in the held-out test data by mean squared error loss for continuous outcomes and Brier loss for binary outcomes.
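Both criteria are squared-error losses, which makes them easy to sketch: mean squared error for the continuous outcomes, and Brier loss (mean squared error applied to predicted probabilities) for the binary ones. The values below are invented for illustration.

```python
import numpy as np

def squared_error_loss(y_true, y_pred):
    """Mean squared error; equals the Brier loss when y_pred are probabilities."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

# Continuous outcome (e.g. GPA): mean squared error.
gpa_score = squared_error_loss([3.0, 2.5], [2.8, 2.7])

# Binary outcome (e.g. eviction): Brier loss on predicted probabilities.
brier_score = squared_error_loss([1, 0, 0], [0.7, 0.2, 0.1])
```

Lower is better for both losses, so a submission that predicts probabilities near the true 0/1 outcomes and values near the true continuous outcomes will rank highly.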

A leaderboard will rank submissions according to these criteria, using a set of held-out data. After the challenge closes, we will produce a finalized ranking of submissions based on a separate set of withheld true outcome data.

Each of the 6 outcomes will be evaluated and ranked independently – feel free to focus on predicting one outcome well!

What does this mean for you?

You should produce a submission that performs well out of sample. Mean squared error is a function of both bias and variance. A linear regression model with lots of covariates is an unbiased predictor, but it might overfit the data and produce predictions that are highly sensitive to the sample used for training. Computer scientists often refer to this problem as the challenge of distinguishing the signal from the noise; you want to pick up on the signal in the training data without picking up on the noise.

An overly simple model will fail to pick up on meaningful signal. An overly complex model will pick up too much noise. Somewhere in the middle is a perfect balance – you can help us find it!

Causal inference


The Fragile Families Challenge presents a unique opportunity to probe the assumptions required for causal inference with observational data. This post introduces these assumptions and highlights the contribution of the Fragile Families Challenge to this scientific question.

 

Causal inference: The problem

Social scientists and policymakers often wish to use empirical data to infer the causal effect of a binary treatment D on an outcome Y. The causal effect for each respondent is the potential outcome that each observation would take under treatment (denoted Y(1)) minus the potential outcome that each observation would take under control (denoted Y(0)). However, we immediately run into the fundamental problem of causal inference: each observation is observed either under the treatment condition or under the control condition.

 

The solution: Assumptions of ignorability

The gold standard for resolving this problem is a randomized experiment. By randomly assigning treatment, researchers can ensure that the potential outcomes are independent of treatment assignment, so that the average difference in outcomes between the two groups can only be attributable to treatment. This assumption is formally called ignorability.

Ignorability: {Y(0), Y(1)} ⊥ D

Because large-scale experiments are costly, social scientists frequently draw causal inferences from observational data based on a simplifying assumption of conditional ignorability.

Conditional ignorability: {Y(0), Y(1)} ⊥ D | X

Given a set of covariates X, conditional ignorability states that treatment assignment D is independent of the potential outcomes that would be realized under treatment Y(1) and control Y(0). In other words, two observations with the same set of covariates X but with different treatment statuses can be compared to estimate the causal effect of the treatment for these observations.

 

Assessing the credibility of the ignorability assumption

Conditional ignorability is an enormous assumption, yet it is what the vast majority of social science findings rely on. By writing the problem in a Directed Acyclic Graph (DAG, Pearl 2000), we can make the assumption more transparent.

X represents pre-treatment confounders that affect both the treatment and the outcome. Though it is not the only way to do so, researchers often condition on X by estimating the probability of treatment given X, denoted P(T | X). Once we account for the differential probability of treatment by the background covariates (through regression, matching, or some other method), we say we have blocked the noncausal backdoor paths connecting T and Y through X.

The key assumption in the DAG concerns Ut: we assume that all unobserved variables that affect the treatment (Ut) have no effect on the outcome Y except through T. Graphically, this amounts to assuming there is no direct edge from Ut to Y; if such an edge exists, causal inferences are not valid.

Researchers often argue that conditional ignorability is a reasonable assumption if the set of predictors included in X is extensive and detailed. The Fragile Families Challenge is an ideal setting in which to test the credibility of this assumption: we have a very detailed set of predictor variables X collected from birth through age 9, which occur temporally prior to treatments reported at age 15.

Nevertheless, the assumption of conditional ignorability is untestable. Interviews may provide some insight to the credibility of this assumption.

 

Goal of the Fragile Families Challenge: Targeted interviews

Through targeted interviews with particularly informative children, we might be able to learn something about the plausibility of the conditional ignorability assumption.

One of the binary variables in the Fragile Families Challenge is whether a child was evicted from his or her home. We will treat this variable as T. We want to know the causal effect of eviction on a child’s chance of graduating from high school (Y). In the Fragile Families Challenge, the set of observed covariates X is all 12,000+ predictor variables included in the Fragile Families Challenge data file.

Based on the ensemble model from the Fragile Families Challenge, we will identify 20 children who were evicted and 20 children who had similar predicted probabilities of eviction but were not evicted. We will interview these children to find out why they were or were not evicted.

 

Potential interviews in support of conditional ignorability:

Suppose we find that children were evicted because their landlords were ready to retire and wanted to get out of the housing market. Those who were not evicted had younger landlords. It might be plausible that the age of one’s landlord is an example of Ut: a variable that affects eviction but has no effect on high school graduation except through eviction. While this would not prove the conditional ignorability assumption, the assumption might seem reasonable in this case.

 

Potential interviews that discredit conditional ignorability:

Suppose instead that we find a different story. Gang activity increased in the neighborhoods of some families, escalating to the point that landlords decided to get out of the business and evict all of their tenants. Other families lived in neighborhoods with no gang activity, and they were not evicted. In addition to its effect on eviction, it is likely that gang activity would alter the chances of high school graduation in other ways, such as by making students feel unsafe at school. In this example, gang activity plays the role of Uty and would violate the assumption of conditional ignorability.

 

Summary

Because costs prohibit randomized experiments to evaluate all potential treatments of interest to social scientists, scholars frequently rely on the assumption of conditional ignorability to draw causal claims from observational data. This is a strong and untestable assumption. The Fragile Families Challenge is a setting in which the assumption may be plausible, due to the richness of the covariate set X, which includes over 12,000 pre-treatment variables chosen for their potentially important ramifications for child development.

By interviewing a targeted set of children chosen by ensemble predictions of the treatment variables, we will shed light on the credibility of the ignorability assumption.

Upload your contribution


This post will walk you through the steps to prepare your files for submission and upload them to our submission platform.

1. Save your predictions as prediction.csv.

This file should be structured the same way as the “prediction.csv” file provided as part of your data bundle.

This file should have 4,242 rows: one for each observation in the test set.

We are asking you to make predictions for all 4,242 cases, which includes both the training cases from train.csv and the held-out test cases. We would prefer that you not simply copy these cases from train.csv to prediction.csv. Instead, please submit the predictions that come out of your model. This way, we can compare your performance on the training and test sets, to see whether those who fit closely to the training set perform more poorly on the test set (see our blog discussing overfitting). Your scores will be determined on the basis of test observations alone, so your predictions for the cases included in train.csv will not affect your score.
There are some observations that are truly missing: we do not have the true answer for these cases because respondents did not complete the interview or did not answer the question. This is true for both the training and the test sets. Your predictions for these cases will not affect your scores. We are asking you to make predictions for missing cases because it is possible that we will find those respondents sometime in the future and uncover the truth. It will be scientifically interesting to know how well the community model was able to predict these outcomes which even the survey staff did not know at the time of the Challenge.

This file should have 7 columns for the ID number and the 6 outcomes. They should be named:

challengeID, gpa, grit, materialHardship, eviction, layoff, jobTraining

The top of the file will look like this (numbers here are random). challengeID numbers can be in any order.
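A minimal sketch of writing a file in this layout follows. The two rows of predictions are invented for illustration; a real prediction.csv has 4,242 rows, one per challengeID.

```python
import csv
import io

# Column order required by the Challenge.
columns = ["challengeID", "gpa", "grit", "materialHardship",
           "eviction", "layoff", "jobTraining"]

# Two invented rows of predictions (probabilities for the binary outcomes).
rows = [[1, 3.2, 3.5, 0.05, 0.10, 0.20, 0.25],
        [2, 2.8, 3.1, 0.12, 0.30, 0.15, 0.40]]

# Write to an in-memory buffer; in practice, open("prediction.csv", "w").
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(columns)
writer.writerows(rows)
header = buf.getvalue().splitlines()[0]
```

Any CSV-writing tool works equally well; the only requirements are the column names, one row per challengeID, and predictions for all 4,242 children.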

 

2. Save your code.

3. Create a narrative explanation of your study. This should be saved in a file called “narrative” and can be a text file, PDF, or Word document.

At the top of this narrative explanation, tell us the names and email addresses of everyone on the team that produced the submission (or just your own if you worked alone), in the format:

Homer Simpson,
homer@gmail.com

Marge Simpson,
msimpson@gmail.com

Then, tell us how you developed the submission. This might include your process for preparing the data for analysis, methods you used in the analysis, how you chose the submission you settled on, things you learned, etc.

4. Zip all the files together in one folder.

It is important that the files be zipped in a folder with no sub-directories. Instructions differ for Mac and Windows.

On Mac, highlight all of the individual files.

Right click and choose “Compress 3 items”.

On Windows, highlight all of the individual files.

Right click and choose
Send to -> Compressed (zipped) folder

5. Upload the zipped folder to the submission site.

Click the “Participate” tab at the top, then the “Submit / View Results” tab on the left. Click the “Submit” button to upload your submission.

6. Wait for the platform to evaluate your submission.

Click “Refresh status” next to your latest submission to view its updated status and see results when they are ready. If successful, you will automatically be placed on the leaderboard when evaluation finishes.

Build a model


Take our data and build models for the 6 child outcomes at age 15. Your model might draw on social science theories about variables that affect the outcomes. It might be a black-box machine learning algorithm that is hard to interpret but performs well. Perhaps your model is some entirely new combination no one has ever seen before!

The power of the Fragile Families Challenge comes from the variety of high-quality individual models we receive. By working together, we will harness the best of a range of modeling approaches. Be creative and show us how well your model can perform!

There are missing values. What do I do?

See our blog post on missing data.

What if I have several ideas?

You can try them all and then choose the best one! Our submission platform allows you to upload up to 10 submissions per day. Submissions will instantly be scored, and your most recent submission will be placed on the leaderboard. If you have several ideas, we suggest you upload them each individually and then upload a final submission based on the results of the individual submissions.

What if I don’t have time to make 6 models?

You can make predictions for whichever outcome interests you. To upload a submission with the required file structure, make a simple numeric guess for the other outcomes. For instance, you might develop a careful model for grit, and then guess the mean of the training values for each of the remaining five outcomes. This would still allow you to upload 6 sets of predictions to the scoring function.
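The constant-guess fallback is a one-liner. Here is a sketch with invented training values: model one outcome carefully and fill the rest with the training mean.

```python
import statistics

# Hypothetical training values for gpa; your careful model covers grit only.
train_gpa = [3.2, 2.8, 3.0, 2.6]
gpa_guess = statistics.mean(train_gpa)  # the same constant guess for every child

predictions = {"grit": 3.4,             # from your careful grit model (invented)
               "gpa": gpa_guess}        # constant-mean fallback
```

The training mean is, in fact, the constant that minimizes mean squared error on the training data, so it is a sensible default when you have no model for an outcome.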

Apply to participate


The Fragile Families Challenge is now closed. We are no longer accepting applications!


What will happen after I apply?

We will review your application and be in touch by e-mail. This will likely take 2-3 business days. If we invite you to participate, you will be asked to sign a data protection agreement. Ultimately, each participant will be given a zipped folder which consolidates all of the relevant pieces of the larger Fragile Families and Child Wellbeing Study in three .csv files.

background.csv contains 4,242 rows (one per child) and 12,943 columns:

  • challengeID: A unique numeric identifier for each child.
  • 12,942 background variables asked from birth to age 9, which you may use in building your model.

train.csv contains 2,121 rows (one per child in the training set) and 7 columns:

  • challengeID: A unique numeric identifier for each child.
  • Six outcome variables (each variable name links to a blog post about that variable)
    1. Continuous variables: grit, gpa, materialHardship
    2. Binary variables: eviction, layoff, jobTraining

prediction.csv contains 4,242 rows and 7 columns:

  • challengeID: A unique numeric identifier for each child.
  • Six outcome variables, as in train.csv. These are filled with the mean value in the training set. This file is provided as a skeleton for your submission; you will submit a file in exactly this form but with your predictions for all 4,242 children included.

Understanding the background variables

To use the data, it may be useful to know something about what each variable (column) represents. Full documentation is available here, but this blog post distills the key points.

Waves and child ages

The background variables were collected in 5 waves.

  • Wave 1: Collected in the hospital at the child’s birth.
  • Wave 2: Collected at approximately child age 1.
  • Wave 3: Collected at approximately child age 3.
  • Wave 4: Collected at approximately child age 5.
  • Wave 5: Collected at approximately child age 9.

Note that wave numbers are not the same as child ages. The variable names and survey documentation are organized by wave number.

Variable naming conventions

Predictor variables are identified by a prefix and a question number. The prefix indicates the survey in which a question was asked, which is useful because the documentation is organized by survey. For instance, the variable m1a4 refers to question a4 of the mother interview in wave 1.

  1. A leading c indicates a variable constructed from other responses. For instance, cm4b_age is constructed from the mother wave 4 interview and captures the child’s (baby’s) age.
  2. m1, m2, m3, m4, m5: Questions asked of the child’s mother in waves 1 through 5.
  3. f1, ..., f5: Questions asked of the child’s father in waves 1 through 5.
  4. hv3, hv4, hv5: Questions asked in the home visit in waves 3, 4, and 5.
  5. p5: Questions asked of the primary caregiver in wave 5.
  6. k5: Questions asked of the child (kid) in wave 5.
  7. ffcc: Questions asked in various child care provider surveys in wave 3.
  8. kind: Questions asked of the kindergarten teacher.
  9. t5: Questions asked of the teacher in wave 5.
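These conventions are regular enough to parse programmatically. Below is a rough sketch of a decoder; it is not an official tool, and the ffcc and kind prefixes (which have no wave digit) are deliberately left unhandled.

```python
import re

# Survey prefixes from the naming conventions above.
SURVEYS = {"m": "mother", "f": "father", "hv": "home visit",
           "p": "primary caregiver", "k": "child", "t": "teacher"}

def describe(name):
    """Return the survey, wave, and constructed flag for a variable name."""
    constructed = name.startswith("c")      # e.g. cm4b_age
    core = name[1:] if constructed else name
    match = re.match(r"([a-z]+)(\d)", core)
    if not match or match.group(1) not in SURVEYS:
        return None                          # ffcc, kind, etc. are not handled
    return {"survey": SURVEYS[match.group(1)],
            "wave": int(match.group(2)),
            "constructed": constructed}

info = describe("m1a4")       # mother interview, wave 1
cinfo = describe("cm4b_age")  # constructed, from the mother wave 4 interview
```

A mapping like this makes it easy to group the 12,942 background columns by survey and wave when curating features.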

Ready to work with the data?

See our posts on building a model and working with missing data.