Matt Salganik

December 3, 2021

Update about Fragile Families Challenge community paper

We recently published a correction about the community paper from the Fragile Families Challenge (https://www.pnas.org/content/118/50/e2118703118). In brief, when we published our paper, we also posted our code on Dataverse (https://doi.org/10.7910/DVN/CXSECU), consistent with open science practices. Subsequently, one research team—Charles Rahal and Mark Verhagen—contacted us about a possible bug in our code. We have investigated this report carefully, and we agree that there was a bug in our code. While investigating this bug, we found two other unrelated issues. In the end, however, none of these issues impacted the conclusions of the article in any way.

In addition to publishing the correction, we have updated our replication materials and alerted any researchers we know who are working with the replication materials. If you are working with these replication materials, we recommend you use the updated version. We’d be happy to answer any questions about them.

Matt Salganik

August 5, 2020

“Prediction, Machine Learning, and Individual Lives: An Interview with Matthew Salganik” published in Harvard Data Science Review

An interview with Challenge co-organizer Matthew Salganik, conducted by Lauren Maffeo and Cynthia Rudin, was just published in the Harvard Data Science Review. The piece—“Prediction, Machine Learning, and Individual Lives: An Interview with Matthew Salganik”—covers the Fragile Families Challenge and its implications for scientists and policy makers.

Matt Salganik

August 2, 2020

Publication of paper and Socius Special Collection

We are happy to (belatedly) announce the publication of both a paper about the Fragile Families Challenge with 112 authors in the Proceedings of the National Academy of Sciences and the Socius Special Collection.

Here are some links:

Salganik et al. 2020. “Measuring the predictability of life outcomes with a scientific mass collaboration”
Commentary by Filiz Garip about Salganik et al. 2020. “What failure to predict life outcomes can teach us.”
Replication materials for Salganik et al. 2020.
Socius Special Collection on the Fragile Families Challenge
Twitter thread summarizing findings
Article about project by Rose Huber, Princeton University: “Multi-year datasets suggest projecting outcomes of people’s lives with AI isn’t so simple”

Matt Salganik

July 15, 2019

reaching out to all participants in the Fragile Families Challenge

We are in the process of writing the community paper that describes the results of the Fragile Families Challenge, and we have attempted to email all participants in the Challenge. If you participated in the Challenge but have not received an email from us (perhaps because your email address has changed), please contact us by Wednesday, July 17, 2019 at 11:59pm.

Matt Salganik

May 19, 2019

Searching for a research software engineer

We are searching for a research software engineer to work on research related to the Fragile Families Challenge. The project will be a mix of software engineering, data science, and social science. Application review begins soon. Here is a link to the full position description: https://main-princeton.icims.com/jobs/10347/research-software-engineer/job

Matt Salganik

May 1, 2019

computational reproducibility and the Fragile Families Challenge

We’ve recently posted a working paper about our experiences with computational reproducibility and the Fragile Families Challenge special issue.

Successes and struggles with computational reproducibility: Lessons from the Fragile Families Challenge

David M. Liu and Matthew J. Salganik

Abstract: Reproducibility is fundamental to science, and an important component of reproducibility is computational reproducibility: the ability of a researcher to recreate the results in a published paper using the original author’s raw data and code. Although most people agree that computational reproducibility is important, it is still difficult to achieve in practice. In this paper, we describe our approach to enabling computational reproducibility for the 12 papers in this special issue of Socius about the Fragile Families Challenge. Our approach draws on two tools commonly used by professional software engineers but not widely used by academic researchers: software containers (e.g., Docker) and cloud computing (e.g., Amazon Web Services). These tools enabled us to standardize the computing environment around each submission, which will ease computational reproducibility both today and in the future. Drawing on our successes and struggles, we conclude with recommendations to authors and journals.

Matt Salganik

December 13, 2018

Making it easier to work with the Fragile Families metadata in R and Python

One thing we noticed during the Fragile Families Challenge is that some participants struggled to work with the data and metadata. So, after the Challenge ended, we used what we learned to rebuild the Fragile Families and Child Wellbeing Study data and metadata, making it easier to use. This process is described in more detail in our paper: “Improving metadata infrastructure for complex surveys:  Insights from the Fragile Families Challenge.” Now we are happy to announce the release of an R package (ffmetadata) and a Python package (ffmetadata-py) that make it easier to work with the Fragile Families metadata without leaving your data analysis environment.

Here’s an example of how you could install and use the R package:

And, here’s an example of how you could install and use the Python package:

Basically, both of these packages are convenient wrappers around the metadata API. You can read more about the design of the API in our paper “Improving metadata infrastructure for complex surveys:  Insights from the Fragile Families Challenge.”

The R package was written by Ryan Vinh, with assistance from Ian Fellows and Will Lowe. The Python package was written by Vineet Bansal. The code for both packages is available open source:

ffmetadata (R package); usage examples and additional information can be found in the vignette
ffmetadata-py (Python package)

Alex Kindel

November 17, 2018

Improving metadata infrastructure for complex surveys

Anyone who uses survey data for research purposes knows how important metadata is for developing an understanding of a dataset’s structure and meaning. One of the big things we learned from organizing the Challenge is that machine learning methods place an extraordinary demand on metadata. Using 10k variables in a single model requires new ways of reading and using metadata to accomplish necessary data preparation tasks, and many of these tasks are not easily accomplished using the metadata infrastructure that is most commonly available in the social sciences (e.g. PDF codebooks).

To summarize how we’ve tried to improve these resources and what we learned as we undertook our redesign, we wrote a paper that will appear in a forthcoming special issue of Socius about the Fragile Families Challenge. We provide a link to the paper (on SocArXiv) as well as its abstract below; any comments or questions are most welcome!

Improving metadata infrastructure for complex surveys:  Insights from the Fragile Families Challenge
Abstract: Researchers rely on metadata systems to prepare data for analysis. As the complexity of datasets increases and the breadth of data analysis practices grow, existing metadata systems can limit the efficiency and quality of data preparation. This article describes the redesign of a metadata system supporting the Fragile Families and Child Wellbeing Study based on the experiences of participants in the Fragile Families Challenge. We demonstrate how treating metadata as data—that is, releasing comprehensive information about variables in a format amenable to both automated and manual processing—can make the task of data preparation less arduous and less error-prone for all types of data analysis. We hope that our work will facilitate new applications of machine learning methods to longitudinal surveys and inspire research on data preparation in the social sciences. We have open-sourced the tools we created so that others can use and improve them.

Matt Salganik

October 19, 2018

Presentation about the Fragile Families Challenge: November 8, 2018 at Brown

We will be giving a talk about the Fragile Families Challenge on Thursday, November 8, 2018 in the Population Studies and Training Center (PSTC) Seminar Series at Brown University. The talk will be in Mencoff Hall, Seminar Room 205 at noon. The talk is open to everyone.

Here’s more information: https://www.brown.edu/academics/population-studies/event/fragile-families-challenge

Matt Salganik

October 19, 2018

Presentation about the Fragile Families Challenge: October 22, 2018 at Princeton

We will be giving a talk about the Fragile Families Challenge on Monday, October 22, 2018 at 4:00 pm in the Program in Applied and Computational Mathematics (PACM) Colloquium at Princeton University. The talk will be in Fine Hall, Room 214. This talk is open to everyone.

Here’s more information: https://www.pacm.princeton.edu/events/pacm-colloquium-matthew-salganik-princeton-university

Ian Lundberg

September 14, 2018

Privacy, ethics, and data access: A case study of the Fragile Families Challenge

This blog post summarizes a paper describing the privacy and ethics process by which we organized the Fragile Families Challenge. The paper will appear in a special issue of the journal Socius. This post is cross-posted on the Freedom to Tinker blog.

Academic researchers, companies, and governments holding data face a fundamental tension between risk to respondents and benefits to science. On one hand, these data custodians might like to share data with a wide and diverse set of researchers in order to maximize possible benefits to science. On the other hand, the data custodians might like to keep data locked away in order to protect the privacy of those whose information is in the data. Our paper is about the process we used to handle this fundamental tension in one particular setting: the Fragile Families Challenge, a scientific mass collaboration designed to yield insights that could improve the lives of disadvantaged children in the United States. We wrote this paper not because we believe we eliminated privacy risk, but because others might benefit from our process and improve upon it.

One scientific objective of the Fragile Families Challenge was to maximize predictive performance of adolescent outcomes (e.g., high school GPA) measured at approximately age 15 given a set of background variables measured from birth through age 9. We aimed to do so using the Common Task Framework (see Donoho 2017, Section 6): we would share data with researchers who would build predictive models using observed outcomes for half of the cases (the training set). These researchers would receive instant feedback on out-of-sample performance in ⅛ of the cases (the leaderboard set) and ultimately be evaluated by performance in ⅜ of the cases which we would keep hidden until the end of the Challenge (the holdout set). If scientific benefit were the only goal, the optimal design might be to include every possible variable in the background set and share with anyone who wanted access with no restrictions.
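To make the three-way split concrete, here is a minimal sketch of how cases could be divided into training, leaderboard, and holdout sets in those proportions. It is not the procedure we actually used: the function name, the seed, and the 800 made-up case IDs are all assumptions chosen for illustration.

```python
import random


def split_cases(case_ids, seed=0):
    """Randomly assign cases to training (1/2), leaderboard (1/8), and holdout (rest)."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split can be regenerated
    n_train = len(ids) // 2
    n_leaderboard = len(ids) // 8
    return {
        "training": ids[:n_train],
        "leaderboard": ids[n_train:n_train + n_leaderboard],
        "holdout": ids[n_train + n_leaderboard:],  # remainder, roughly 3/8
    }


# Hypothetical case IDs 0..799, used only to show the proportions.
splits = split_cases(range(800))
print({name: len(ids) for name, ids in splits.items()})
# {'training': 400, 'leaderboard': 100, 'holdout': 300}
```

In a design like this, participants would see outcomes only for the training cases; the leaderboard and holdout outcomes would stay with the organizers.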

Scientific benefit may be maximized by sharing data widely, but risk to respondents is also maximized by doing so. Although methods of data sharing with provable privacy guarantees are an active area of research, we believed that solutions that could offer provable guarantees were not possible in our setting without a substantial loss of scientific benefit (see preprint section 2.4). Instead, we engaged in a privacy and ethics process that involved threat modeling, threat mitigation, and third-party guidance, all undertaken within an ethical framework.

Threat modeling

Our primary concern was a risk of re-identification. Although our data did not contain obvious identifiers, we worried that an adversary could find an auxiliary dataset containing identifiers as well as key variables also present in our data. If so, they could link our dataset to the identifiers (either perfectly or probabilistically) to re-identify at least some rows in the data. For instance, Sweeney (2002) was able to re-identify Massachusetts medical records data by linking to identified voter records using the shared variables zip code, date of birth, and sex. Given the vast number of auxiliary datasets (red) that exist now or may exist in the future, it is likely that some research datasets (blue) could be re-identified. It is difficult to know in advance which key variables (purple) may aid the adversary in this task.
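The linkage idea can be sketched in a few lines. The records below are invented for illustration and do not come from any real dataset; the function name and field names are likewise assumptions. The sketch only reports a match when a combination of key variables is unique in both datasets, which is the "perfect" linking case described above.

```python
from collections import defaultdict

# Toy records invented for illustration; none of these values come from real data.
research_rows = [
    {"row": "r1", "zip": "02138", "dob": "1960-01-15", "sex": "F"},
    {"row": "r2", "zip": "02139", "dob": "1975-06-02", "sex": "M"},
]
auxiliary_rows = [
    {"name": "Person A", "zip": "02138", "dob": "1960-01-15", "sex": "F"},
    {"name": "Person B", "zip": "02140", "dob": "1982-11-30", "sex": "M"},
]
KEY_VARIABLES = ("zip", "dob", "sex")


def group_by_key(rows, keys):
    """Group rows by their values on the key variables."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return groups


def link(research, auxiliary, keys=KEY_VARIABLES):
    """Return (name, research row id) pairs where the key values are unique in both datasets."""
    research_groups = group_by_key(research, keys)
    matches = []
    for key, people in group_by_key(auxiliary, keys).items():
        rows = research_groups.get(key, [])
        if len(people) == 1 and len(rows) == 1:
            matches.append((people[0]["name"], rows[0]["row"]))
    return matches


matches = link(research_rows, auxiliary_rows)
print(matches)  # [('Person A', 'r1')]
```

The defender's difficulty is visible even in this toy version: the attack needs no special skill, only an auxiliary dataset that happens to share the right key variables.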

To make our worries concrete, we engaged in threat modeling: we reasoned about who might have both (a) the capability to conduct such an attack and (b) the incentive to do so. We even tried to attack our own data. Through this process we identified five main threats (the rows in the figure below). A privacy researcher, for instance, would likely have the skills to re-identify the data if they could find auxiliary data to do so, and might also have an incentive to re-identify, perhaps to publish a paper arguing that we had been too cavalier about privacy concerns. A nosy neighbor who knew someone in the data might be able to find that individual’s case in order to learn information about their friend which they did not already know. We also worried about other threats that are detailed in the full paper.

Threat mitigation

To mitigate threats, we took several steps to (a) reduce the likelihood of re-identification and to (b) reduce the risk of harm in the event of re-identification. While some of these defenses were statistical (i.e. modifications to data designed to support aims [a] and [b]), many instead focused on social norms and other aspects of the project that are more difficult to quantify. For instance, we structured the Challenge with no monetary prize, to reduce an incentive to re-identify the data in order to produce remarkably good predictions. We used careful language and avoided making extreme claims to have “anonymized” the data, thereby reducing the incentive for a privacy researcher to correct us. We used an application process to only share the data with those likely to contribute to the scientific goals of the project, and we included an ethical appeal in which potential participants learned about the importance of respecting the privacy of respondents and agreed to use the data ethically. None of these mitigations eliminated the risk, but they all helped to shift the balance of risks and benefits of data sharing in a way consistent with ethical use of the data. The figure below lists our main mitigations (columns), with check marks to indicate the threats (rows) against which they might be effective. The circled check mark indicates the mitigation that we thought would be most effective against that particular adversary.

Third-party guidance

A small group of researchers highly committed to a project can easily convince themselves that they are behaving ethically, even if an outsider would recognize flaws in their logic. To avoid groupthink, we conducted the Challenge under the guidance of third parties. The entire process was conducted under the oversight and approval of the Institutional Review Board of Princeton University, a requirement for social science research involving human subjects. To go beyond what was required, we additionally formed a Board of Advisers to review our plan and offer advice. This Board included experts from a wide range of fields.

Beyond the Board, we solicited informal outside advice from anyone we could talk to who might have thoughts about the process, and this proved valuable. For example, at the advice of someone with experience planning high-risk operations in the military, we developed a response plan in case something went wrong. Having this plan in place meant that we could have responded quickly and forcefully had something unexpected occurred.

Ethics

After the process outlined above, we still faced an ethical question: should we share the data and proceed with the Fragile Families Challenge? This was a deep and complex question to which a fully satisfactory answer was likely to be elusive. Much of our final decision drew on the principles of the Belmont Report, a standard set of principles used in social science research ethics. While not perfect, the Belmont Report serves as a reasonable benchmark because it is the standard that has developed in the scientific community regarding human subjects research. The first principle in the Belmont Report is respect for persons. Because families in the Fragile Families Study had consented for their data to be used for research, sharing the data with researchers in a scientific project agreed with this principle. The second principle is beneficence, which requires that the risks of research be balanced against potential benefits. The threat mitigations we carried out were designed with beneficence in mind. The third principle is justice: that the benefits of research should flow to a similar population that bears the risks. Our sample included many disadvantaged urban American families, and the scientific benefits of the research might ultimately inform better policies to help those in similar situations. It would be wrong to exclude this population from the potential to benefit from research, so we reasoned that the project was in line with the principle of justice. For all of these reasons, we decided with our Board of Advisers that proceeding with the project would be ethical.

Conclusion

To unlock the power of data while also respecting respondent privacy, those providing access to data must navigate the fundamental tension between scientific benefits and risk to respondents. Our process did not offer provable privacy guarantees, and it was not perfect. Nevertheless, our process to address this tension may be useful to others in similar situations as data stewards. We believe the core principles of threat modeling, threat mitigation, and third-party guidance within an ethical framework will be essential to such a task, and we look forward to learning from others in the future who build on what we have done to improve the process of navigating this tension.

You can read more about our process in our pre-print: Lundberg, Narayanan, Levy, and Salganik (2018) “Privacy, ethics, and data access: A case study of the Fragile Families Challenge.”

Matt Salganik

July 30, 2018

Fragile Families Challenge at the American Sociological Association Annual Meeting

We will be presenting three papers about the Fragile Families Challenge at the 2018 American Sociological Association Annual Meeting, which will be in Philadelphia August 11–14. Please come to these sessions to learn more about the Challenge.

Data-driven Data Provision: A Case Study from the Fragile Families Challenge
Sun, August 12, 8:30 to 10:10am, Philadelphia Marriott Downtown, Level 4, 404

Metadata provides critical support for researchers working with public datasets, but new methods at times outgrow what existing data infrastructure is able to support. This paper describes what happened when a large, heterogeneous group of researchers used a complex social data set in a way that was not originally envisioned by its creators. Using the Fragile Families Challenge as a case study, we identify five strategic areas where improving metadata — variable names, response codes, cross-questionnaire matching, concept tags, and release format — can make data use easier for everyone. More generally, we illustrate some of the unintentional and invisible barriers that are preventing the use of machine learning methods in the social sciences, and suggest that data system design is a fundamental research problem for the field of computational social science.

The Fragile Families Challenge: Predictability of Family and Child Well-being in Adolescence
Sun, August 12, 10:30am to 12:10pm, Philadelphia Marriott Downtown, Level 4, Franklin Hall 8

Scholars have long hypothesized that childhood experiences play an important role in the process by which socioeconomic status is reproduced across generations. The predictive power of attainment models, however, has been so weak that pioneers of the field have commented that random chance must play an important role. We hypothesize another possible source of poor predictive performance: untapped modeling potential. Modern machine learning approaches often yield better predictions than parametric regression models, yet social scientists have not fully exploited this opportunity. In this paper, we report on how 159 research teams from 68 institutions in 7 countries used rich survey data covering 2,121 training observations on 12,942 variables to produce predictive models that together set a benchmark of predictive performance for outcomes identified by social scientists as important factors in the status attainment process. We narrow our focus to a critical point of the life course: predicting adolescent outcomes as a function of childhood experiences. Each team developed a predictive model that was evaluated on a set of outcome observations available only to the organizers. Results suggest that (a) predictive performance outpaced approaches more common in social science, but (b) overall predictive performance was poor. We close with a discussion of the potential reasons for poor predictive performance in social science research. Given the theoretical importance of childhood experiences in the process of stratification, our results should be of interest to scholars of stratification, socio-economic mobility, child development, and statistical methods.

Privacy, Ethics, and Computational Social Science: A Case Study of the Fragile Families Challenge
Mon, August 13, 4:30 to 6:10pm, Philadelphia Marriott Downtown, Level 4, Franklin Hall 12

New sources of “big data” created by companies and governments hold great promise for advancing social science. Unfortunately, a fundamental barrier preventing researchers from achieving this promise is data access. Quite simply, most big data sources are not accessible to researchers. Therefore, developing procedures that enable safe and ethical data access represents an important methodological problem in computational social science. In this paper, we present our process for enabling data access during the Fragile Families Challenge, a scientific mass collaboration designed to improve the lives of disadvantaged children in the United States. We describe our process of threat modeling, threat mitigation, and third-party oversight. We also describe the ethical principles that formed the basis of our process. Ultimately, we hope that the approach that we developed will be helpful to researchers who seek data access and data custodians who wish to provide data access.

Matt Salganik

July 30, 2018

Fragile Families Challenge and Princeton AI4ALL

AI4ALL is a non-profit educational program designed to increase the diversity of researchers working in AI. This summer there is an AI4ALL program at Princeton, and one of the projects participants will work on is a modified version of the Fragile Families Challenge. We look forward to seeing what new insights these students bring to the Challenge.

If you would like to use a modified version of the Fragile Families Challenge in your teaching or research, please contact us.

Matt Salganik

January 26, 2018

computational reproducibility and the Fragile Families Challenge special issue

We are currently editing a special issue of Socius about the Challenge. For this special issue, we are striving for a standard of computational reproducibility, which means that other researchers should be able to recreate the results in all of the papers. Therefore, while the manuscripts have been undergoing peer review, we have also been attempting to replicate the results in each paper. This has turned out to be trickier than we expected. In this post, I’d like to briefly summarize what we’ve done so far, and then share a set of guidelines that we’ve developed and shared with our authors. If you have ideas for how these guidelines can be improved, please let us know. Ultimately, we hope that the guidelines will be a helpful resource for authors and editors who wish to promote computational reproducibility, either in their own work or the work of others.

Our replication efforts have been spearheaded by David Liu, and this work will be part of his senior thesis in Princeton’s Department of Computer Science. In attempting to replicate the results of each paper, David has noticed helpful things that some authors have done, and he’s found some problems that come up over and over. Therefore, when we sent back decisions on the manuscripts, we also sent the feedback below on code. Just as authors have to revise and resubmit their manuscripts, for the special issue, authors will have to revise and submit their code. These guidelines are intended to help with that process.

Background behind reproducibility guidelines

First, we’d like to step back from the details to describe the high-level goal. We want your articles to be computationally reproducible, which means that another researcher could regenerate the results in your paper using the Challenge data, your code, and any additional data that you have created. Computational reproducibility will increase the impact of your work individually, and it will increase the contribution of the Challenge collectively.

As we’ve learned during this first round of reviews, the goal of computational reproducibility is widely shared by scientists, easy to state, and tricky to achieve. Based on what we’ve learned from your code, our thinking on how to achieve this goal has evolved. In particular, we’ve been very influenced by the idea of a “research pipeline” described by Peng and Eckel (2014), which is nicely captured by this figure: http://bit.ly/2qrTWXK.

The goal of this document is to provide you with guidelines that support computational reproducibility of your entire research pipeline, which goes from raw data to final output. You don’t have to follow these guidelines exactly; if you devise a system that you think is better, you are welcome to use it. But if you have no system in place, we strongly encourage you to adopt these guidelines.

The Guidelines

The most important thing to keep in mind is that we are asking you to create one single script named “run_all” that executes all the necessary files to go from the raw data to the final results. One way to do this is to write a bash script that calls the submission files in sequence. An example of a simple bash script is shown below:

Running the above script will execute each line, one after another. Note that the screen shot includes examples for many common languages. More background information on writing bash scripts is available at: https://ryanstutorials.net/bash-scripting-tutorial/bash-script.php. Of course, you may write the run_all script in the language of your choice so long as it can be executed from the command line.
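For authors who would rather not use bash, here is a minimal sketch of a run_all script written in Python. It is one possible shape, not a required template. The two steps are placeholders that only print a message, so the sketch runs anywhere; the file names mentioned in the comment are hypothetical.

```python
"""run_all.py: run every pipeline step in order, from raw data to final results."""
import subprocess
import sys

# Placeholder steps so this sketch runs anywhere. In a real submission each entry
# would call one of your own files, e.g. [sys.executable, "code/clean_data.py"]
# or ["Rscript", "code/fit_model.R"] (hypothetical file names).
STEPS = [
    [sys.executable, "-c", "print('step 1: prepare data')"],
    [sys.executable, "-c", "print('step 2: fit model')"],
]


def run_all(steps):
    """Run each command in sequence, stopping at the first failure."""
    for command in steps:
        subprocess.run(command, check=True)  # raises CalledProcessError on a non-zero exit
    return len(steps)


if __name__ == "__main__":
    completed = run_all(STEPS)
    print(f"Completed {completed} steps")
```

Saved as run_all.py, this could be executed from the command line with `python run_all.py`. Passing `check=True` means a failing step halts the pipeline rather than letting later steps run on stale or missing intermediate files.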

While you are creating this script, we think it will be helpful to organize your input files, intermediate files, and output files into a standard directory (i.e., folder) structure. We think that this structure would help you create a modular research pipeline; see Peng and Eckel (2014) for more on the modularity of a research pipeline. This modular pipeline will make it easier for us to ensure computational reproducibility, and it will make it easy for other researchers to understand, re-use, and improve your code.

Here’s a basic structure that we think might work well for this project and others:

data/
code/
output/
README
LICENSE

In the data/ directory you can include:

background.csv (this should not actually be included because of privacy constraints, but we will put it here)
train.csv (this should not actually be included because of privacy constraints, but we will put it here)
Supplemental materials such as metadata files, the constructed-data dictionary, the machine-readable codebook.
Data that you have collected or created, such as a csv file that you manually created that has your MSE scores on the holdout data and/or an analytic dataset created by your code.

In your code/ directory you can include:

Executable “run all” script that when run goes from raw inputs all the way to final outputs (for this script we encourage you to think about the research pipeline idea from Peng and Eckel 2014: http://bit.ly/2qrTWXK)
Source code files each with a useful header (see FAQ).
Package requirements

For python submissions, please include a requirements.txt file. More information at: https://pip.pypa.io/en/stable/user_guide/#requirements-files
For R submissions, please list all libraries utilized in a file named requirements_r.txt. Include each library name on a new line.

In your output/ directory you can include:

prediction.csv
A subdirectory for tables
A subdirectory for figures (we also recommend including all data files that can be used to recreate the figures; see rule 7 of Sandve et al. 2013)

In addition to these three main directories, you should also include a README file and LICENSE file. We have more information about these files in the FAQ below. We hope that these guidelines are helpful, and please let us know if you have any questions.
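If you are starting from scratch, the layout above can be created with a short script. The sketch below is only a convenience under our own assumptions: the function name is made up, the subdirectory names tables and figures are one reasonable choice, and the files it creates are empty placeholders that you would fill in yourself.

```python
import tempfile
from pathlib import Path


def scaffold(root):
    """Create the data/, code/, and output/ layout plus empty README and LICENSE files."""
    root = Path(root)
    for folder in ("data", "code", "output/tables", "output/figures"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    for filename in ("README", "LICENSE", "code/requirements.txt"):
        (root / filename).touch(exist_ok=True)  # empty placeholders to fill in by hand
    # Return everything that now exists, as paths relative to the root.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


# Demonstrate in a throwaway directory so nothing is written to your project.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold(tmp)
print(created)
```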

Code Resubmission Process

Once you think you are ready to resubmit, here’s a checklist that you can follow to help ensure that your work will be computationally reproducible:

I have written the kind of README file that I would like to read (see FAQ below)
Each code file that I’ve written has a header that will be helpful (see FAQ below)
I’ve run the submission and I can get from raw files to final output using only materials in my directories. Then, I’ve done this again and I get the same result. This second step helps check for problems with seeding.
I’ve considered refactoring my code (see FAQ below)
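One way to carry out the "run it twice" check is to fingerprint the output directory after each run and compare. The sketch below assumes nothing about your pipeline; the function names are ours, and the small prediction file it writes (including its column names and values) is a made-up stand-in used to simulate a rerun that changed the results.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(output_dir):
    """Map each file under output_dir to the SHA-256 hash of its contents."""
    root = Path(output_dir)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(first, second):
    """List files whose hashes differ, or that exist in only one of the two runs."""
    return sorted(
        name for name in first.keys() | second.keys()
        if first.get(name) != second.get(name)
    )


# Simulate two runs whose outputs differ, as might happen without a fixed seed.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "prediction.csv").write_text("id,value\n1,2.9\n")
    first = fingerprint(out)
    (out / "prediction.csv").write_text("id,value\n1,3.1\n")
    second = fingerprint(out)

print(changed_files(first, second))  # ['prediction.csv']
```

A byte-for-byte comparison is strict: some output formats may embed timestamps or other run-specific details, so a reported difference is a prompt to look more closely rather than proof of a seeding problem.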

Finally, when you resubmit, we ask that you include a revision memo about the code, just as you will about the manuscript. This revision memo should summarize changes that you have made. In this revision memo, please also include a rough estimate of the cumulative amount of time it took you to comply with these guidelines. We are asking for this time estimate because one objection to computational reproducibility is that it is too burdensome for authors and we would like to assess this empirically. Finally, please include any suggestions for how this process could have been easier or more efficient.

F.A.Q.

What should go in the README file?

The README file should provide an overview of your code. For example, it could include a diagram showing the different pieces of your code, their inputs and their outputs. If relevant, please include expected warnings when executing the code. Mention any provided “intermediate results” readers can utilize to decompose the submission into smaller pieces.

The README should also include something about your computing environment and expected run time; general terms are appropriate here. For example: “I ran this on a modern laptop (circa 2016) and it ran in a few minutes.” or “This code ran on a high-performance cluster and took one week.” Finally, please clearly cite any open sourced content utilized in the submission, such as resources shared in the FFC blog or more general packages distributed in the computation community.

What headers should be included at the top of each piece of code?

Based on the ideas in Nagler (1995), we think the following elements should be included at the top of each piece of code:

Purpose (in 140 characters or less)
Inputs
Outputs
Machine used (e.g., laptop, desktop, cluster)
Expected runtime (e.g., seconds, minutes, hours, days, etc)
Set the seed at the beginning of each file (see rule 6 of Sandve et al. 2013)
All the package include statements (e.g., “library(ggplot2)” in R)

If you would like to deviate from this standard, please contact us.
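
As an illustration, a header covering these elements might look like the following in Python. This is only a sketch: the file paths, the seed value, and the helper function are invented for the example, and only the list of elements comes from the guidelines above.

```python
# Purpose: Draw a reproducible random sample (hypothetical example file).
# Inputs: data/background.csv (hypothetical path, not read in this sketch)
# Outputs: output/predictions.csv (hypothetical path, not written in this sketch)
# Machine used: laptop
# Expected runtime: seconds

# All package import statements, gathered at the top of the file
import random

# Set the seed at the beginning of the file so that reruns give the same result
SEED = 2018  # arbitrary value chosen for this example
random.seed(SEED)


def draw_sample(n):
    """Return n pseudo-random numbers from the seeded generator."""
    return [random.random() for _ in range(n)]
```

Resetting the seed and calling draw_sample again should return identical values, which is the same kind of check as running the whole submission twice and comparing the results.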

How can I make my code easier to read?

It is hard to offer general advice, but one thing that we can recommend is at the end of the process take some time to refactor your code (https://en.wikipedia.org/wiki/Code_refactoring). In our experience, code evolves over the course of a project, and at the end it can be helpful to refactor in order to clean up the structure, improve variable names, and promote modularity.

Even if you don’t refactor your code, please add comments to helper functions and code segments that may be obscure to new readers.

What is our standard for computational reproducibility for the special issue?

Our standard for computational reproducibility for this special issue is that we should be able to take whatever code and data you submit, add the Fragile Families Challenge data file, and then reproduce all of the figures in your paper, all of the tables in your paper, and your predictions.csv file.

What is not included in our standard for computational reproducibility for the special issue?

We will not attempt to completely recreate your analysis from the written materials. Also, we will not verify that your description in the paper matches the code. For example, if the paper says that you use logistic regression to generate your predictions, we will not verify that the code also uses logistic regression. Further, we will not verify the information that you have provided from external sources. For example, if you write in the paper that your submission was 10th on the leaderboard, we will not verify this fact. Finally, we will not verify any of the numbers that are included in the text of the manuscript. For example, we would not verify a claim in the text such as: dropping variables with no variation removes 10% of variables. As we hope this list illustrates, our standard of computational reproducibility is in fact quite limited.

What license should I use?

We strongly recommend the MIT license. You can find it here: https://opensource.org/licenses/MIT. Simply replace the year placeholder with 2018 and the copyright holder placeholder with the names of all co-authors of the paper, in the order they are listed in the paper. If you would like to use some other license, please contact us.

What should I read to learn more about computational reproducibility?

Here’s a partial list. If we’ve left off a good resource, please let us know (fragilefamilieschallenge@gmail.com).

Nagler (1995) “Coding Style and Good Computing Practices” PS: Political Science & Politics. (open access version)

Peng and Eckel (2009) “Distributed Reproducible Research Using Cached Computations” Computing in Science & Engineering.

Sandve et al (2013) “Ten Simple Rules for Reproducible Computational Research” PLOS Computational Biology.

Stodden et al (2016) “Enhancing reproducibility for computational methods” Science.

Matt Salganik

January 24, 2018

Fragile Families Challenge special issue feedback

We’ve recently completed the first round of reviews for papers in the special issue of Socius about the Fragile Families Challenge. There were many really interesting manuscripts submitted, but there were a variety of issues that came up repeatedly in the reviews. Therefore, in addition to providing feedback on each manuscript individually, we also developed some overall feedback that we provided to all authors. We are posting that feedback here in the hopes that it might help others who are planning to run a mass collaboration and publish a special issue.

Feedback to all authors

Based on our reading of all submissions and all reviews, we are encouraging all authors submitting revisions to the special issue to give extra attention to reviewer comments in the following three areas:

1) Accuracy. We are encouraging all revisions to focus on more clearly describing what they did, why they did it, and what might be learned from it. You must accurately report what you did. When reviewers ask why you did something, this is an important question to address. For the purpose of the special issue, you do not always need a formal justification for making a decision; if you just thought it seemed reasonable, you should say that.

In addition, we are encouraging all authors to clearly report all of their results, not just those that make their approach look more promising. When deciding whether to publish the paper, a major factor for us will be whether the paper communicates clearly the strengths and weaknesses of the approach. This factor will be much more important than whether the results are “interesting” or “promising.” Any reviewer comments about selective reporting are especially important to address.

If you used an approach that required tuning parameters (e.g., the lambda parameter in LASSO), please say how you set the parameters. The most common approaches seem to be cross-validation or using the defaults in the software. This should be clear in the papers.

2) Interest. A reader of your paper should quickly see why it would be of interest to some social scientists or some data scientists. We encourage you to add a few sentences in the introduction that clarify what you think are the most interesting or important ideas or results in your paper. Again, we think this will be helpful given the interdisciplinary nature of the readership. Also, if you think the main contribution is to establish the baseline against which future efforts can be compared, we think that is an important contribution.

3) Presentation. It is very important that the special issue be readable for both data scientists and social scientists. These communities sometimes use different language, and we have sought reviewers from both cultures. When reviewers are confused about something common in your field, realize that an extra sentence or reference might make the paper more readable to a diverse audience, thereby increasing the impact of your paper.

Also, inconsistent terminology often stands in the way of effective presentation. Be careful that your manuscript uses internally consistent terminology. One recommendation to promote consistency is to choose a book or an authoritative article and use its terminology. This way, terminology will be internally consistent, and confused readers are immediately pointed toward a source that can help them understand.

Stepping back from these three areas of focus, we would like to remind authors that the use of online supporting material can greatly improve accuracy, interest, and presentation. Yet very few of the manuscripts used this opportunity. Online supporting materials can be arbitrarily long and provide an opportunity to be clear about even the most mundane decisions (accuracy), reduce clutter in the paper so that non-specialists can follow the main ideas (interest), and provide an outlet to share details with researchers who wish to understand and build on your work (presentation). If there is part of your paper that will be of interest to only a small subset of readers, we strongly encourage you to put this information in the online supporting materials.

Based on our reading of all submissions and all reviews, we are encouraging all authors submitting revisions to the special issue to make certain formatting changes:

1) In the acknowledgements, you should list and cite the software that you use. This will promote reproducibility and give academic credit to folks who create software. We recommend two sentences like these: “The results in this paper were created with software written in R 3.3.3 (R Core Team, 2017) using the following packages: ggplot2 2.2.1 (Wickham, 2009), broom 0.4.2 (Robinson, 2017), and caret 6.0-78 (Kuhn, 2017). Replication code for this article is available at [ url coming soon, we are still exploring permanent homes for your code ].” If you would like to learn more about citations in R, we recommend: http://www.blopig.com/blog/2013/07/citing-r-packages-in-your-thesispaperassignments/ If you would like to learn more about citations in Python, we recommend: https://www.scipy.org/citing.html. We realize that citation standards for software are still evolving, so please ask if you have any questions.

2) Each of your papers should acknowledge the funders of the Fragile Families and Child Wellbeing Study and the funders of the Fragile Families Challenge. Therefore, we ask you add these sentences to the acknowledgements section of your paper: “Funding for the Fragile Families and Child Wellbeing Study was provided by the Eunice Kennedy Shriver National Institute of Child Health and Human Development through grants R01HD36916, R01HD39135, and R01HD40421 and by a consortium of private foundations, including the Robert Wood Johnson Foundation. Funding for the Fragile Families Challenge was provided by the Russell Sage Foundation.”

3) Several reviewers who were not part of the Challenge found the papers slightly confusing. Although we previously told you not to describe the Challenge, we think that was a mistake. You are writing a paper for a special issue of a journal, not a book chapter. Therefore, we would ask that you add one paragraph in the introduction of your paper providing a brief overview of the Challenge. Obviously the entire Challenge cannot be described in one paragraph, so you can cite our introduction to the special issue to provide more information. For now you can cite the introduction as Salganik, Lundberg, Kindel, and McLanahan “Introduction to the special issue on the Fragile Families Challenge.” We think this change will help make the articles more self-contained and will therefore increase their impact.

4) We encourage you to add a single paragraph in the introduction section of your paper that provides a roadmap to your paper. For example, “In Section 2 we describe our approach to data preparation. Then, in Section 3 we describe our procedure for variable selection. In Section 4, we describe the different models we used for prediction and compare their performance. In Section 5, we attempt to interpret the predictive models. The paper concludes with recommendations for future research.” Although many short papers do not require this kind of roadmap, we think that it will be helpful given the interdisciplinary nature of the readership.

We are offering the two forms of support below to help you write the best paper possible.

1) Additional analyses. If the authors would like to undertake additional analyses that would require access to the holdout data, we would be happy to help facilitate that so long as all results are reported in the paper as post-Challenge results.

2) Talk with editors. We believe an open exchange often produces the best papers. If you have any questions please email us (fragilefamilieschallenge@gmail.com). If the authors would like to talk to us after having read through the reviews and charted a plan for the revisions, feel free to email us, and we would be happy to arrange that.

Regarding your code, some of you have already heard from us about our efforts to reproduce your results and others will hear from us soon. We hope that while you are revising and improving your paper, you will also revise and improve your code. You will receive more specific instructions from us soon.

Matt Salganik

January 24, 2018

Fragile Families Challenge data are now available

Researchers interested in using the data from the Fragile Families Challenge can now apply for access through the Office of Population Research data archive. We hope that this data will be used to replicate and extend research conducted during the Challenge. We also hope that this data will be used in teaching. Many participants in the Challenge began working on it in a class, and we’ve heard from professors that the Challenge provides a great learning opportunity.

FAQ

How long does the application process take?

The application is relatively short, and we expect that most applicants will have a response in 2 business days or less.

Do you have any resources that can help me use this in my class?

Students can watch the talks from the Fragile Families Challenge Scientific Workshop, and they can also watch a video of one of our getting started workshops, which motivates the Challenge and explains the data. Further, we think that many of the resources listed here may be helpful. Finally, once it is published, we think that the special issue of Socius about the Challenge will be especially helpful for students.

Are you going to allow new submissions on your CodaLab platform?

We used the open source CodaLab to manage submissions. Unfortunately, the instance of CodaLab that we used during the Challenge will no longer accept submissions. However, you can set up your own version of the modified form of CodaLab that we used; all our code is on GitHub. If you don’t want to manage CodaLab yourself and you don’t have access to a developer, we can put you in touch with the developer who did this work for us.

Ian Lundberg

December 21, 2017

Scientific Workshop Breakout Sessions

At the Fragile Families Challenge Scientific Workshop, we devoted Thursday afternoon to breakout sessions where participants could work on specific projects that grew out of the Challenge. In this blog post we’d like to describe what happened in the breakout sessions, what we learned, and what is going to happen going forward.

Demo for the new Fragile Families metadata API and front-end (led by Maya Phillips and Alex Kindel)

Background: One of the things that we discovered during the Challenge is that much of the metadata about the Fragile Families study is stored in pdf files of codebooks that are designed to be read by people and are not designed to be machine-actionable (in the sense that they are easy to process with code). During the Challenge, one of the participants—Gregory Gundersen—converted some of the existing documentation into a metadata API. We loved his idea so much that the Board awarded him the Foundational Prize, and we decided to try to build on what he did by creating more metadata, a more fully featured API, and a web front-end for the API.

The API workshop began with a brief presentation about the interface proof-of-concept, its intended audience, and the different design decisions that we made along the way. Participants provided positive feedback, which gave us confidence in our design decisions. We determined that providing an API independent of the front-end enables the widest audience to make use of the data: both for more technical users invoking the API directly and for less technical users relying on the guided functionality of the front-end. The workshop proceeded to discuss some of the core functionality before doing a brief code walkthrough and demonstration of the front-end’s main features.

API discussion led by Maya Phillips and Alex Kindel

Discussing the different API functions helped to affirm the idea that complex queries could be created by chaining simple functions together (e.g., search variables, display variable). Therefore, we will build a few simple functions that are designed to fit together, rather than many specific functions for different use cases. For example, we plan to enable Boolean searches over the metadata fields, making it possible to quickly combine multiple searches. We also discussed the different ways queries can be made to the API: through a local copy, through Python or R libraries, or through web requests to a remote server. Providing a web server presents some trade-offs (e.g., between consistency guarantees and query speed), so we sought feedback on how users might think about this trade-off. Given that the typical use case will involve only a couple of queries, we determined that consistency was more important than query speed, but we intend to provide a full CSV copy of the metadata in the event that researchers need to perform more intensive metadata analysis.
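
To make the idea of chaining simple functions concrete, here is a minimal Python sketch of how Boolean searches over metadata fields could be composed. It is not the API itself: the function names (search, contains, all_of, any_of, select), the field names, and the toy records are all assumptions made up for this illustration.

```python
# Toy metadata records; the variable names and fields are invented for illustration.
METADATA = [
    {"name": "v_edu_1", "label": "mother education level", "wave": 1, "topic": "education"},
    {"name": "v_fin_1", "label": "child support paid", "wave": 2, "topic": "finances"},
    {"name": "v_hlth_1", "label": "mother depression scale", "wave": 3, "topic": "health"},
    {"name": "v_edu_2", "label": "mother enrolled in school", "wave": 2, "topic": "education"},
]


def search(field, value):
    """Return a predicate that is True when record[field] equals value."""
    return lambda record: record.get(field) == value


def contains(field, text):
    """Return a predicate that is True when text appears in record[field]."""
    return lambda record: text.lower() in str(record.get(field, "")).lower()


def all_of(*predicates):
    """Boolean AND: every predicate must match."""
    return lambda record: all(p(record) for p in predicates)


def any_of(*predicates):
    """Boolean OR: at least one predicate must match."""
    return lambda record: any(p(record) for p in predicates)


def select(records, predicate):
    """Return the names of the variables whose metadata satisfy the predicate."""
    return [r["name"] for r in records if predicate(r)]


# Education variables that are either from wave 2 or have "level" in the label
query = all_of(
    search("topic", "education"),
    any_of(search("wave", 2), contains("label", "level")),
)
print(select(METADATA, query))
```

Because each search is just a small function, new combinations do not require new special-purpose functions, which is the design choice described above.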

Participants were hesitant about the use of PHP to implement the API. Although we initially chose PHP for the backend in order to coordinate with local web development and maintenance resources at Princeton, the workshop attendees stressed the importance of a community-maintainable and open-source code base to which others could add useful features and make updates as the technologies advance. Moving forward, we intend to use a modern web stack to design the API and front-end for the revised metadata. It seems that this is critically important to those who will be using the API in the future, and is the most in line with the vision the team has for the project moving forward.

Workshop participants were generally enthusiastic about the demo version of the metadata browser front-end. The demo was designed to prototype interactions with the basic features of the API. The group had several feature requests for the web interface, primarily revolving around search:

An option to save and aggregate multiple searches
An option to copy and paste metadata (especially variable names) directly into code
Logging user queries to provide the FFCWS community with additional information on which variables are being used
Supplying data users with more information on possible responses, especially missing values
Displaying additional information on variable groups in the search interface
Naming queries to provide easy, commonly used shortcuts into the search interface
Enabling search over all data fields, including tags and responses

We intend to implement several of these features before the public launch of the website.

Figure 1. Variable browser search interface at the time of the workshop.

Figure 2. Variable metadata page at the time of the workshop.

Maya will continue to develop the back-end codebase by doing rigorous testing and adding features that will support the new features requested on the front-end. Alex and Maya will continue to work closely together as they build and integrate this architecture. In particular, a near-term goal is to identify and implement a relational schema for storing a canonical copy of the metadata; this database will serve the API, which in turn will serve the front-end and related software packages.

Testing and improving the Docker containers that ensure reproducibility of the special issue of Socius (led by David Liu)

Background: One of the goals of the Fragile Families Challenge is to help promote a culture of open and reproducible research. Therefore, we required all participants in the Challenge to open source their submissions (code, predictions, and narrative explanations). This does not, however, ensure that it is easy for future researchers to reproduce the results of any of the Challenge submissions because we never checked that the code actually ran and future researchers may lack the appropriate dependencies to get the code to run. Therefore, for all the manuscripts in the special issue of Socius, we are ensuring reproducibility by re-running the code as part of the review process and packaging up the code and all the necessary dependencies in Docker containers that will make it easy for future researchers to download and run the code used in the papers in the special issue.

The reproducibility session began with a brief presentation of the project’s motivations and progress thus far. In addition to discussing the working of Docker, David Liu discussed patterns he has noticed in submitted code as well as a few suggested best practices for reproducibility.
It was particularly helpful to explain the background of Docker containers to the audience. For some, the explanation clarified the technical workings; the questions David received helped him better anticipate potential areas of confusion, which will be useful when releasing the containers to the public. A useful takeaway from the discussion on Docker is that the reproducibility work stands at the intersection of research and software engineering; many of the principles and best practices of software engineering are relevant to conducting reproducible research. Examples include writing code that is modular, assembling documentation during development, and testing the code as it is being written.
Next, David walked through a demo of how an actual submission was reproduced. This demo was particularly fruitful because the author of the submission (Tom Davidson) was in the audience and provided helpful commentary regarding his submission, which utilized neural networks. The demo illustrated how one would run Docker on Amazon Web Services and interact with the code. One of the undergraduates in attendance was able to follow the demo.
Overall, the session reinforced the community’s interest in viewing the open-sourced submissions. It was apparent that submitters were curious to see how others developed and implemented their models, beyond just the results themselves. In discussing and critiquing the code itself, we were able to better understand the author’s intentions and learn from their code development process. So reproducing and publicly publishing the code will satisfy a research need.
Looking ahead, David is nearing completion on five of the thirteen submissions written in Python and R, and he intends to complete the reproducibility work over the course of December and January. In the end, David will open source Docker containers for each of the journal’s models and include basic documentation regarding usage of the code. In addition, David will be able to provide recommendations for future social research software development to optimize code reproducibility. Finally, David can provide tips and guidelines for other journal editors on how to best reproduce journal submissions while also establishing a baseline for expected time commitment.

David Liu leading a discussion about reproducibility.

Assessing test-retest reliability of concept tags (led by Kristin Catena)

Background: One of the difficulties that participants encountered during the Challenge was selecting from the many, many survey questions that were available. Several participants asked us for a list of all questions related to education, for example. Such a list was not available. Therefore, we are now tagging all variables in the dataset with the social science concepts that they are attempting to measure.

As part of our new work on the FFCWS metadata infrastructure, we are adding a system of concept tags to the FFCWS variables so that users may more easily identify a list of variables related to a particular topic. For example, the concept tags would allow a data user to quickly identify all of the variables related to mental health or all the variables related to child support. It would also mark variables that are considered paradata – data that is about the survey administration but may not contain substantive information about the family (e.g., survey date, whether a particular participant completed a specific survey, etc.). Each variable will be assigned one or more concept tag(s) which will also be grouped into larger “umbrella” concepts. For example, mental health will be grouped under an umbrella of health while child support will be grouped under an umbrella of finances. When complete, the concept tags will be available through the metadata API and website (described above).
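
As a rough sketch of the structure described here, the Python below stores one or more concept tags per variable and groups tags under umbrella concepts. The two tag-to-umbrella pairs come from the examples in the text; the variable names and the function names are hypothetical, and the real system may be organized differently.

```python
from collections import defaultdict

# Tag-to-umbrella pairs taken from the examples in the text
UMBRELLA_OF = {"mental health": "health", "child support": "finances"}

# Hypothetical variables, each assigned one or more concept tags
VARIABLE_TAGS = {
    "var_a": ["mental health"],
    "var_b": ["child support"],
    "var_c": ["mental health", "child support"],
}


def variables_by_tag(variable_tags):
    """Map each concept tag to the list of variables that carry it."""
    index = defaultdict(list)
    for variable, tags in variable_tags.items():
        for tag in tags:
            index[tag].append(variable)
    return dict(index)


def variables_by_umbrella(variable_tags, umbrella_of):
    """Map each umbrella concept to the sorted variables tagged beneath it."""
    index = defaultdict(set)
    for variable, tags in variable_tags.items():
        for tag in tags:
            index[umbrella_of[tag]].add(variable)
    return {umbrella: sorted(names) for umbrella, names in index.items()}
```

A user who only wants a broad orientation can look up an umbrella concept, while a user with a specific question can look up an individual tag.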

At the Fragile Families Challenge Scientific Workshop, we held a breakout session to test and discuss the concept tag system. Each participant was given a questionnaire to code into a provided list of tags. We also saved time afterwards to discuss the process and list of tags. In general, the participants reported that the concept tag list would be very helpful to data users and that they thought the list includes the concepts they would hope to search for. Several participants from data science backgrounds noted that they thought the umbrella concepts would be very helpful for orienting their work as they got started, but that the specific concept tags would be less helpful for them. Participants with more of a social science background, on the other hand, were interested in both the umbrella concepts and specific concept tags.
After the workshop, the participants’ tags were compared with those assigned by content experts from the FFCWS staff. 205 variables from four different FFCWS surveys were each coded by a member of the FFCWS staff and two different participants of the workshop breakout session. 95% of all variables coded had at least one tester who tagged the variable with the same concept as the FFCWS staff. Further, in 60% of all cases the FFCWS staff and both testers applied the same tag to the variable. Only 11 of the 205 variables had zero agreement between testers and FFCWS staff. We are now reviewing these results to strengthen and clarify the list of concept tags before completing the process of assigning tags to all remaining FFCWS variables.
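
The agreement figures above can be computed with a few lines of code. The sketch below is not the analysis we ran; it assumes, for simplicity, that each coder assigns exactly one tag per variable, and it uses four made-up codings rather than the real data.

```python
def agreement_summary(codings):
    """Summarize agreement between staff and two testers.

    codings is a list of (staff_tag, tester1_tag, tester2_tag) tuples,
    one per variable. Single tags per coder are a simplifying assumption.
    """
    n = len(codings)
    at_least_one = sum(1 for staff, t1, t2 in codings if staff in (t1, t2))
    both = sum(1 for staff, t1, t2 in codings if staff == t1 == t2)
    return {
        "n": n,
        "share_at_least_one": at_least_one / n,
        "share_both": both / n,
        "n_zero_agreement": n - at_least_one,
    }


# Made-up example codings, not the workshop data
toy_codings = [
    ("health", "health", "health"),
    ("health", "health", "finances"),
    ("finances", "education", "finances"),
    ("education", "health", "finances"),
]
print(agreement_summary(toy_codings))

# Consistency check on the reported figures: 11 of 205 variables with zero
# agreement means 194 of 205 had at least one matching tester.
print(round((205 - 11) / 205, 3))
```

The last line gives roughly 0.946, which matches the reported 95% after rounding.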

Garrett Pace tagging variables with concept tags

Ian Lundberg tagging variables with concept tags

Steve McKay tagging variables with concept tags

Liberating question and answer texts from pdfs (led by Tom Hartshorne)

Background: As described above, one of the lessons from the Challenge was that we wanted to make more of the metadata available in machine-actionable formats. One example of this is the actual text of the survey questions. Right now, that information is spread across many different pdf files, which makes it cumbersome to search efficiently. Therefore, we want to make it easier for people to search and process the exact text of each question.

The goal of this project was to extract the exact text of the questionnaires out of PDF form and into a machine-readable CSV file. This would allow future researchers to efficiently locate questions that have a certain keyword in either the question itself or the possible responses. It would also allow our API and website (described above) to return the exact wording of the question associated with a particular variable.

Tom Hartshorne pitched the question text task to the group

Prior to the workshop, a procedure was iteratively developed to extract the text manually, but this proved time consuming. One of the goals going into the workshop was to use the collective expertise of the community to try to develop an automated way of scraping these PDFs. During the workshop’s afternoon breakout session, Tom Hartshorne introduced the problem to a group of Challenge participants. Some members of the group worked to sharpen the manual process by going through it themselves and pointing out additional information stored in the questionnaires that could be useful to researchers. For example, Nicole Carnegie proposed adding the skip pattern associated with each answer choice. This is something that had come up in her Challenge experience, but had not been considered by the Fragile Families team prior to the Workshop. The manual process was very helpful for understanding the nuances of the questionnaires such as the string of periods between each answer choice, the formatting of “Circle all that apply” questions, and the location of the skip pattern information.

Question text working group led by Tom Hartshorne

While some of the group worked on the manual process, others worked towards a possible automation of the process. Cambria Naslund led this group, writing a Python script that parses an HTML version of the questionnaire to liberate the question text. This code strips the variables of their prefixes, then searches the questionnaire for a matching name. Once it finds a match, it grabs all the text following the first paragraph break up to the last paragraph break before the next question. It then cleans up this text, separating the answers from the question text using the long string of periods found between each answer choice. Each of these answers is stored in its own column, along with any skip patterns that may be associated with that answer choice. The output of this code will still require some manual cleaning, but it should greatly shorten the manual effort required by this task. We’ve now moved the software development to GitHub.
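
To give a flavor of the period-splitting step, here is a minimal Python sketch. It is not the script written at the workshop. It assumes a simplified layout in which the question text comes first and each answer line has the form "answer text, run of periods, code"; the sample question, the four-period threshold, and the function name are invented for illustration.

```python
import re

# A run of four or more periods is treated as the separator (assumed threshold)
LEADER = re.compile(r"\.{4,}")


def split_question(block):
    """Split one question block into its question text and answer choices.

    Lines containing a run of periods are treated as answer choices;
    all other non-empty lines are treated as part of the question text.
    """
    question_lines = []
    answers = []
    for raw_line in block.strip().splitlines():
        line = raw_line.strip()
        if not line:
            continue
        parts = LEADER.split(line, maxsplit=1)
        if len(parts) == 2:
            answers.append({"answer": parts[0].strip(), "code": parts[1].strip()})
        else:
            question_lines.append(line)
    return " ".join(question_lines), answers


# Invented sample text, not taken from a real questionnaire
SAMPLE = """
Are you currently working?
Yes..........1
No...........2
"""

question, answers = split_question(SAMPLE)
print(question)
print(answers)
```

Real questionnaires are messier than this sample (for example, skip patterns and “Circle all that apply” questions), which is why the output of any automated approach still needs manual cleaning.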

Conclusion

Overall, it was a very productive afternoon. We want to again thank everyone that participated.

Leah Gillion talking with Sara McLanahan

Coffee was available in abundance

Matt Salganik

December 18, 2017

videos from the Fragile Families Challenge Scientific Workshop

The Fragile Families Challenge Scientific Workshop was a two-day event. On the first day, authors led small group discussions about their papers for the special issue of Socius, and we had a number of breakout projects. Then, on the second day, there were presentations from prize winners. These talks were livestreamed, and I’m happy to announce that the videos are now available. We hope that you enjoy them as much as we did.

Ian Lundberg

December 1, 2017

Media coverage of the Fragile Families Challenge by Princeton University

The Fragile Families Challenge has been featured in a post by the Princeton University Office of Communications. Read about where we are and what we’ve learned so far!

We also held a scientific workshop on the Fragile Families Challenge Nov. 16-17 at Princeton University. Many participants came and we have learned a lot about the models submitted to the Challenge. Watch the blog for a video link coming soon with recordings from the workshop!

Ian Lundberg

November 14, 2017

MDRC’s Approach to the Fragile Families Challenge

Guest post by Kristin E. Porter, Tejomay Gadgil, Sara Schell, Megan McCormick and Richard Hendra, MDRC.

Predictive analytics at MDRC

For more than 40 years, MDRC, a nonprofit, nonpartisan education and social policy research organization, has been a leader in pioneering the most rigorous research methods in social science research and in sharing what we have learned with the field. In this blog post, we describe how MDRC’s rigorous approach to methodology and data processing is reflected in our approach to predictive analytics, which we believe led to our first place performance in the two Fragile Families Challenge domains where we submitted models.

MDRC works with a wide variety of government agencies, nonprofits, and other social service providers to help them harness their data to better understand patterns of behavior, figure out what works, better manage caseload dynamics, and better target individuals for interventions. In particular, we are using predictive analytics to identify individuals’ likelihoods of achieving key outcomes, such as reaching a program participation milestone, finding employment, or reading at a proficient level.

MDRC researchers have developed a comprehensive predictive analytics framework that allows for rapid and iterative estimation of likelihoods (probabilities between 0 and 1) of adverse or positive outcomes. The framework includes analytic steps focused on (1) identifying the best samples for training statistical models and computing predictions; (2) processing and cleaning data; (3) creating and curating measures to include in modeling; (4) identifying the best modeling methods with an emphasis on ensembling; (5) estimating uncertainty in predictions; and (6) summarizing and interpreting results.

MDRC’s approach to the Fragile Families Challenge

MDRC applied several analytic steps in our predictive analytics framework to the Fragile Families Challenge (FFC) — those focused on data processing, creating and curating measures, and modeling methods. (The other steps simply did not apply given the nature of the challenge.) The following describes the underlying premises that guided our analyses:

1. Invest deeply in measure creation — combining both substantive knowledge and automated approaches.

At MDRC, about 90 percent of the effort in any predictive analysis is dedicated to creating measures that extract as much predictive information as possible from the raw data. Doing so requires both subject matter expertise and familiarity with the data collection processes and context. It also involves recognizing opportunities for encoding information that may seem irrelevant or ancillary.

Extracting information can involve creating new, aggregate measures that summarize across multiple raw measures. Luckily, the FFC data already includes many valuable “constructed variables” that summarize raw survey responses (for instance, a measure of whether a mother meets depression criteria was constructed from multiple individual questions). There were other opportunities to create more aggregate measures as well. However, doing so can be very time-consuming when the number of raw measures is large. Relying on subject matter knowledge to prioritize which aggregate measures will likely be most predictive is key.

Extracting information can also involve the collapsing of categories from a single measure. For example, measures from survey questions asking “Household member’s relationship to you” have 18 possible non-missing values (spouse, partner, respondent’s mother, etc.). These 18 values can be grouped into types of relationships that are meaningful when it comes to predicting a particular outcome. Subject matter knowledge about the population and the outcome of interest can be helpful in determining the best groupings (for instance, does it matter whether the household member is an adult or does the particular kind of relationship matter?). However, automated algorithms are also an essential tool. Such algorithms can mine text in the responses, do clustering, and/or check the distributions of response choices to inform grouping selections. We have developed functions that process hundreds of variables with similar structures and transform them with just a few lines of code. Combining these approaches with subject area judgement can produce powerful results.

2. “Missingness” is informative and should not be “imputed away.”

In the FFC, we did no imputation of missing values, and we did not delete observations with missing values. In the case of predictive analytics, MDRC views missing values as containing predictive information. That is, the missingness may be for unmeasured reasons that correlate with the outcome of interest. Imputation would overwrite this information, often with inaccurate information, as even the most sophisticated techniques rest on unverifiable assumptions.

Therefore, we coded all measures in the FFC data into a series of dummy variables. Each dummy corresponds to a response or grouping of responses, including those related to missingness. For example, for one measure in the mother questionnaire – “have a legal agreement or child support order” – we created three dummies that capture underlying reasons for missingness, as well as a dummy that captured the nonmissing response. We note here that by combining missingness codes, we are assuming that different types of missingness have similar predictive value.
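As an illustration of this kind of coding, here is a minimal Python sketch. It is not MDRC's code: the negative codes, the way they are grouped into three missingness reasons, and the measure name `support_order` are all hypothetical choices made for the example.

```python
# Hypothetical grouping of negative survey codes into reasons for missingness.
# The codes and the grouping are assumptions for illustration only.
MISSING_GROUPS = {
    "miss_not_interviewed": {-9},
    "miss_skipped": {-6, -7},
    "miss_refused_or_unknown": {-1, -2, -3},
}


def dummy_code(name, values):
    """Expand one coded measure into 0/1 dummies, keeping missingness as information."""
    columns = {}
    # One dummy per (grouped) reason for missingness.
    for label, codes in MISSING_GROUPS.items():
        columns[f"{name}_{label}"] = [int(v in codes) for v in values]
    # One dummy per observed non-missing response.
    for level in sorted({v for v in values if v >= 0}):
        columns[f"{name}_eq_{level}"] = [int(v == level) for v in values]
    return columns


responses = [1, -9, 2, -6, 1, -2]  # toy responses for six respondents
dummies = dummy_code("support_order", responses)
```

No value is imputed and no respondent is dropped: every row ends up with exactly one dummy equal to 1, whether the response was observed or missing.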

3. Eliminate unhelpful measures.

Because the number of measures available in the FFC data is large and because the coding of the survey responses was consistent across the measures, it made a lot of sense to automate the dummy creation described above. This multiplied the already large number of provided measures manifold. Not all of the resulting dummies held useful information. Therefore, we approached measure reduction as follows:

We only used measures from the mother, father and primary caregiver questionnaires, as these seemed to contain information relevant to the outcomes on which we were focusing (job training and eviction). When the same question was asked of all three, we only used the response from the questionnaire that corresponded to the primary caregiver at the age 9 follow-up (based on pcg5idstat). In doing this, we assumed the primary caregiver at the age 15 follow-up would be the same as the primary caregiver at the age 9 follow-up. If pcg5idstat was missing, we assumed the mother was the primary caregiver at the age 9 follow-up (as this was the case for 91 percent of the nonmissing responses). We included measures for the same primary caregiver in all previous waves.
Due to automation of dummy creation, we often ended up with dummies with only a very small number of 1’s or a very small number of 0’s. These measures held little useful information and we dropped them based on a custom filter.
We also ended up with many highly correlated dummy variables. We dropped all but one of each set of measures with a pairwise correlation greater than 0.9.
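The last two reduction steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the `min_count` cutoff stands in for the custom filter (whose actual rule is not described here), and keeping the first measure of each highly correlated set is an arbitrary choice.

```python
from math import sqrt


def is_near_constant(column, min_count=2):
    """True if a 0/1 column has fewer than min_count ones or zeros (hypothetical cutoff)."""
    ones = sum(column)
    return ones < min_count or len(column) - ones < min_count


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def reduce_measures(dummies, min_count=2, max_corr=0.9):
    """Drop near-constant dummies, then keep one dummy from each highly correlated set."""
    kept = {}
    for name, column in dummies.items():
        if is_near_constant(column, min_count):
            continue  # too few 1's or 0's to carry useful information
        if any(pearson(column, other) > max_corr for other in kept.values()):
            continue  # nearly duplicates a measure we already kept
        kept[name] = column
    return kept


toy = {
    "a": [1, 1, 0, 0, 1, 0],
    "b": [1, 1, 0, 0, 1, 0],  # identical to "a", so correlation is 1
    "c": [0, 0, 0, 0, 0, 1],  # only a single 1
    "d": [1, 0, 1, 0, 1, 0],
}
kept = reduce_measures(toy)
```

Because near-constant columns are removed first (with `min_count` of at least 1), every column reaching the correlation check has nonzero variance.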

4. Evaluate ‘learners’ based on out-of-sample performance.

In MDRC’s predictive analytics framework, we define a “learner” as some combination of (1) a set of predictors, (2) a modeling method or machine learning algorithm, and (3) any tuning parameters for the corresponding machine learning algorithm. For example, one learner might be defined as the Random Forest algorithm using all of our dummy variables, with the tuning parameter for the number of measures to select at each split set to 2.

We want to evaluate the performance of each learner based on how well it does when making predictions in new data — data not used for training or fitting the algorithm or model. Therefore, we use v-fold cross validation to mimic repeatedly fitting a model for a particular learner in one sample and then evaluating it in a different sample. For the FFC, we used 5-fold cross-validation. That is, we partitioned the training data into 5 folds (subsamples). We fit all learners in all but one of the folds. In the left-out “validation” fold, we computed predictions with each trained learner and computed the performance of each learner based on those predictions. The performance measure in the case of the FFC was Brier loss. We repeated the whole process 5 times such that each fold took a turn as the validation fold. The averages of the Brier loss scores were computed across all validation folds. (The entire process can be repeated multiple times in order to reduce the variance of the cross-validated estimates.)
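The cross-validation loop described above can be sketched in plain Python. The two toy learners and the toy data are hypothetical stand-ins, not the algorithms used in the FFC submission; the point is only to show the fold structure and the Brier loss computation.

```python
import random


def brier_loss(y_true, y_prob):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def make_folds(n, v=5, seed=0):
    """Randomly partition row indices 0..n-1 into v folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::v] for i in range(v)]


def cross_validate(learners, X, y, v=5, seed=0):
    """Return each learner's Brier loss averaged over the v validation folds."""
    scores = {name: [] for name in learners}
    for held_out in make_folds(len(y), v, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        X_train, y_train = [X[i] for i in train], [y[i] for i in train]
        X_valid, y_valid = [X[i] for i in held_out], [y[i] for i in held_out]
        for name, fit in learners.items():
            predict = fit(X_train, y_train)  # fit on all folds but one
            scores[name].append(brier_loss(y_valid, predict(X_valid)))
    return {name: sum(s) / len(s) for name, s in scores.items()}


def fit_base_rate(X, y):
    """Toy learner: always predict the training outcome rate."""
    rate = sum(y) / len(y)
    return lambda X_new: [rate for _ in X_new]


def fit_group_rate(X, y):
    """Toy learner: predict the training outcome rate among rows sharing the first feature."""
    overall = sum(y) / len(y)
    totals = {}
    for row, label in zip(X, y):
        s, c = totals.get(row[0], (0, 0))
        totals[row[0]] = (s + label, c + 1)
    return lambda X_new: [
        totals[r[0]][0] / totals[r[0]][1] if r[0] in totals else overall for r in X_new
    ]


X = [[i % 2] for i in range(40)]  # toy data: one binary predictor
y = [row[0] for row in X]         # outcome fully determined by the predictor
cv_scores = cross_validate({"base_rate": fit_base_rate, "group_rate": fit_group_rate}, X, y)
best_learner = min(cv_scores, key=cv_scores.get)
```

Each learner is refit from scratch in every iteration, so the validation fold never influences the model that is scored on it.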

For any given prediction problem, we cannot know which learner will perform best. Therefore, we define many learners. For the FFC, we ultimately defined only one set of predictors (which was all dummies we created), but we tried many machine learning algorithms designed for binary outcomes, and for many of the machine learning algorithms, we specified many combinations of tuning parameters.

5. Combine results from different learners with ensemble learning.

For our final model, we can select the learner with the best out-of-sample performance – the one with the lowest cross-validated Brier loss. Alternatively, we can combine multiple learners to achieve better performance than could be achieved by any single learner. This is referred to as ensemble learning. Many of the algorithms commonly used in predictive analytics, such as Random Forest and Gradient Boosting Machine algorithms, are examples of ensemble learning. However, we can also ensemble across these and other algorithms or learners (in our case, combinations of algorithms and tuning parameter specifications). There are multiple approaches to ensemble learning. Perhaps the most common approach is stacking, or Super Learning (van der Laan, Polley and Hubbard, 2007).¹ Our implementation of stacking produced an error at the last minute so our FFC submission relied on predictions from the best-performing learner. However, ensemble learning has the potential to further improve our results.
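To give a flavor of why combining learners can help, here is a deliberately simplified Python sketch. It is not the Super Learner algorithm and not MDRC's implementation: it only searches a grid for the convex weight on two learners' predictions that minimizes Brier loss. The predictions below are made-up numbers, and in practice they should be out-of-fold (cross-validated) predictions so the weight is not fit to data the learners have already seen.

```python
def brier_loss(y_true, y_prob):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def best_convex_weight(y, p1, p2, steps=20):
    """Grid-search the weight w in [0, 1] minimizing the loss of w*p1 + (1-w)*p2."""
    best_w, best_loss = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        blend = [w * a + (1 - w) * b for a, b in zip(p1, p2)]
        loss = brier_loss(y, blend)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# Made-up out-of-fold predictions from two hypothetical learners.
y = [1, 0, 1, 0]
p1 = [0.9, 0.4, 0.6, 0.1]
p2 = [0.6, 0.1, 0.9, 0.4]
weight, blended_loss = best_convex_weight(y, p1, p2)
single_loss = min(brier_loss(y, p1), brier_loss(y, p2))
```

In this toy case the two learners make different errors, so the blend has a lower Brier loss than either learner alone.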

More about MDRC

MDRC is committed to finding solutions to some of the most difficult problems facing the nation — from reducing poverty and bolstering economic self-sufficiency to improving public education and college graduation rates. We design promising new interventions, evaluate existing programs using the highest research standards, and provide technical assistance to build better programs and deliver effective interventions at scale. We work as an intermediary, bringing together public and private funders to test new policy-relevant ideas, and communicate what we learn to policymakers and practitioners — all with the goal of improving the lives of low-income individuals, families, and children.

For more about predictive analytics at MDRC, check out:

¹van der Laan, M., Polley, E. & Hubbard, A. (2007). Super Learner. Statistical Applications in Genetics and Molecular Biology, 6(1). Retrieved 8 Nov. 2017, from doi:10.2202/1544-6115.1309

Ian Lundberg

October 18, 2017

Submission Description by Brian J. Goode – Imputing Values and Feature Reasoning

Uncategorized No comments

This guest blog post is written by Brian J. Goode, Discovery Analytics Center, Virginia Tech. The author was a winner of an Innovation Award.

Overview

One of the primary challenges of the Fragile Families Challenge (FFC) was to create a robust submission that is able to handle missing data. Of the nearly 44 million data points in the feature set, 55% of these values were either null, missing, or otherwise marked as incomplete. Discarding these data amounts to a substantial amount of information loss, and can potentially skew the data if there is any systematic reason as to why the nulls appear in the rows that they do. Imputing missing values preserves information content that was present, but introduces specific assumptions on the imputed values that may not always be verifiable. To the degree possible, the submission titled ‘bjgoode’ made use of the survey questionnaire to establish imputation rules based on the survey structure and familial proximity. As a result of this, the number of missing values decreased to 38% of the data set. The remaining missing data were filled in with the most frequent values. The implementation is straightforward, but tedious. The procedure is described below and resources are given at the bottom of this article. Results are given by the Fragile Families Challenge, but much work still needs to be done to evaluate the efficacy of the approach.

Procedure

There are four different approaches taken to impute values as part of this submission:

Figure 1. The various pathways for filling in missing data are shown in this diagram. The order of imputing values begins within each survey. Then Cross M-F imputing is completed. Finally, Cross year substitutions are made. The procedure reduced the number of missing values from 55% to 38% of the entire dataset.

1. Within Survey.

Some surveys, such as the mother/father baseline survey, have multiple pathways that an interview can follow depending on specific circumstances. For example, there are whole blocks of questions that will be answered or not on the basis of whether the parents are romantically involved, partially romantically involved, or married. This means that by survey design, we can deduce that some questions are meant to have null values due to the pathway that was taken. For questions that are specific to the circumstance, there is little we can do. However, there were a number of repeated questions within these surveys that can be cross-linked. An example of this from the mother baseline survey (B5, B11, B22) is:

I’m going to read you some things that couples often do together. Tell me which ones you and [BABY’S FATHER] did during the last month you were together.

These questions were identified by text matching. When a value appeared in one path, it was transferred to the same question in the missing pathway. This reduced the number of missing values in the feature data from 55% to 51% missing.
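A minimal Python sketch of this within-survey transfer is shown below. It assumes exact matching of question wording after normalizing case and whitespace, which is a simplification; the variable names (`q_b5`, `q_b11`, `q_b22`, `q_c1`) and the question text are hypothetical.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different wordings match."""
    return " ".join(text.lower().split())


def fill_within_survey(question_text, responses):
    """Copy an observed answer to repeats of the same question in other pathways.

    question_text: {variable: wording}; responses: {variable: value or None}
    for a single respondent.
    """
    groups = defaultdict(list)
    for variable, text in question_text.items():
        groups[normalize(text)].append(variable)

    filled = dict(responses)
    for variables in groups.values():
        observed = [responses[v] for v in variables if responses.get(v) is not None]
        if observed:
            for v in variables:
                if filled.get(v) is None:
                    filled[v] = observed[0]  # transfer the value found in the answered pathway
    return filled


question_text = {
    "q_b5": "Did you and the father visit with friends?",
    "q_b11": "Did you and the  father visit with friends?",
    "q_b22": "did you and the father visit with friends?",
    "q_c1": "How many rooms are in your home?",
}
responses = {"q_b5": None, "q_b11": 1, "q_b22": None, "q_c1": None}
filled = fill_within_survey(question_text, responses)
```

Questions that appear in only one pathway, like `q_c1` here, stay missing because there is nothing to transfer.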

2. Cross M-F.

One other reason for having missing values is that only one parent is actively involved in the study. For these cases, there is likely to be only one survey out of the mother, father, or primary caregiver surveys for a given survey wave. In this case, we can impute data by finding related questions in each of these surveys within each wave. The value was transferred to each other matching survey question. The result of this is that some survey questions, instead of being answered strictly as mother, father, or primary caregiver, are answered from the more general prototype of ‘supporting adult’ when viewed in isolation. If the data were used to form a complex structure of a specific parent, the non-trivial assumption is that the parents would have answered similarly. During each wave, the format and structure of the mother/father surveys were very similar within each section. Using this, it was faster and more accurate to do the mapping by hand by specifying question ranges. The mappings are provided in the Github repository listed in the Resources section. This procedure was performed twice. The first iteration reduced the output from the within survey mapping from 51% to 45% missing values. The second iteration reduced the output from the cross year mapping from 39% to 38% missing values.

3. Cross Year (wave).

Missing values also appeared to be more common during the later waves of the survey. This is not surprising given that it is a longitudinal survey and there are expected to be dropouts. To fill in these values, an assumption was made that it is more likely that a given mother/father survey response will remain persistent across time than not. As a cautionary note, this is a very strong assumption, especially for questions that have a smaller time scale. To avoid probing into specific questions, and assessing whether or not to use the latest known value or some other method for imputing, the mean value across available years was taken (note: all answers to survey questions were encoded with ordinal values).
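As a rough sketch of the cross-year rule, the following Python function fills the missing waves of one matched question with the mean of the waves that were answered. The wave labels and values are made up, and the function assumes the questions have already been matched across waves.

```python
def fill_cross_year(values_by_wave):
    """Fill missing waves of one matched question with the mean of the observed waves."""
    observed = [v for v in values_by_wave.values() if v is not None]
    if not observed:
        return dict(values_by_wave)  # nothing observed in any wave, so nothing to carry over
    mean = sum(observed) / len(observed)
    return {wave: (mean if v is None else v) for wave, v in values_by_wave.items()}


# Ordinal answers to one question for one respondent, keyed by a made-up wave label.
answers = {"wave1": 2, "wave2": 4, "wave3": None, "wave4": None}
filled_answers = fill_cross_year(answers)
```

Note that the mean of ordinal codes need not be a valid response category (here it is 3.0), which is one reason the persistence assumption deserves the caution expressed above.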

The challenge here was to identify the related questions across years, because the same question was noted to be worded multiple different ways over multiple waves. This type of matching problem also has the characteristic that it is too cumbersome as a one-off instance to train an algorithm, yet also too tedious and error-prone for a human to match. The solution was to create a simple algorithm based on the Natural Language Toolkit (NLTK) in Python to identify similar questions by text. Having too few samples to properly train, a simple threshold was used to cluster questions into groups of related categories. However, thresholds have the ability to be both too conservative and too liberal in the error depending on the text and the type of changes that were made. Therefore, humans in the loop were included by having the script “propose” both correct and incorrect survey items within each cluster. A sample output is given here:

Figure 2. Example of output code for sets of related questions.

This process was much simpler and required little effort. However, without a gold standard for comparison, the exact accuracy of the algorithm cannot be stated with confidence. The code is available on the Github repository in the Resources section. Of the missing values, after the first cross M-F matching, the cross year matching reduced the number of missing values from 45% to 39%.
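The idea of threshold-based clustering of question text can be sketched as follows. This is not the code from the repository: to stay dependency-free it swaps in `difflib.SequenceMatcher` from the Python standard library in place of NLTK, and the 0.8 threshold, variable names, and question wordings are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two question wordings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_clusters(questions, threshold=0.8):
    """Greedily group questions whose wording is close to a cluster's first member.

    Returns a list of clusters, each a list of (variable, wording) pairs,
    intended for a human to review rather than to be trusted blindly.
    """
    clusters = []
    for variable, text in questions.items():
        for cluster in clusters:
            if similarity(text, cluster[0][1]) >= threshold:
                cluster.append((variable, text))
                break
        else:
            clusters.append([(variable, text)])  # no close match: start a new cluster
    return clusters


questions = {
    "m2_x": "How often do you see the baby's father?",
    "m3_x": "How often do you see the child's father?",
    "m3_y": "What is your total household income?",
}
clusters = propose_clusters(questions)
```

A lower threshold proposes more candidate matches at the cost of more false positives, which is why the proposals are reviewed by a person.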

4. Output Specific.

The last major addition to the imputing strategy was to mimic the output measure being investigated as best as possible. All of the model outputs (outcomes) were derived from survey features and made public on the Fragile Families Blog (e.g., Material Hardship). For most of the outputs, except for GPA, there was a history of previous responses from surveys in previous waves. Therefore, for each of the outputs, a feature was made to correspond to the output in each survey where applicable. This was particularly helpful for features like Material Hardship that were formed from multiple survey questions and had the added effect of acting like an “OR”. Consequently, this is where the biggest performance gain was seen, but had little effect on the number of missing values.

After the steps were applied above, the remaining 38% of missing values were imputed with the most frequent value from each feature. All features exhibiting no entropy (all same values) were removed.
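These final two steps can be sketched in a few lines of Python. The feature names and values are toy data, and ties for the most frequent value are broken by first occurrence, which is an arbitrary choice made for this example.

```python
from collections import Counter


def impute_most_frequent(column):
    """Replace None with the most frequent observed value in the column."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing observed, so leave the column as it is
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]


def finalize_features(features):
    """Impute each feature, then drop features whose values are all the same."""
    filled = {name: impute_most_frequent(col) for name, col in features.items()}
    return {name: col for name, col in filled.items() if len(set(col)) > 1}


features = {
    "f1": [1, None, 1, 2],
    "f2": [3, None, 3, 3],  # constant after imputation, so it carries no information
}
final = finalize_features(features)
```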

Results

The training and validation phase of modelling showed that linear regression models were best for ordinal outputs: GPA, Grit, and Material Hardship. The remaining model outputs were best fit by logistic regressions. Although L₁-regularization was implemented, for many of the outputs, the features were reduced to include only subjectively relevant features. For the case of Grit and Material Hardship, the features corresponding to the definition of the measure were picked. The feature combinations are too many to list here, but are shown in the code linked by the Resources section. Admittedly, this is not a fully automated procedure nor one grounded in theory, and is very likely to vary between researchers. However, I contend that this is evidence that we need to consider the larger model-system that includes both model design and resource constraints such as time. This will help us better understand how model development decisions impact the result and final implementation.

To fully understand the cost/benefit of the above imputation strategy one would need to conduct an ablation study and include other methods of imputation. Due to time constraints, that was not possible. But, from the design, matching the outputs appeared to show the greatest performance increase during the validation phase. As an approximate indicator of performance, the mean squared error (mse) and rank of each model using this data set is provided relative to the baseline here: FFC Results. Of note, the model is ranked 5th and 9th in the Material Hardship and Layoff outcomes respectively, but there were many better performing models. So, there is still an open question of the utility of this strategy in terms of overall performance, interpretability of imputing, and similarity of individual sample outputs.

What Next?

The work described above focuses on how data was imputed and selected to fill in missing values for the Fragile Families Challenge. However, more detailed analysis needs to be completed in order to reason about the strategy (or any strategy) with respect to the data, the challenge results, and the models themselves. This is currently ongoing and anticipated to be discussed in the forthcoming Socius submission as well as during the talk at the FFC Workshop on November 16th, 2017.

Author Details

Brian J. Goode, Discovery Analytics Center, Virginia Tech

I would like to thank Dichelle Dyson and Samantha Dorn for their help.

Resources

Github Repository: https://github.com/bjgoode/ffc-public

Matt Salganik

October 3, 2017

Upcoming event: The Future of Big Data and Survey Methods

Uncategorized No comments

We are excited to have a chance to discuss the Fragile Families Challenge as part of a panel at the University of Michigan, Institute for Social Research. The title of the panel is: The Future of Big Data and Survey Methods. Please join us at the event. More information is below.

Description:
New Data Science methods and mass collaborations pose both exciting opportunities and important challenges for social science research. This panel will explore the relationship between these new approaches and traditional survey methodology. Can they coexist, or even enrich one another? Matthew Salganik is one of the lead organizers of the Fragile Families Challenge, which uses data science approaches such as predictive modeling, mass collaboration, and ensemble techniques. Jeremy Freese is co-PI of the General Social Survey and of a project on collaborative research in the social sciences. Colter Mitchell has conducted innovative work combining biological data and methods with Fragile Families and other survey data sets.

Sponsored by the Computational Social Science Rackham Interdisciplinary Workshop and the Population Studies Center’s Freedman Fund.

Friday, 10/6/2017, 3:10 PM
Location: 1430 ISR-Thompson

Matt Salganik

September 29, 2017

Correction to prize winners

Uncategorized No comments

When newspapers have to correct a published article, they issue a correction that notes the errors in the prior version and how they have been corrected. Following this logic, this blog post explains a correction we have made to the prize winners blog.

At the close of the Challenge, one team (MDRC) mistakenly believed that the submission deadline, listed as 6pm UTC on Codalab, was 6pm Eastern Time. After the close of the Challenge at 2pm, they were unable to upload their submission. They emailed us very soon after the 2pm deadline indicating that they had misunderstood. Our Board of Advisors reviewed the case carefully and decided to accept this submission. We made this decision before we opened the holdout data.

When we actually evaluated the submissions with the holdout data, we downloaded all final submissions from Codalab and neglected to add the e-mailed MDRC submission to the set. The team noticed they were not on the final scores page and emailed us to ask. A week after opening the holdout set, we added their submission to the set, re-evaluated all scores, and discovered that this team had achieved the best score in eviction and job training, two prizes we had already awarded to other teams.

In consultation with our Board of Advisors, we decided to do three things.

First, we updated the final prize winners to recognize MDRC.

Second, we recognized that this was an unusual situation. Other teams had rushed to the 2pm deadline and might have scored better with a few extra hours of work. For this reason, we decided to create a new category: special honorary prizes. If MDRC won for an outcome, the second-place team (i.e. the team that was in first place at the close of the Challenge at 2pm) would be awarded a special honorary prize.

Third, we updated the prize winners figure and score ranks to include MDRC along with all submissions previously included.

All prize winners (final, progress, innovation, foundational, and special honorary) are invited to an all-expense-paid trip to Princeton University to present their findings at the scientific workshop.

Matt Salganik

September 15, 2017

Prize winners

Uncategorized 1 comment

The Fragile Families Challenge received over 3,000 submissions from more than 150 teams between the pilot launch on March 3, 2017, and the close on August 1, 2017. Each team’s final submission score on the holdout set is provided at this link. In this blog post, we are excited to announce the prize winners!

Final prizes

We are awarding prizes to the top-scoring submissions for each outcome, as measured by mean-squared error. The winners are:

GPA: sy (MIT Media Lab, Human Dynamics Group: Abdullah Almaatouq, Eaman Jahani, Daniel Rigobon, Yoshihiko Suhara, Khaled Al-Ghoneim, Abdulla Alhajri, Abdulaziz Alghunaim, Alfredo Morales-Guzman)
Grit: sy (MIT Media Lab, Human Dynamics Group: Abdullah Almaatouq, Eaman Jahani, Daniel Rigobon, Yoshihiko Suhara, Khaled Al-Ghoneim, Abdulla Alhajri, Abdulaziz Alghunaim, Alfredo Morales-Guzman)
Material hardship: haixiaow (Diana Stanescu, Erik H. Wang, and Soichiro Yamauchi; Ph.D. students, Department of Politics, Princeton University)
Eviction: MDRC (Kristin Porter, Richard Hendra, Tejomay Gadgil, Sarah Schell, and Meghan McCormick)
Layoff: Pentlandians (MIT Media Lab, Human Dynamics Group: Abdullah Almaatouq, Eaman Jahani, Daniel Rigobon, Yoshihiko Suhara, Khaled Al-Ghoneim, Abdulla Alhajri, Abdulaziz Alghunaim, Alfredo Morales-Guzman)
Job training: MDRC (Kristin Porter, Richard Hendra, Tejomay Gadgil, Sarah Schell, and Meghan McCormick)

Progress prizes

As promised, we are also awarding progress prizes to the top-scoring submissions for each outcome among submissions made by May 10, 2017 at 2pm Eastern Time. The teams that had the best submission as of this deadline are:

GPA: ovarol (Onur Varol, postdoctoral researcher at the Center for Complex Network Research, Northeastern University Networks Science Institute)
Grit: rap (Derek Aguiar, Postdoctoral Researcher, and Ji-Sung Kim, Undergraduate Student, Department of Computer Science, Princeton, NJ)
Material hardship: ADSgrp5
Eviction: kouyang (Karen Ouyang and Julia Wang, Princeton Class of 2017)
Layoff: the_Brit (Professor Stephen McKay, School of Social & Political Sciences, University of Lincoln, UK)
Job training: nmandell (Noah Mandell, Ph.D. candidate in plasma physics at Princeton University)

Foundational award

Greg Gunderson (ggunderson) produced machine-readable metadata that turned out to be very helpful for many participants. You can read more about the machine-readable metadata in our blog post on the topic. In addition to being useful to participants, this contribution was also inspirational for the Fragile Families team. They saw what Greg did and wanted to build on it. A team of about 8 people is now working to standardize aspects of the dataset and make more metadata available. Because Greg provided a useful tool for other participants, open-sourced all aspects of the tool, and inspired important changes that will make the larger Fragile Families project better, we are awarding him the foundational award.

Innovation awards

The Board of Advisors of the Fragile Families Challenge would also like to recognize several teams for particularly innovative contributions to the Challenge. For these prizes, we only considered teams that were not already recognized for one of the awards above. Originally, we planned to offer two prizes: “most novel approach using ideas from social science” and “most novel approach using ideas from data science.” Unfortunately, this proved very hard to judge because many of the best submissions combined data science and social science.

Therefore, after much deliberation and debate, we have decided to award two prizes for innovation. These submissions each involved teams of people working collaboratively. Each team thought carefully about the raw data and cleaned variables manually to provide useful inputs to the algorithm, much as a social scientist typically would. Each team then implemented well-developed machine learning approaches to yield predictive models.

We are recognizing the following teams:

bjgoode (Brian J. Goode, Virginia Tech, acknowledging Dichelle Dyson and Samantha Dorn)
carnegien (Nicole Carnegie, Montana State University, and Jennifer Hill and James Wu, New York University)

We are encouraging these teams to prepare blog posts and manuscripts to explain their approaches more fully. To be clear, however, there were many, many innovative submissions, and we think that a lot of creative ideas were embedded in code and hard to extract from the short narrative explanations. We hope that all of you will get to read about these contributions and more in the special issue of Socius.

Special honorary prizes

As explained in our correction blog post, our Board of Advisors decided to accept a submission that arrived shortly after the deadline, because of confusing statements on our websites about the hour at which the Challenge closed. This team (mdrc) had the best score for two outcomes (eviction and job training) and was awarded the final prize for each of these outcomes. Because we recognize that this was an unusual situation, we are awarding special honorary prizes to the second-place teams for each of these outcomes.

Eviction: kouyang (Karen Ouyang and Julia Wang, Princeton Class of 2017)
Job training: malte (Malte Moeser, Ph.D. student, Department of Computer Science, Princeton University)

Conclusion

Thank you again to everyone who participated. We look forward to more exciting results to come in the next steps of the Fragile Families Challenge, and we hope you will join us for the scientific workshop (register here) at Princeton University on November 16-17!

Alex Kindel

September 15, 2017

Understanding your score on the holdout set

Uncategorized No comments

We were excited to release the holdout scores and announce prize winners for the Fragile Families Challenge. Our guess is that some people were pleasantly surprised by their scores and that some people were disappointed. In this post, we provide more information about how we constructed the training, leaderboard, and holdout sets, and some advice for thinking about your score. Also, if you plan to submit to the special issue of Socius—and you should—you can request scores for more than just your final submission.

Constructing the training, leaderboard, and holdout set

In order to understand your score, it is helpful to know a bit more about how we constructed the training, leaderboard, and holdout sets. We split the data into three sets: 4/8 training, 1/8 leaderboard, and 3/8 holdout.

In order to make each dataset as similar as possible, we selected them using a form of systematic sampling. We sorted observations by city of birth, mother’s relationship with the father at birth (cm1relf), mother’s race (cm1ethrace), whether at least 1 outcome is available at age 15, and the outcomes at age 15 (in this order): eviction, layoff of a caregiver, job training of a caregiver, GPA, grit, and material hardship. Once observations were sorted, we moved down the list in groups of 8 observations at a time and, for each group, randomly selected 4 observations to be in the training set, 1 to be in the leaderboard set, and 3 to be in the holdout set. This systematic sampling helped reduce the chance of us getting a “bad draw” whereby the datasets would differ substantially due to random chance.
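For readers who want to see the mechanics, here is a minimal Python sketch of this kind of split. It is not the code we ran: it assumes the observations have already been sorted as described, uses made-up IDs, and handles a final incomplete group by assigning whatever labels come first after shuffling, which is an assumption for the example.

```python
import random
from collections import Counter


def systematic_split(sorted_ids, seed=0):
    """Assign sorted observations to sets in groups of 8: 4 training, 1 leaderboard, 3 holdout."""
    rng = random.Random(seed)
    labels = ["training"] * 4 + ["leaderboard"] + ["holdout"] * 3
    assignment = {}
    for start in range(0, len(sorted_ids), 8):
        group = sorted_ids[start:start + 8]
        shuffled = labels[:]
        rng.shuffle(shuffled)  # random assignment within the group
        for observation, label in zip(group, shuffled):
            assignment[observation] = label
    return assignment


ids = [f"id{i:02d}" for i in range(16)]  # made-up IDs, assumed already sorted
assignment = systematic_split(ids)
counts = Counter(assignment.values())
```

Because every group of 8 neighbors in the sorted list contributes to all three sets in fixed proportions, the sets cannot drift far apart on the sorting variables.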

All three datasets—training, leaderboard, holdout—include cases for which no age 15 outcomes have been collected yet. We decided to include these cases because data might be collected from them in the future and for some methodological research it might be interesting to compare predictions even if the truth is not known.

For the cases with no outcome data in the leaderboard set—but not the training and holdout sets—we added random imputed outcome data. We did this by randomly sampling outcomes with replacement from the observed outcomes in the leaderboard set. For example, the leaderboard included 304 observed cases for GPA and 226 missing cases imputed by random sampling with replacement from the observed cases.
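A minimal Python sketch of this random imputation is below. The GPA values are toy numbers and the function name is hypothetical; it simply draws each missing outcome, with replacement, from the observed outcomes and records which cases were imputed.

```python
import random


def impute_leaderboard_outcomes(outcomes, seed=0):
    """Fill missing outcomes by sampling with replacement from the observed outcomes.

    Returns the filled outcomes and a parallel list flagging imputed cases.
    """
    rng = random.Random(seed)
    observed = [y for y in outcomes if y is not None]
    filled, is_imputed = [], []
    for y in outcomes:
        if y is None:
            filled.append(rng.choice(observed))  # random draw, unrelated to the predictors
            is_imputed.append(True)
        else:
            filled.append(y)
            is_imputed.append(False)
    return filled, is_imputed


gpa = [3.0, None, 2.5, None, 3.5]  # toy leaderboard outcomes
filled_gpa, is_imputed = impute_leaderboard_outcomes(gpa)
```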

Randomly imputing outcome data is a bit unusual. Our main reason for setting up the leaderboard this way was to develop a method for assessing model overfitting without opening the holdout set. In scientific challenges like the Fragile Families Challenge, participants can continuously improve their leaderboard scores over time, providing the appearance of real progress in constructing a good model. But, when assessed with the holdout set, that progress turns out to be an illusion: the final score is much worse than expected. This scenario is what happens when participants overfit their models to the leaderboard data. Because of this property, the leaderboard is a bad measure of progress: it misleads participants about the quality of their models. So, when calculating leaderboard score we used both real outcome data and randomly imputed outcome data. The imputed subset is randomly drawn, which means that score improvement on those observations over time is a clear indicator of overfitting to the leaderboard set. By disaggregating leaderboard scores into a real data score and an imputed data score behind the scenes, we were able to model how well participant submissions would generalize without looking at the holdout set.
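The disaggregation can be illustrated with a short Python sketch. The numbers are made up and the function name is hypothetical; it splits one submission's mean squared error into the part computed on real outcomes and the part computed on randomly imputed outcomes.

```python
def mse(pairs):
    """Mean squared error over (truth, prediction) pairs."""
    return sum((y - p) ** 2 for y, p in pairs) / len(pairs)


def disaggregate_score(truth, is_imputed, predictions):
    """Split a leaderboard score into real-data, imputed-data, and combined MSE."""
    rows = list(zip(truth, predictions, is_imputed))
    real = [(y, p) for y, p, flag in rows if not flag]
    imputed = [(y, p) for y, p, flag in rows if flag]
    return {
        "real": mse(real),
        "imputed": mse(imputed),
        "combined": mse(real + imputed),
    }


truth = [3.0, 2.0, 4.0, 1.0]             # leaderboard outcomes (last two imputed)
is_imputed = [False, False, True, True]
predictions = [3.0, 3.0, 2.0, 1.0]       # one made-up submission
scores = disaggregate_score(truth, is_imputed, predictions)
```

Tracking the imputed score across a participant's successive submissions is what reveals overfitting: since those outcomes are random, a steadily falling imputed score cannot reflect a better model.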

If you would like to learn more about the problems with leaderboard scores, Moritz Hardt, a member of our Board of Advisors, has a paper on this problem: http://proceedings.mlr.press/v37/blum15.html.

Interpreting your score on the holdout set

You might be pleasantly surprised that your score on the holdout set was better than your score on the leaderboard set, or you might be disappointed that it was worse. Here are a few reasons that they might be different:

Overfitting the leaderboard: One reason that your performance might be worse on the holdout set is overfitting to the leaderboard. If you made many submissions and your submissions seemed to be improving, this might not actually be real progress. We had an earlier post on how the leaderboard can be misleading. Now that the first stage of the Challenge is complete, when you request scores on the holdout set, we will send you a graph of your performance over time on the real outcome data in the leaderboard as well as your performance on the imputed outcome data in the leaderboard. If your performance seems to be improving over time on the imputed outcome data, that is a sign that you have been overfitting to the leaderboard.

For example, consider these two cases where the red line shows performance on the imputed leaderboard data and the blue line shows performance on the real leaderboard data.

In the first case, the performance on the real data improved (remember lower is better) and performance on the imputed data did not improve. In the second case, however, performance on the imputed data improves quite a bit, while performance on the real data remains relatively static. In this case, we suspect overfitting to the leaderboard, and we suspect that this person will perform worse on the holdout set than the leaderboard set.

Noisy leaderboard signal: One reason that your score might be better on the holdout set is that the leaderboard set included the randomly imputed outcome data. Your predictions for these cases were probably not very good, and there are no randomly imputed outcome cases in the holdout set (cases with no outcome data in the holdout set are ignored).

Random variation: One reason that your score on the holdout set could be higher or lower is that there are not a huge number of cases in the leaderboard set (530 people) or the holdout set (1,591 people). Also, roughly one third of these cases are missing outcome data. With so few cases, you should expect your score to vary somewhat from dataset to dataset.

Conclusion

We hope that this background information about the construction of the training, leaderboard, and holdout sets helps you understand your score. If you have any more questions, please let us know.

Matt Salganik

August 1, 2017

Fragile Families Challenge, next steps

Uncategorized No comments

Stage one of the Fragile Families Challenge, the predictive modeling stage, ended today at 2pm ET. We are grateful to everyone who participated. This is not, however, the end of the Fragile Families Challenge. In fact, there are many important and exciting things to come. We will be:

hosting the Fragile Families Challenge meet-up at the American Sociological Association Annual Meeting, Sunday August 13 at 2pm.
awarding prizes.
open sourcing the submissions to the Challenge.
publishing a single paper presenting the design and results from the Challenge.
publishing a special issue of Socius, a new open access journal published by the American Sociological Association, about the Fragile Families Challenge.
hosting the Fragile Families Challenge Scientific Workshop on November 16 & 17.
conducting research to discover important and unmeasured factors, prioritize issues for intervention, and compare modeling approaches.

We are looking forward to all of the next steps in the Fragile Families Challenge.

Matt Salganik

August 1, 2017

Fragile Families Challenge Scientific Workshop, Nov 16 & 17

Uncategorized No comments

We are happy to announce the Fragile Families Challenge Scientific Workshop will take place November 16th and 17th (Thursday and Friday) at Princeton University. The workshop is open to everyone interested in the Challenge, and we will be livestreaming it for people who are not able to travel to Princeton (note: videos of the talks are now available).

On Thursday, we will meet in Palmer House (map). On Friday, we will meet in Wallace Hall 300 (map).

The schedule will be:

Thursday: Workshop of submissions to the special issue of Socius and breakout projects. All are welcome regardless of whether you have written a paper.
Friday: Presentations by prize winners. All are welcome to attend, regardless of whether you have won a prize.

If you plan to join us, please complete the registration form.

Thursday, November 16

The first day of the Fragile Families Challenge Scientific Workshop will be devoted to: 1) workshopping papers submitted to the Special Issue of Socius on the Fragile Families Challenge and 2) working on breakout projects.

Workshopping papers

Before the workshop, each participant will be sent 3 papers to read. Participants are expected to read these papers before arriving at the workshop. This will ensure that we have a lively and focused discussion. If you are unable to read the papers ahead of time, please let us know and plan to arrive at lunch time.

Then at the workshop, each paper will be discussed for 45 minutes in a series of parallel roundtable sessions. There will be no presentations by the author because everyone will have read the paper ahead of time. Instead, each session will begin with very brief comments by a pre-assigned moderator, and then there will be a group discussion, which will be facilitated by the moderator. We expect these to be lively, fascinating, and generative discussions.

“Black Box Models and Sociological Explanations: Predicting GPA Using Neural Networks” by Thomas Davidson
“Humans in the Loop: Priors and Missingness on the Road to Prediction” by Connor Gilroy, Anna Filippova, Ridhi Kashyap, Antje Kirchner, Allison Morgan, Kivan Polimis, Adaner Usmani, and Tong Wang
“Privacy, ethics, and high-dimensional social science data: A case study of the Fragile Families Challenge” by Ian Lundberg, Arvind Narayanan, Karen E.C. Levy, and Matthew J. Salganik
“Making the analysis of complex survey data more efficient, reliable, and enjoyable: A case study from the Fragile Families Challenge” by Alexander T. Kindel, Kristin Catena, Tom Hartshorne, Kate Jaeger, Dawn Koffman, Sara S. McLanahan, Maya Phillips, Shiva Rouhani, and Matthew J. Salganik
“Modeling and Decision Making with Social Systems: Lessons Learned from the Fragile Families Challenge” by Brian Goode, Debanjan Datta, and Naren Ramakrishnan
“The Pentlandians ensemble: Winning models for GPA, grit, and layoff in the Fragile Families Challenge” by Daniel Rigobon, Eaman Jahani, Yoshihiko Suhara, Khaled AlGhoneim, Abdulaziz Alghunaim, Alex Pentland, and Abdullah Almaatouq
“Predicting material hardship using machine learning” by Erik H. Wang, Diana Stanescu, and Soichiro Yamauchi
“The challenges of data science from social science: Using social science knowledge in the Fragile Families Challenge” by Stephen McKay
“Predictive features of children GPA in Fragile Families” by Naijia Liu, Hamidreza Omidvar, and Jinjin Zhao
“Variable selection and parameter tuning for BART modeling in the Fragile Families Challenge” by James Wu and Nicole Carnegie

Breakout activities

With so many amazing people all in one place, we also wanted to leave time for ideas that you propose, either ahead of time or as a result of the workshopping of the papers. Any participant can propose an idea, and then people can choose which one they want to work on. We’re also going to propose the following projects:

Code walkthrough for the new Fragile Families metadata API (led by Maya Phillips)
Demo and testing for new metadata website (led by Alex Kindel)
Testing and improving the Docker container that ensures reproducibility of the special issue (led by David Liu)
Assessing test-retest reliability of concept tags (led by Kristin Catena)
Help us digitize question and answer texts (led by Tom Hartshorne)

Schedule for Thursday

08:30 – 09:00 Breakfast
09:00 – 09:15 Intro
09:15 – 10:00 Round 1 of papers
10:00 – 10:15 Break
10:15 – 11:00 Round 2 of papers
11:00 – 11:15 Break
11:15 – 12:00 Round 3 of papers
12:00 – 1:00 Lunch
1:00 – 1:30 Discussion of project ideas
1:30 – 5:00 Breakout activities
5:00 – 6:00 Break
6:00 – ??? Dinner

Friday, November 17

The second day of the Fragile Families Challenge Scientific Workshop will be devoted to presentations from the organizers and prize winners. Videos of these talks are now available.

8:30 – 9:00. Breakfast
9:00 – 9:45. Welcome, Overview of the Fragile Families Challenge
- Matthew J. Salganik, Professor of Sociology, Princeton University
- Sara S. McLanahan, William S. Tod Professor of Sociology and Public Affairs at Princeton University and Principal Investigator of the Fragile Families and Child Wellbeing Study
9:45 – 10:00. Break
10:00 – 11:15. Presentations from progress prize winners and discussion
- Onur Varol, Postdoctoral Researcher, Center for Complex Network Research, Networks Science Institute, Northeastern University
- Julia Wang, Princeton University
- Stephen McKay, School of Social & Political Sciences, University of Lincoln, UK
11:15 – 11:30. Break
11:30 – 12:00. Presentation from foundational prize winner and discussion
- Gregory Gundersen, PhD Student in Computer Science, Princeton University
12:00 – 1:00. Lunch
1:00 – 2:00. Presentations from innovation prize winners and discussion
- Nicole Carnegie, Assistant Professor of Statistics, Montana State University
- Brian J. Goode, Research Scientist, Discovery Analytics Center, Virginia Tech
2:00 – 2:30. Break
2:30 – 4:00. Presentations from final prize winners and discussion
- Kristin Porter, Senior Associate, MDRC
- Diana Stanescu, Erik H. Wang, and Soichiro Yamauchi, Ph.D. students, Department of Politics, Princeton University
- Abdullah Almaatouq (MIT), Eaman Jahani (MIT), Daniel E. Rigobon (MIT), Yoshihiko Suhara (Recruit Institute of Technology and MIT)
4:00 – 4:30. Break
4:30 – 5:00. What’s next
- Matthew J. Salganik, Professor of Sociology, Princeton University
- Sara S. McLanahan, William S. Tod Professor of Sociology and Public Affairs at Princeton University and Principal Investigator of the Fragile Families and Child Wellbeing Study

We will update this page as we have more information. If you have any questions about the Scientific Workshop, please email us.

Matt Salganik

July 20, 2017

Getting scores on holdout data

Uncategorized 2 comments

As described in an earlier blog post, there will be a special issue of Socius devoted to the Fragile Families Challenge. We think that the articles in this special issue would benefit from reporting their scores on both the leaderboard data and the holdout data. However, we don’t want to release the holdout data on August 1 because that could lead to non-transparent reporting of results. Therefore, beginning on August 1, we will do a controlled release of the scores on the holdout data. Here’s how it will work:

All models for the special issue must be submitted by August 1.
Between August 1 and ~~October 1~~ October 16 you can complete a web form requesting scores on the holdout data for a list of the models. We will send you those scores.
You must report all the scores you requested in your manuscript or the supporting online material. We are requiring you to report all the scores that you request in order to prevent selective reporting of especially good results.

We realize that this procedure is a bit cumbersome, but we think that this extra step is worthwhile in order to ensure the most transparent reporting possible of results.

Submit your request for scores here.

Matt Salganik

July 20, 2017

Event at the American Sociological Association Meeting

Uncategorized No comments

We are happy to announce that there will be a Fragile Families Challenge event Sunday, August 13 at 2pm at the American Sociological Association Annual Meeting in Montreal. We will gather at the Fragile Families and Child Wellbeing Study table in the Exhibit Hall (220c). We are the booth in the back right (booth 925). This will be a great chance to meet other participants, share experiences, and learn more about the next stages of the mass collaboration and the Fragile Families study more generally. See you in Montreal!

Matt Salganik

July 19, 2017

A Data Pipeline for the Fragile Families Challenge

Uncategorized 1 comment

Guest blog post by Anna Filippova, Connor Gilroy, and Antje Kirchner

In this post, we discuss the challenges of preparing the Fragile Families data for modeling, as well as the rationales for the methods we chose to address them. Our code is open source, and we hope other Challenge participants find it a helpful starting point.

If you want to dive straight into the code, start with the vignette here.

Data processing

The people who collect and maintain the Fragile Families data have years of expertise in understanding the data set. As participants in the Fragile Families Challenge, we had to use simplifying heuristics to get a grasp on the data quickly, and to transform as much of it as possible into a form suitable for modeling.

A critical step is to identify different variable types, or levels of measurement. This matters because most statistical modeling software transforms categorical covariates into a series of k – 1 binary variables, while leaving continuous variables untransformed. Because categorical variables are stored as integers, with associated strings as labels, a researcher could just use those integers directly in a model instead, but there is no guarantee that they would be substantively meaningful. For interpretation, and potentially for predictive performance, accounting for variable type is important.
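To make the k – 1 transformation concrete, here is a small sketch of dummy coding written by hand. Statistical software does this automatically for variables marked as categorical; the function name and the example codes are hypothetical.

```python
def dummy_code(values):
    """Turn a categorical column into k - 1 binary columns.

    The smallest level is used as the reference category and gets no
    column of its own. Returns the coded rows and the non-reference levels.
    """
    levels = sorted(set(values))
    others = levels[1:]  # levels[0] is the reference category
    rows = [[1 if v == level else 0 for level in others] for v in values]
    return rows, others

# Hypothetical integer codes for a three-category variable.
rows, levels = dummy_code([1, 2, 3, 2])
```

Entering the raw codes 1, 2, 3 as a single numeric covariate would instead force the model to treat category 3 as "three times" category 1, which is rarely meaningful for unordered categories.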

This seems like a straightforward problem. After all, it is typically clear whether a given variable is categorical or continuous from the description in the codebook. With a handful of variables, classifying them manually is a trivial task, but this is impossible with over 12,000 variables. An automated solution that works well for the majority of variables is to leverage properties of the Stata labels, using haven, to convert each variable into the appropriate R class—factor for categorical variables, numeric for continuous. We previously released the results of this work as metadata, and here we put it to use.

A second problem similarly arises from the large number of variables in the Fragile Families data. While some machine learning models can deal with many more parameters than observations (p >> n), or with high amounts of collinearity among covariates, most imputation and modeling methods run faster and more successfully with fewer covariates. Particularly when demonstrating or experimenting with different modeling approaches, it’s best to start out with a smaller set of variables. If the constructed variables represent expert researchers’ best attempts to summarize, consolidate, and standardize survey responses across waves, then those variables make a logical starting point. Fortunately, most of these variables can be identified with a simple regular expression.

Finally, to prepare for imputation, Stata-style missing values (labelled negative numbers) need to be converted to R-style NAs.
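The equivalent step in Python might look like the sketch below, with `None` standing in for R's NA. It assumes that every negative value is a missing-data code, which matches the description above but would be wrong for any variable that can legitimately be negative; the function name is hypothetical.

```python
def stata_missing_to_none(column):
    """Replace negative missing-data codes with None.

    Assumption: all negative values in the column are missing codes.
    """
    return [None if (v is not None and v < 0) else v for v in column]

# Hypothetical column with two negative missing codes.
cleaned = stata_missing_to_none([3, -9, 0, -1, 12.5])
```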

Missing data

Data may be missing in a (panel) study for many reasons, including a respondent’s unwillingness to answer a question, a “don’t know” response, skip logic (for questions that do not apply to a given respondent), and panel attrition (for example, due to difficulties locating families). Additional missing data might be due to data entry errors and, particularly relevant for the Challenge, anonymization to protect sensitive information of members of a particularly vulnerable population.

What makes missing data such a challenge for computational approaches? Many statistical algorithms operate on complete data, often obtained through listwise deletion of cases. This effectively assumes that instances are missing completely at random. The Fragile Families data are not missing completely at random; moreover, the sheer amount of missingness would leave few cases remaining after listwise deletion. We would expect a naive approach to missingness to significantly reduce the predictive power of any statistical model.

Therefore, a better approach is to impute the missing data, that is, make a reasonable guess about what the missing values could have been. However, current approaches to data imputation have some limitations in the context of the Fragile Families data:

Standard packages like Amelia perform multiple imputation from a multivariate normal distribution, hence they are unable to work on the full set of 12,000 covariates with only 4,000 observations. This is also computationally intensive, taking several hours to run even when using a regularizing prior, a subset of variables, and running individual imputations in parallel.
Another promising approach would be to use Full Information Maximum Likelihood estimation. FIML estimation models sparse data without the need for imputation, thus offering better performance. However, no open-source implementation for predictive modeling with FIML exists at present.
We could also use the existing structure of the data to make logical edits. For instance, if we know a mother’s age in one wave, we can extrapolate this to subsequent waves if those values are missing. Carrying this idea a step further, we can make simple model-based inferences; if, for example, a father’s age is missing entirely, we can impute this from the distribution of differences between mother’s and father’s ages. This process, however, requires treating each variable individually.

To address some of these issues, our approach to missing data considers each variable in the data-set in isolation (for example cm1hhinc, mother’s reported household income at wave 1), and attempts to automatically identify other variables in the data-set that may be strongly associated with this variable (such as cm2hhinc, mother’s reported household income at wave 2 and cf1hhinc, father’s reported household income at wave 1). Assembling a set of 3 to 5 of such associations per variable allows us to construct a simple multiple-regression model to predict the possible value of the missing data for each column (variable) of interest.
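The idea can be illustrated with a deliberately simplified sketch. It is not the FFCRegressionImputation code: it ranks candidate variables by absolute correlation and then fits a simple regression on only the single best candidate, whereas the approach described above uses 3 to 5 predictors in a multiple regression. The function names and data values are hypothetical; the variable names are the ones mentioned above.

```python
from math import sqrt

def observed_pairs(x, y):
    """Pairs (x_i, y_i) where both values are observed."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]

def pearson(x, y):
    """Pearson correlation over jointly observed cases (0.0 if undefined)."""
    pairs = observed_pairs(x, y)
    n = len(pairs)
    if n < 3:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def impute_from_best_predictor(data, target):
    """Fill gaps in data[target] from its most strongly correlated column."""
    candidates = [name for name in data if name != target]
    best = max(candidates, key=lambda c: abs(pearson(data[c], data[target])))
    x, y = data[best], data[target]
    pairs = observed_pairs(x, y)
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    slope = sum((a - mx) * (b - my) for a, b in pairs) / sxx if sxx else 0.0
    intercept = my - slope * mx
    filled = []
    for a, b in zip(x, y):
        if b is not None:
            filled.append(b)                      # keep observed values
        elif a is not None:
            filled.append(intercept + slope * a)  # regression prediction
        else:
            filled.append(my)                     # fall back to the mean
    return best, filled

data = {
    "cm1hhinc": [10, 20, None, 40, 50],
    "cm2hhinc": [12, 22, 32, 42, 52],
    "cf1hhinc": [5, 1, 4, 2, 3],
}
best, filled = impute_from_best_predictor(data, "cm1hhinc")
```

The fallback to the mean when the predictor is also missing mirrors the problem noted below: to fill in missing data, some missing data must first be filled in.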

Our approach draws on two forms of multiple-regression models: a simple linear ordinary least squares regression, and a linear regression with lasso penalization. To evaluate their performance, we compare our approach to two alternative forms of imputation: a naive mean-based imputation, and imputation using the Amelia package. Holding constant the method we use to make predictions and the variables used, our regression-based approach outperforms mean imputation on the 3 categorical outcome variables: Eviction, Layoff, and Job Training. The Lasso imputation also outperforms Amelia on these variables, but the unpenalized regression imputation has mixed effects. Interestingly, mean imputation performs the best for GPA and Grit. For Material Hardship, mean imputation, Amelia, and linear regression performed similarly, but Lasso was significantly worse than those three approaches. Overall, even simple mean imputation performed better than using Amelia on this dataset.

The approach we used comes with a number of assumptions:

We assume that the best predictors of any given variable already exist in the Fragile Families dataset, and do not need significant processing. This is not an unreasonable assumption, as many variables in the dataset are collected across different waves, thus there may be predictable relationships between each wave.
Our tests above assume a linear relationship between predictor variables and the variable we impute, although our code has an option to also take into account polynomial effects (the 'degree' option available when using method='lasso').
To get complete predictions for all 4000 cases using the regression models, we needed to first impute means of the covariates used for the imputation. In other words, in order to fill in missing data, we paradoxically needed to first fill in missing data. FIML is one solution to this challenge, and we hope to see this make its way into predictive modelling approaches in languages like R or Python.

Our pipeline

We modularized our work into two separate repositories, following the division of labor described above.

For general data processing, ffc-data-processing, which

Works from the background.dta Stata file to extract covariate information.
Provides helper functions for relatively fast data transformation.

For missing data imputation, FFCRegressionImputation, which

Prepares the raw background.csv data and performs a logical imputation of age-related variables as we describe above.
Constructs a (correlation) matrix of strengths of relationships between a set of variables.
Uses the matrix to perform a regression-based prediction to impute the likely value of a missing entry.

For a technical overview of how these two bodies of code integrate with each other, check out the integration vignette. The vignette is an RMarkdown file which can be run as-is or freely modified.

The code in the vignette subsets to constructed variables, identifies those variables as either categorical or continuous, and then only imputes missing values for the continuous variables, using regression-based imputation. We chose to restrict the variables imputed for illustrative purposes, and to improve the runtime of the vignette. Users of the code can and should employ some sort of imputation strategy—regression-based or otherwise—for the categorical variables before incorporating the covariates into a predictive model.

Reflections

What seemed at the beginning to be a straightforward precursor to building predictive models turned out to have complexities and challenges of its own!

From our collaboration with others, it emerged that researchers from different fields perceive data problems very differently. A problem that might not seem important to a machine-learning researcher might strike a survey methodologist as critical to address. This kind of cross-disciplinary communication about expectations and challenges was productive and eye-opening.

In addition, the three of us came into this project with very different skillsets. We settled on R as a lingua franca, but drew on a much broader set of tools and techniques to tackle the problems posed by the Fragile Families Challenge. We would encourage researchers to explore all the programming tools at their disposal, from Stata to Python and beyond.

Finally, linking everyone’s efforts together into a single working pipeline that can be run end-to-end was a significant step by itself. Even with close communication, it took a great deal of creativity as well as clarity about desired inputs and outputs.

We hope that other participants in the Fragile Families Challenge find our tools and recommendations useful. We look forward to seeing how you can build on them!

Ian Lundberg

July 15, 2017

Helpful idea: Compare to the baseline

Uncategorized No comments

Participants often ask us if their scores on the leaderboard are “good”. One way to answer that question is with a comparison to the baseline model.

In the course of discussing how a very simple model could beat a more complex model, this post will also discuss the concept of overfitting to the training data and how this could harm predictive performance.

What is the baseline model?

We have introduced a baseline model to the leaderboard, with the username “baseline.” Our baseline prediction file simply takes the mean of each outcome in the training data, and predicts that mean value for all observations. We provided this file as “prediction.csv” in the original data folder sent to all participants.
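In code, the baseline amounts to the sketch below, assuming missing training outcomes are represented as `None` and skipped when computing the mean (the function name and values are hypothetical):

```python
def baseline_predictions(training_outcomes, n_predictions):
    """Predict the training-set mean of an outcome for every observation."""
    observed = [y for y in training_outcomes if y is not None]
    mean = sum(observed) / len(observed)
    return [mean] * n_predictions

# Hypothetical training outcomes for one variable, predictions for 5 cases.
predictions = baseline_predictions([2.0, 3.0, None, 4.0], n_predictions=5)
```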

How is the baseline model performing?

As of the writing of this post (12:30pm EDT on 15 July 2017), the baseline model ranks as follows, with 1 being the best score:

70 / 170 unique scores for GPA
37 / 128 for grit
60 / 99 for material hardship
37 / 96 for eviction
32 / 85 for layoff
30 / 87 for job training

In all cases except for material hardship, the baseline model is in the top half of scores!

A quick way to evaluate the performance of your model is to see the extent to which it improves over the baseline score.

How can the baseline do so well?

How can a model with no predictors outperform a model with predictors? One source of this conundrum is the problem of overfitting.

As the complexity of a model increases, the model becomes more able to fit the idiosyncrasies of the training data. If these idiosyncrasies represent something true about the world, then the more complex fit might also create better predictions in the test data.

However, at some point, a complex model will begin to pick up random noise in the training data. This will reduce prediction error in the training sample, but can make predictions worse in the test sample!

Note: Figure inspired by Figure 7.1 in The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, which provides a more thorough overview of the problem of overfitting and the bias-variance tradeoff.

How can this be? A classical result in statistics shows that the mean squared prediction error can be decomposed into the bias squared plus the variance, plus an irreducible noise term that no model can remove. Thus, even if additional predictors reduce the bias in predictions, they can harm predictive performance if they substantially increase the variance of predictions by incorporating random noise.
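In symbols, a standard textbook form of this decomposition is given below. The notation is an assumption added for illustration: the outcome is $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, the fitted model is $\hat{f}$, and the expectation is over training samples and the noise at a new point $x$.

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^2\right]
  = \underbrace{\sigma^2}_{\text{irreducible noise}}
  + \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
```

A predictor added to the model can shrink the bias term while enlarging the variance term by more, leaving the total larger than before.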

What can be done?

We have been surprised at how a seemingly small number of variables can yield problems of overfitting in the Fragile Families Challenge. A few ways to combat this problem are:

Choose a small number of predictors carefully based on theory.
Use a penalized regression approach such as LASSO or ridge regression.
- For an intuitive introduction to these approaches, see Efron and Hastie Computer Age Statistical Inference [book site], sections 7.3 and 16.2.
- The glmnet package in R [link] is an easy-to-use implementation of these methods. Many other software options are also available.
Use cross-validation to estimate your model’s generalization error within the training set. For an introduction, see chapter 12 of Efron and Hastie [book site].

But at minimum, compare yourself to the baseline to make sure you are doing better than a naive prediction of the mean!
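The cross-validation suggestion and the comparison to the baseline can be combined in one short sketch. Everything here is illustrative: the function names are hypothetical, folds are assigned by position rather than at random (shuffle first in practice), and the toy data are exactly linear so that the regression trivially beats the mean.

```python
def k_fold_mse(xs, ys, fit, k=5):
    """Estimate out-of-sample MSE with k-fold cross-validation.

    `fit(train_xs, train_ys)` must return a function mapping x to a prediction.
    """
    n = len(ys)
    squared_errors = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        squared_errors.extend((ys[i] - predict(xs[i])) ** 2 for i in test_idx)
    return sum(squared_errors) / n

def fit_mean(train_xs, train_ys):
    """Baseline: always predict the training mean."""
    mean = sum(train_ys) / len(train_ys)
    return lambda x: mean

def fit_line(train_xs, train_ys):
    """Simple one-predictor least squares regression."""
    n = len(train_xs)
    mx = sum(train_xs) / n
    my = sum(train_ys) / n
    sxx = sum((x - mx) ** 2 for x in train_xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(train_xs, train_ys)) / sxx
    return lambda x: my + slope * (x - mx)

xs = list(range(10))
ys = [2 * x + 1 for x in xs]  # toy data with no noise
baseline_error = k_fold_mse(xs, ys, fit_mean)
model_error = k_fold_mse(xs, ys, fit_line)
```

A model is worth submitting only if its cross-validated error is clearly below `baseline_error`; with real, noisy data the gap is often much smaller than on training data alone.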

Matt Salganik

July 12, 2017

Metadata about variables

Uncategorized No comments

We are happy to announce that Challenge participant Connor Gilroy, a Ph.D. student in Sociology at the University of Washington, has created a new resource that should make working with the Challenge data more efficient. More specifically, he created a csv file that identifies each variable in the Challenge data file as either categorical, continuous, or unknown. Connor has also open sourced the code that he used to create the csv file. We’ve had many requests for such a file, and Connor is happy to share his work with everyone! If you want to check and improve Connor’s work, please consult the official Fragile Families and Child Wellbeing Study documentation.

Connor’s resource is part of a tradition during the Challenge whereby people have open sourced resources to make the Challenge easier for others. Other resources include:

If you have something that you’d like to open source, please let us know.

Finally, Connor’s work was part of a larger team project at the Summer Institute in Computational Social Science to build a full data processing pipeline for the Fragile Families Challenge. Stay tuned for that blog post on Tuesday, July 18!

Matt Salganik

June 19, 2017

Call for papers, special issue of Socius about the Fragile Families Challenge

Uncategorized No comments

Socius Call for Papers
Special issue on the Fragile Families Challenge
Guest editors: Matthew J. Salganik and Sara McLanahan

Socius, an open access journal published by the American Sociological Association, will publish a special issue on the predictive modeling phase of the Fragile Families Challenge. All participants in the Fragile Families Challenge are invited to submit a manuscript to this special issue.

A strong manuscript for the special issue will describe the process of creating a submission to the Challenge and will describe what was learned during that process. For example, a strong manuscript will describe the different approaches that were considered for data preprocessing, variable selection, missing data, model selection, and any other steps involved in creating the final submission to the Challenge. Further, a strong manuscript will also describe how the authors decided among the many possible approaches. Finally, some manuscripts may seek to draw more general lessons about social inequality, families, the common task method, social science, data science, or computational social science. Manuscripts should be written in a style that is accessible to a general scientific audience.

The editors of the special issue may also consider other types of manuscripts that are consistent with the scientific goals of the Fragile Families Challenge. If you are considering submitting a manuscript different from what is described above, please contact the editors of the special issue at fragilefamilieschallenge@gmail.com before submitting your manuscript.

All papers will be peer reviewed, and publication is not guaranteed. However, there is no limit on the number of articles that will be accepted in the special issue. All published papers must abide by the terms and conditions of the Fragile Families Challenge, and must be accompanied by open source code and a data file containing predictions.

Submissions for the special issue must be received through the Socius online submission platform by ~~Sunday, October 1, 2017~~ Monday, October 16 at 11:59pm ET. If you have any questions about the special issue, please email fragilefamilieschallenge@gmail.com.

FAQ:

Do I need to describe an approach to predicting all six outcome variables in order to submit to the special issue?

No. We will happily consider papers that focus on one specific outcome variable.

Do I need to have a low mean-squared error in order for my paper to be published?

No. Predictive performance in the held-out dataset is only part of what we will consider. For example, a paper that clearly shows that many common strategies were not very effective would be considered a valuable contribution.

What if I can’t afford the Article Processing Charge?

Socius, like most open access journals, has an Article Processing Charge. This charge is required to keep Socius running, and it is in line with the charges at other open access journals. However, we strongly believe that the Article Processing Charge should not be a barrier to scientific participation. Therefore, the Fragile Families Challenge project will pay the Article Processing Charge for all accepted articles submitted by everyone except for tenure-track (or equivalent) faculty working in universities in OECD countries. In other words, we will cover the Article Processing Charge for undergraduates, graduate students, post-docs, and people working outside of universities. Further, we will pay the Article Processing Charge for all tenure-track (or equivalent) faculty working in universities outside the OECD.

If for any reason you think that the Article Processing Charge may be a barrier to your participation, please send us an email and we will try to find a solution: fragilefamilieschallenge@gmail.com.

How will you decide what manuscripts to accept for publication?

Articles in Socius are judged by four criteria: Accuracy, Novelty, Interest, and Presentation. In the case of this special issue, these criteria will be judged by the editors of the special issue, with feedback from reviewers and the editors of Socius. For the purposes of this special issue, here is how these criteria will be interpreted:

Accuracy: The key question is whether this analysis was conducted appropriately and accurately. Were the techniques used in the manuscript performed and interpreted correctly? Do the claims in the manuscript match the evidence provided?
Novelty: The key question is whether the manuscript will be novel to some social scientists or some data scientists. Because projects like the Fragile Families Challenge are not yet common, we expect that most submitted manuscripts will be somewhat novel.
Interest: The key question for the editors is whether the manuscript will be interesting to some social scientists or some data scientists. Will some people want to read this paper? Does it advance understanding of the Fragile Families Challenge and related intellectual domains?
Presentation: The key question is whether this manuscript communicates effectively to a diverse audience of social scientists and data scientists. We will also assess whether the figures and tables are presented clearly and whether the manuscript makes appropriate use of the opportunity for supporting online information. Because these manuscripts will be short, we expect that the supporting online information will play a key role.

Who is the audience for these papers?

All papers should be written for a general scientific audience that will include both social scientists and data scientists (broadly defined). In other words, when writing your paper you should imagine an audience similar to the audience at journals such as Science and Proceedings of the National Academy of Sciences (PNAS). We recommend reading some articles from these journals to get a sense of this style. Manuscripts that use excessive jargon from a specific field will be asked to make revisions.

Manuscripts should follow the length guidelines of a Report published in Science: 2,500 words, with up to 4 figures or tables. Additional materials should be included in supporting online materials. We will consider articles that deviate from these guidelines in some situations. Other aspects of the manuscript format will follow standard Socius rules.

Should we describe the Fragile Families Challenge in our paper?

No. There is no need to describe the Challenge in your paper. The special issue will have an introductory article describing the Challenge and data. You should assume that your readers will already have this background information.

Will the articles go through peer review?

Absolutely. All manuscripts will be reviewed by at least two people. Possible reviewers include: members of the board of the Fragile Families Challenge, qualified participants in the Challenge, members of the general reviewer pool at Socius, and other qualified researchers.

What are the requirements for the open source code?

The code must take the Fragile Families Challenge data files as an input and produce (1) all the figures and tables in your manuscript and supporting online materials and (2) your final predictions. The code can be written in any language (e.g., R, Stata, Python). The code should be released under the MIT license, but we will consider other permissive licenses in special situations.

How long will the review process take?

We don’t know exactly, but we are excited about having these results in the scientific literature as quickly as possible. Therefore, we will work as quickly as possible while maintaining the quality standards of the Fragile Families Challenge and Socius.

Will I have access to the holdout data when writing my paper? (added July 20, 2017)

No, but we will allow you to request scores for your models on the holdout as described in this blog post.

Will I have access to the Challenge data when writing my paper? (added July 27, 2017)

Yes. If you will submit to the Special Issue you can continue to use the Challenge data until the Special Issue is published. If you are not submitting to the Special Issue, then you should delete the Challenge data file on August 1. Finally, participants who want to continue to do non-Challenge related research with the Fragile Families and Child Wellbeing Study can, at any time, apply for access to the core Fragile Families data by following the instructions here: http://www.fragilefamilies.princeton.edu/documentation.

What if I want to submit to the special issue but I can’t exactly reproduce my submission to the Challenge? (added September 23, 2017)

Everyone in the Challenge was supposed to upload their code. But there are several reasons why participants might not be able to use their code to reproduce their submission, such as forgetting to set a seed or changes to the packages that were used in the submission. (If you are interested, here are some general tips for promoting reproducibility: Sandve et al. (2013), “Ten Simple Rules for Reproducible Computational Research.”)

If there is a tension between making your paper reproducible and making it match the submission to the Challenge exactly, you should opt to make your paper reproducible. If the code and predictions that you submit with your paper don’t exactly match what you submitted to the Challenge, you should include a note in your supporting online material explaining these differences and why they occurred. If this note will require additional information from us, such as the score of your reproducible results in the leaderboard data, we will provide it to you. We are happy to help you with these issues on a case-by-case basis.

I have another question, how can I ask it?

Send us an email: fragilefamilieschallenge@gmail.com.

Ian Lundberg

June 17, 2017

Helpful idea: Read prior research

Uncategorized No comments

Not an expert in child development, poverty, or family sociology? Participants often wonder how they can contribute if they have no prior knowledge of these fields. Luckily, there are a few resources to bring you up to speed quickly!

Fact sheet

The Fragile Families and Child Wellbeing Study (FFCWS) Fact Sheet can quickly introduce the key findings from the broader FFCWS. For instance, the study discovered that “single” parenthood is a bit of a misnomer; about half of the unmarried parents in the sample were actually living together when the child was born! Yet many of these couples subsequently separated.

Research briefs

Looking for more detailed information on a particular subfield? The Fragile Families Research Briefs provide accessible summaries of cutting-edge research using the data.

Publication collection

Want to know how social scientists are using the data right now? The Fragile Families publication collection lists hundreds of published articles and working papers using the Fragile Families and Child Wellbeing Study. If you want to see how social scientists have used the data and get ideas for variables you may want to include in your models, the publication collection is a good place to start.

Other publications

A more exhaustive list of published resources is available here.

Helpful ideas series

This is the first in a series of blog posts with helpful ideas to help you build better models – look for more to come soon! For email notifications when we make new posts, subscribe in the box at the top right of this page.

Ian Lundberg

June 16, 2017

Getting started quickly in the Fragile Families Challenge

Uncategorized No comments

Want to build your first submission to the Fragile Families Challenge in an hour? In this post, we’ll tell you the trick to getting started quickly: the constructed variables.

If you’ve never worked with the Fragile Families data before it can seem daunting. The background file contains 12,943 variables (columns) for 4,242 children (rows), but 56% of the cells in this matrix are missing! Participants often begin by trying to read all the documentation, clean all of the variables, and impute reasonable values for the missing cells. This quickly becomes demoralizing. What else can you do?

Our overall recommendation is to begin with the constructed variables. These 600 variables were “constructed” by the Fragile Families research staff in order to help future researchers, and they were constructed based on multiple reports in order to reduce missing data. For example, the variable cm1relf consolidates the key information from 5 questions asked of the mother about her relationship with the father at the birth of the child. The constructed variables are a great place to start because they:

represent constructs social scientists believe to be important
have very little missing data
are easy to identify because they begin with the letter c (e.g., cm1ethrace is the constructed wave 1 variable for the mother’s ethnicity and race)

There are a small number of exceptions to this convention. For instance, the variable t5tint is a constructed variable indicating whether the teacher was interviewed in wave 5. However, the vast majority of constructed variables begin with c.
When we say that constructed variables have little missing data, this statement is restricted to constructed variables that have some data at all. In other words, there are some constructed variables that are all NA in the Challenge file (e.g., cm1tdiff).
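For readers working in Python, the selection rule described above could be sketched roughly as follows. This is a minimal illustration on a toy table, not code from the Challenge organizers: the helper name select_constructed is hypothetical, the m1a3 column is a made-up non-constructed variable, and the rule only uses the "begins with c" convention, so exceptions such as t5tint are missed and would need to be added by hand.

```python
import pandas as pd

def select_constructed(background: pd.DataFrame) -> pd.DataFrame:
    """Keep the identifier plus columns starting with 'c' that have some data."""
    constructed = [
        col for col in background.columns
        # challengeID also starts with 'c', so exclude it from the prefix rule.
        if col.startswith("c") and col != "challengeID"
    ]
    subset = background[["challengeID"] + constructed]
    # Drop constructed variables that are entirely NA (like cm1tdiff above).
    return subset.dropna(axis="columns", how="all")

# Toy stand-in for background.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "challengeID": [1, 2, 3],
    "cm1ethrace": [1, 2, 1],
    "cm1tdiff": [None, None, None],
    "m1a3": [0, 1, 0],      # hypothetical non-constructed variable
    "t5tint": [1, 0, 1],    # constructed, but not caught by the prefix rule
})
constructed = select_constructed(toy)
print(list(constructed.columns))
```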

These constructed variables are more fully documented on p. 13-20 of the general study documentation. Further, they are also summarized in this participant-generated open-source dictionary.

A good strategy to get started quickly is to pick some constructed variables, build a very simple model, and get yourself on the leaderboard! You can always build up from there. Participants often begin with cm1ethrace, cf1ethrace, cm1edu, cf1edu, and cm1relf.

Even if you start with the constructed variables, you will be frustrated by missing data. As summarized in our blog post, there is no perfect solution to this problem. We recommend the following workflow:

Start with a small fraction of the total variables. Focus on imputing the missing values for this subset, rather than for all variables in the entire file.
Decide how to address informative missing values (e.g., -6, valid skip). For categorical variables, you might treat valid skips as their own category.
Impute remaining missing values with mean or median imputation. We know that mean or median imputation aren’t great, but they are a reasonable starting point, and you can move to model-based imputation later.
Fit models on your imputed dataset.
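The four steps above could look roughly like this in Python. This is only a sketch under stated assumptions: the toy data are invented, example_numeric is a hypothetical column name, build_design_matrix is a hypothetical helper, only the -6 code is handled (other missing-data codes would need similar treatment), and ordinary least squares stands in for whatever model you prefer.

```python
import numpy as np
import pandas as pd

VALID_SKIP = -6  # the valid-skip code mentioned in the workflow above

def build_design_matrix(df, categorical, numeric):
    """Steps 1-3: subset columns, handle valid skips, and median-impute."""
    # Categorical: one indicator per observed code, so -6 becomes its own
    # category; dummy_na=True adds a separate indicator for NA.
    cat_part = pd.get_dummies(
        df[categorical].astype("category"), dummy_na=True, dtype=float
    )
    # Numeric: treat the valid-skip code as missing, then fill with the median.
    num_part = df[numeric].replace(VALID_SKIP, np.nan)
    num_part = num_part.fillna(num_part.median())
    return pd.concat([cat_part, num_part], axis=1)

# Toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "cm1relf": [1, 2, -6, 1, 2, np.nan],
    "example_numeric": [20, 31, -6, 25, np.nan, 40],  # hypothetical variable
    "gpa": [3.0, 2.5, 3.5, 2.75, 3.25, 3.0],
})
X = build_design_matrix(toy, categorical=["cm1relf"], numeric=["example_numeric"])
X.insert(0, "intercept", 1.0)

# Step 4: fit a simple linear model by least squares and predict.
design = X.to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(design, toy["gpa"].to_numpy(), rcond=None)
predictions = design @ coef
```

Keeping the imputation in one function makes it easy to swap median imputation for a model-based approach later without touching the modeling step.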

Ian Lundberg

June 16, 2017

Constructed variables – data dictionary

Uncategorized No comments

We are happy to announce that Challenge participants Aarshay Jain, Bindia Kalra, and Keerti Agrawal at Columbia University have created a new resource that should make working with the Challenge data more efficient. More specifically, they created an alternative data dictionary for the constructed variables (FFC_Data_Dictionary.xlsx). They have made it available open-source here.

Their dictionary:

Summarizes constructed variable prefixes and suffixes
Categorizes questions by the respondent to and subject of the question
Provides examples of questions from a variety of substantive categories

As discussed in our blog post on getting started quickly, the constructed variables are a good place to start when choosing variables to include in your model. These variables are summarized on p. 13-20 of the general study documentation.

The official Fragile Families and Child Wellbeing Study site is still the authoritative source of documentation, but we hope this open source contribution helps you more quickly understand the variables available and how to find them.

The open-source movement is exciting because it unlocks the power of what we can do through collaboration. Much like a Wikipedia page benefits when hundreds of people view it and think about improvements they could make, so too will the open-source resources for the Fragile Families Challenge shine if others get involved when they think of possible improvements. If you think you can make this data dictionary better, please jump in, open-source your new version, and let us know so we can publicize it! In fact, Aarshay, Bindia, and Keerti would love to see these kinds of improvements. Likewise, we welcome any other open-source contributions that you think might make the Challenge better.

Many thanks to Aarshay, Bindia, and Keerti for making it easier for others to use the data!

Matt Salganik

June 16, 2017

getting started workshop, Princeton and livestream

Uncategorized No comments

We will be hosting a getting started workshop at Princeton on Friday, June 23rd from 10:30am to 4pm. This workshop will also be livestreamed at this link so even if you can’t make it to Princeton you can still participate.

During the workshop we will

Provide a 45 minute introduction to the Challenge and the data (slides)
Provide food and a friendly collaborative environment
Work together to produce your first submission

In addition to people just getting started, we think the workshop will be helpful for people who have already been working on the Challenge and who want to improve their submission. We will be there to answer questions both in person and through Google Hangouts during the entire event.

Logistics:

When: Friday, June 23rd from 10:30 to 4pm ET
Where: Julis Romo Rabinowitz Building, Room 399 and streaming here
RSVP: If you have not already applied to the Challenge, please mention the getting started workshop in your application. If you have already applied, please let us know that you plan to attend (fragilefamilieschallenge@gmail.com). We are going to provide lunch for all participants, and we need to know how much food to order.

This getting started workshop will be a part of the Summer Institute for Computational Social Science.

Ian Lundberg

June 10, 2017

Getting started with Stata

Uncategorized No comments

This post summarizes how to work on the Fragile Families Challenge data in Stata.

We only cover the basics here. For more detailed example code, see our open-source repository, thanks to Jeremy Freese.

How do I import the data?

Before loading the data, you may need to increase the number of variables Stata will hold.
set maxvar 13000

Then, change your working directory to the place where the file is located, using
cd your_directory.

Load the training outcomes
import delimited train.csv, clear case(preserve) numericcols(_all)
Two options there are critical:

The case(preserve) option ensures that the case of variable names is preserved. Omitting this option will produce errors in your submission since capitalization in variable names is required (e.g., challengeID), but Stata’s default makes all variable names lower case.
The numericcols(_all) option ensures that the outcomes are read as numeric, rather than as character strings.

Merge the background variables to that file using the challengeID identifier.
merge 1:1 challengeID using background.dta

You will see that 2,121 observations were in both datasets. These are the training observations for which we are providing the age 15 outcomes.
You will also see that 2,121 observations were only in the using file, since the background variables but not the outcomes are available for these cases. These are the test cases on which your predictions will be evaluated.
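For readers who prefer Python, a rough pandas analogue of the Stata merge above might look like the following. It is a sketch on toy data rather than the Challenge files; the column values are invented, and the indicator column plays the role of Stata's _merge report.

```python
import pandas as pd

# Toy stand-ins for train.csv and the background file.
train = pd.DataFrame({"challengeID": [1, 3], "gpa": [3.0, 2.5]})
background = pd.DataFrame({"challengeID": [1, 2, 3, 4], "cm1relf": [1, 2, 1, 3]})

# validate="one_to_one" raises an error if challengeID is duplicated on
# either side, similar in spirit to Stata's merge 1:1.
merged = background.merge(
    train, on="challengeID", how="left", validate="one_to_one", indicator=True
)

# Rows found in both files have outcomes (training cases); rows found only
# in the background file are the cases to be predicted.
training_rows = merged[merged["_merge"] == "both"]
test_rows = merged[merged["_merge"] == "left_only"]
print(merged["_merge"].value_counts())
```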

If you have an older version of Stata, you may not be able to open the .dta file with metadata. You can still load the background file from the .csv format. To do that, you should first load the .csv file and save it in a .dta format you can use. Then, follow the instructions above.
import delimited background.csv, clear case(preserve)
save background.dta, replace
Again, note the important case(preserve) option!

How do I make predictions?

If your model is linear or logistic regression, then you can use the predict command.
regress gpa your_predictors
predict pred_gpa
Then the variable pred_gpa has your predictions for GPA. You can do this for all 6 outcomes.

How do I export my submission?

This section assumes your predicted values are named pred_gpa, pred_grit, etc. First, select only the identifier and the predictions.
keep challengeID pred_*
Then, rename all your predictions to not have the prefix pred_
local outcomes gpa grit materialHardship eviction layoff jobTraining
foreach outcome of local outcomes {
    rename pred_`outcome' `outcome'
}
Next, export the prediction file as a .csv.
export delimited using prediction.csv, replace
Finally, bundle this with your code and narrative description as described in the blog post on uploading your contribution!
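The same export steps could be sketched in Python along these lines. This is an illustration on toy data under the same naming assumption as above (predictions stored as pred_gpa, pred_grit, and so on); the helper name make_submission and the extra column are hypothetical.

```python
import pandas as pd

OUTCOMES = ["gpa", "grit", "materialHardship", "eviction", "layoff", "jobTraining"]

def make_submission(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the identifier and predictions, then strip the pred_ prefix."""
    columns = ["challengeID"] + [f"pred_{name}" for name in OUTCOMES]
    return df[columns].rename(columns={f"pred_{name}": name for name in OUTCOMES})

# Toy predictions; the values are invented for illustration.
toy = pd.DataFrame({
    "challengeID": [1, 2],
    **{f"pred_{name}": [0.1, 0.2] for name in OUTCOMES},
    "extra_column": [9, 9],  # hypothetical column that should not be exported
})
submission = make_submission(toy)

# With no path, to_csv returns the CSV text; pass a path such as
# "prediction.csv" to write the file instead.
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])
```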

Ian Lundberg

May 25, 2017

Stata .dta file with metadata

Uncategorized No comments

In response to many requests from Challenge participants, we are now able to provide a .dta file in Stata 14 format. This file contains metadata which we hope will help participants to find variables of interest more easily.

Contents of the .dta file

If you have been working with our background.csv file and the codebooks available at fragilefamilies.princeton.edu, then this .dta file provides the same information you already had, but in a new format.

Each variable has an associated label which contains a truncated version of the survey question text.
For each categorical variable, the text meaning of each numeric level of that variable is recorded with a value label.

You are welcome to build models from the .csv file or from the .dta file.

Distribution of the .dta file

All new applicants to the Challenge will receive a zipped folder containing both background.csv and background.dta.

Anyone who received the data on or before May 24, 2017 may send an email to fragilefamilieschallenge@gmail.com to request a new version of the data file.

Using the .dta file

Stata users can easily load the .dta file, which is in Stata format.

We have prepared a blog post about using the .dta file in R and about using the .dta file in Python to facilitate use of the file in these other software packages.

We hope the metadata in this file enables everyone to build better models more easily!

Matt Salganik

May 25, 2017

Using .dta files in R

Uncategorized 2 comments

We’ve just released the Fragile Families Challenge data in .dta format, which means the files now include metadata that was not available in the .csv files that we initially released. The .dta format is native to Stata, and you might prefer to use R. So, in this post, I’ll give some pointers to getting up and running with the .dta file in R. If you have questions—and suggestions—please feel free to post them at the bottom of this post.

There are many ways to read .dta files into R. In this post I’ll use haven because it is part of the tidyverse.

Here’s how you can read in the .dta files (and I’ll read in the .csv file too so that we can compare them):

library(tidyverse)
library(haven)
ffc.stata <- read_dta(file = "background.dta")
ffc.csv <- read_csv(file = "background.csv")

Once you start working with ffc.stata, one thing you will notice is that many columns are of type labelled, which is not common in R. To convert labelled to factors, use as_factor (not as.factor). For more on labelled and as_factor, see the documentation of haven.

Another thing you will notice is that some of the missing data codes from the Stata file don’t get converted to NA. For example, consider the variable "m1b9b11" for the person with challengeID 1104. This is a missing value that should be NA. This gets parsed correctly in the csv files but not the Stata file.

is.na(ffc.stata[(ffc.stata$challengeid==1104), "m1b9b11"])
is.na(ffc.csv[(ffc.csv$challengeID==1104), "m1b9b11"])

If you have questions or suggestions about working with .dta files in R, please feel free to post them below.

Notes:

The read_dta function in haven is a wrapper around the ReadStat C library.
The read.dta function in the foreign package was popular in the past, but that function is now frozen and will not support anything after Stata 12.
Another way to read .dta files into R is the readstata13 package, which, despite what the name suggests, can read Stata 13 and Stata 14 files.

Alex Kindel

May 25, 2017

Using .dta files in Python

Uncategorized 4 comments

To make data cleaning easier, we’ve released a version of the background variables file in .dta format, generated by Stata. In addition to the table of background data, this file contains metadata on the types of each column, as well as a short label describing the survey questions that correspond to each column. Our hope is that this version of the data file will make it easier for participants to select and interpret variables in their predictive models. If you have any questions or suggestions, please let us know in the comments below!

Working with .dta files in Python

The primary way to work with a .dta file in Python is to use the read_stata() function in pandas, as follows:

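import pandas as pd

df_path = "/Users/user/FFC/data/background.dta"

# Open in binary mode: .dta is a binary format, so text mode ("r") can fail.
with open(df_path, "rb") as f:
    df = pd.read_stata(f)

# Show the first few rows to confirm the file loaded.
print(df.head())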

This creates a pandas.DataFrame object that contains the background variables. By default, pandas will automatically retain the data types as defined in the .dta file.

Notes

Documentation for pandas.read_stata() is available here.
The read_stata() function accepts either a file path or a read buffer (as above).
The .dta file is generated by Stata 14. There are some compatibility issues with pandas and Stata 14 .dta files due to changes to field size limits from earlier versions of Stata. In particular, any UTF-8 decoding errors you run into are likely due to this issue. Please let us know if you run into any trouble working with the file in pandas!
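Since the point of the .dta file is its metadata, it may help to see one way of getting at the variable labels from pandas. The sketch below writes a tiny stand-in file first so that it runs without the Challenge data; the file name and label text are invented, and with the real file you would point the reader at background.dta instead.

```python
import os
import tempfile

import pandas as pd

# Write a tiny stand-in .dta file; the label text here is hypothetical.
toy = pd.DataFrame({"challengeID": [1, 2], "cm1relf": [1, 2]})
path = os.path.join(tempfile.mkdtemp(), "toy_background.dta")
toy.to_stata(
    path,
    write_index=False,
    variable_labels={"cm1relf": "Hypothetical label text"},
)

# iterator=True returns a reader object that exposes the metadata.
with pd.read_stata(path, iterator=True) as reader:
    labels = reader.variable_labels()  # dict: variable name -> label
    # Keep the numeric codes rather than converting them to categoricals.
    df = reader.read(convert_categoricals=False)

print(labels["cm1relf"])
```

Searching the labels dictionary for a keyword is then a quick way to shortlist candidate variables before reading the full documentation.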

Matt Salganik

May 22, 2017

Machine-Readable Fragile Families Codebook

Uncategorized No comments

The Fragile Families and Child Wellbeing study has been running for more than 15 years. As such, it has produced an incredibly rich and complex set of documentation and codebooks. Much of this documentation was designed to be “human readable,” but, over the course of the Fragile Families Challenge, we have had several requests for a more “machine-readable” version of the documentation. Therefore, we are happy to announce that Greg Gundersen, a student in Princeton’s COS 424 (Barbara Engelhardt’s undergraduate machine learning class), has created a machine-readable version of the Fragile Families codebook in the form of a web API. We believe that this new form of documentation will make it possible for researchers to work with the data in unexpected and exciting ways.

There are three ways that you can interact with the documentation through this API.

First, you can search for words inside the question description field. For example, imagine that you are looking for all the questions that include the word “evicted”. You can find them by visiting this URL:
https://codalab.fragilefamilieschallenge.org/f/api/codebook/?q=evicted

Just put your search term after the “q” in the above URL.

The second main way that you can interact with the new documentation is by looking up the question data associated with a variable name. For example, want to know what “cm2relf” is? Just visit:
https://codalab.fragilefamilieschallenge.org/f/api/codebook/cm2relf

Finally, if you just want all of the questionnaire data, visit this URL:
https://codalab.fragilefamilieschallenge.org/f/api/codebook/

A main benefit of a web API is that researchers can now interact with the codebooks programmatically through URLs. For example, here is a snippet of Python 2 code that fetches the data for question “cm2mint”:

>>> import urllib2
>>> import json
>>> response = urllib2.urlopen('https://codalab.fragilefamilieschallenge.org/f/api/codebook/cm2mint')
>>> data = json.load(response)
>>> data
[{u'source file': u'http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_mom_cb1.txt', u'code': u'cm2mint', u'description': u'Constructed - Was mother interviewed at 1-year follow-up?', u'missing': u'0/4898', u'label': u'YESNO8_mw2', u'range': [0, 1], u'unique values': 2, u'units': u'1', u'type': u'numeric (byte)'}]

We are very grateful to Greg for creating this new documentation and sharing it with everyone.
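For readers on Python 3, where urllib2 is not available, the same lookups could be sketched as follows. The helper names are hypothetical, the URLs are the ones listed above, and the sketch assumes the service is still reachable, which may not be the case. The request itself is left commented out so that the example does not depend on the network.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://codalab.fragilefamilieschallenge.org/f/api/codebook/"

def variable_url(name: str) -> str:
    """URL for the codebook entry of a single variable."""
    return BASE_URL + urllib.parse.quote(name)

def search_url(term: str) -> str:
    """URL for a search over the question description field."""
    return BASE_URL + "?" + urllib.parse.urlencode({"q": term})

def fetch(url: str, timeout: float = 10.0):
    """Download a URL and parse the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)

print(variable_url("cm2mint"))
print(search_url("evicted"))
# records = fetch(variable_url("cm2mint"))  # requires network access
```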

Notes:

Greg has open sourced all his code, so you can help us improve the codebook. For example, someone could write a nice front-end so that you can do more than just interact via the url.
The machine-readable documentation should include the following fields: description, source file, type, label, range, units, unique values, missing. If you develop code that can parse some of the missing fields, please let us know, and we can integrate your work into the API.
The machine-readable documentation includes all the documentation that was in text files (e.g., http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_dad_cb5.txt). It does not include documentation that was in pdf format (e.g., http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_hv_cb5.pdf).
When you visit these urls, what gets returned is in JSON format, and different browsers render this JSON differently.
If there is a discrepancy between the machine-readable codebook and the traditional codebook, please let us know.
To deploy this service we used Flask, which is an open source project. Thank you to the Flask community.

Ian Lundberg

May 12, 2017

Final submission deadline

Uncategorized No comments

The final submission deadline for the Fragile Families Challenge will be
2pm Eastern Daylight Time on Tuesday, August 1, 2017.

While it is tempting to stay open indefinitely to continue collecting high-quality submissions, closing is important so that we can conduct the targeted interviews within a reasonable timespan after the original interview, and so that the Fragile Families and Child Wellbeing Study can make the full data available to researchers.

Ian Lundberg

May 5, 2017

How much should I trust the leaderboard?

Uncategorized No comments

The leaderboard on the Fragile Families Challenge submission site is often the first thing participants focus on. It is therefore important to understand!

Why do we like the leaderboard?

The leaderboard:

shows rankings in real-time, motivating better submissions
demonstrates that models that predict well in the training data do not necessarily perform well in an out-of-sample test
makes the Challenge more fun!

Understanding the data split

However, the leaderboard is only a small portion of the overall data. In fact, the observations (rows) in the data are split into:

4/8 training data
1/8 leaderboard data
3/8 test data

As discussed in our blog post on evaluating submissions, final evaluation will be done on a separate set of held-out test data – the 3/8 portion referenced above. This means all awards (including the progress prizes) will be determined using the test data, not the leaderboard. Likewise, our follow-up interviews will focus on the test set observations that were not used for training. Separation between the leaderboard and test sets is important; the leaderboard set isn’t truly held out since everyone receives repeated feedback from this set throughout the Challenge!
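To make the proportions concrete, here is a small sketch of how a 4/8, 1/8, 3/8 split of 4,242 rows could be drawn. It assumes a simple random split with integer rounding, which is an assumption and not a description of how the organizers actually assigned rows, so the exact leaderboard and test counts in the real data may differ slightly.

```python
import numpy as np

def split_rows(n_rows: int, seed: int = 0) -> dict:
    """Randomly assign row positions to train (4/8), leaderboard (1/8), test (3/8)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = n_rows * 4 // 8
    n_leaderboard = n_rows // 8
    return {
        "train": order[:n_train],
        "leaderboard": order[n_train:n_train + n_leaderboard],
        # The test set takes whatever remains after rounding.
        "test": order[n_train + n_leaderboard:],
    }

parts = split_rows(4242)
sizes = {name: len(rows) for name, rows in parts.items()}
print(sizes)  # {'train': 2121, 'leaderboard': 530, 'test': 1591}
```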

Implications for strategy

What does this mean for your ideal strategy? How can you best make use of the leaderboard?

The leaderboard gives an instant snapshot of your out-of-sample performance. This can be useful in evaluating your model, much as splitting your own training set can be helpful.
However, over-fitting to the leaderboard will only hurt your score in the final test set evaluation.
Leaderboard scores are noisy measures of generalization error because they are based on a small sample. So, even as a measure of generalization error, the leaderboard should be interpreted cautiously!

In summary, we expect some models to perform better in the final evaluation than the leaderboard suggests, due to random noise. Likewise, some models will look good on the leaderboard but perform poorly in the final evaluation because they got lucky in the leaderboard. Some submissions may even under-perform in the final evaluation because they made too many modeling adjustments to fit closely to idiosyncrasies of the leaderboard!
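A small simulation can illustrate why a score based on fewer observations is noisier. The sketch below assumes a hypothetical model whose prediction errors are standard normal (so the true mean-squared error is 1) and compares samples of roughly leaderboard size and test size; the sizes 530 and 1,591 are simply 1/8 and 3/8 of 4,242 and are assumptions, not the exact counts used in the Challenge.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_spread(sample_size: int, n_draws: int = 1000) -> float:
    """Standard deviation of the MSE across repeated samples of a given size."""
    # Each row is one hypothetical evaluation set of standard-normal errors.
    errors = rng.normal(size=(n_draws, sample_size))
    return float((errors ** 2).mean(axis=1).std())

leaderboard_spread = mse_spread(530)   # roughly leaderboard-sized samples
test_spread = mse_spread(1591)         # roughly test-sized samples
print(leaderboard_spread, test_spread)
```

In this setup the spread of the smaller sample is noticeably larger, which is the sense in which two models with the same true error can be ranked differently on the leaderboard by chance alone.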

Your final evaluation will not be based on the leaderboard, so you are best advised to use it cautiously as one (noisy) bit of information about your generalization error.

Alex Kindel

April 27, 2017

Progress report: COS424 @ Princeton

Uncategorized 1 comment

As we near the midpoint of the Challenge, we are excited to report on the progress of our first cluster of participants: student teams in COS424, the machine learning fundamentals course at Princeton. You can find some schematic analyses of their performance over time, modeling strategies, and more here. Some of the students have open-sourced their code for all participants to use and learn from; you can find that code here.

Thanks to all the COS424 students for their awesome contributions!

Matt Salganik

April 26, 2017

Progress prizes

Uncategorized 11 comments

We were glad to receive many submissions in time for the progress prizes! As described below, we have downloaded these submissions and look forward to evaluating them and determining the best submissions at the end of the Challenge.

We are excited to announce that progress prizes will be given based on the best-performing models on Wednesday May 10, 2017 at 2pm Eastern time. We will not announce the winners, however, until after the Challenge is complete.

Here’s how it will work. On May 10, 2017 at 2pm Eastern time, we will download all the submissions on the leaderboard. However, we will not calculate which submission has the lowest error on the held-out test data until after the Challenge is complete. The reason for this delay is that we don’t want to reveal any information at all about the held-out test data until after the Challenge is over.

From the submissions that we have received by May 10, 2017 at 2pm Eastern Time, we will pick the ones that have the lowest mean-squared error on the held-out test data for each of the six outcome variables. In other words, there will be one prize for the submission that performs best for grit, and there will be another prize for the submission that performs best for grade point average, and so on.
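As a rough sketch of the scoring rule described above, mean-squared error can be computed separately for each outcome. The function name and toy data below are hypothetical, and skipping rows where the true outcome is missing is an assumption about how missing outcomes would be handled, not a statement of the organizers' procedure.

```python
import pandas as pd

def mse_by_outcome(predictions: pd.DataFrame, truth: pd.DataFrame, outcomes) -> dict:
    """Mean-squared error for each outcome, matching rows on challengeID."""
    merged = predictions.merge(truth, on="challengeID", suffixes=("_pred", "_true"))
    scores = {}
    for name in outcomes:
        # Assumption: rows with a missing true outcome are left out of the score.
        observed = merged[f"{name}_true"].notna()
        error = merged.loc[observed, f"{name}_pred"] - merged.loc[observed, f"{name}_true"]
        scores[name] = float((error ** 2).mean())
    return scores

# Toy data with two of the six outcomes; the values are invented.
predictions = pd.DataFrame({"challengeID": [1, 2], "gpa": [3.0, 2.0], "grit": [3.0, 4.0]})
truth = pd.DataFrame({"challengeID": [1, 2], "gpa": [2.5, None], "grit": [3.0, 2.0]})
scores = mse_by_outcome(predictions, truth, outcomes=["gpa", "grit"])
print(scores)  # {'gpa': 0.25, 'grit': 2.0}
```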

All prize winners will be invited to participate in the post-Challenge scientific workshop at Princeton University, and we will cover all travel expenses for invited participants. If the prize-winning submission is created by a team, we will cover all travel expenses for one representative from that team.

We look forward to seeing the submissions.

Ian Lundberg

April 17, 2017

Getting started workshop at PAA

Uncategorized No comments

The Fragile Families Challenge is excited to host a getting started workshop at the Annual Meeting of the Population Association of America in Chicago!

We will

Present a few slides introducing the Challenge (SLIDES HERE)
Provide food and a friendly collaborative environment
Work together to produce your first submission!

When: 10am – 2pm, Thursday, April 27
Where: Hilton Chicago, Conference Room 4G (DIRECTIONS: Come to the 4th floor and we’re the room way down at the end.)
Who: You! Anyone involved in social science and/or data science can make an important contribution.
RSVP: Mention you’re coming to our PAA workshop when you apply to participate!

We hope to see you there!

Alex Kindel

April 12, 2017

Code Validation

Uncategorized No comments

As part of the challenge, we’re interested in understanding and learning from the strategies participants are using to predict outcomes in the Fragile Families data. One major goal of the challenge is to learn how these strategies evolve and develop over time. We think that a more systematic understanding of how social scientists and data scientists think with data has the potential to better inform how statistical analysis is done. To do this analysis, we use the code and narrative analysis included with each submission.

Recently, we updated the code that evaluates predictions to ensure that groups don’t forget to include their code in their submissions.

What does this mean for me?

Make sure that your directory contains all of the code you used to generate your predictions.
It’s not a problem if the code is in multiple/un-executable scripts. When we look over code submissions, we don’t execute the code.
If you run into an error when you submit your predictions that says you’ve forgotten your code, but your submission does actually contain the code you’ve been using, let us know as soon as possible!

Ian Lundberg

March 6, 2017

Reading survey documentation

Uncategorized No comments

The Fragile Families survey documentation can be confusing. We’ve put together this blog post so you can find out what variables in the Challenge data file mean.

Using the Fragile Families website

The first place to go to find out what a given variable represents is the Fragile Families and Child Wellbeing Study website: http://www.fragilefamilies.princeton.edu/

Once there, click the “Data and Documentation” tab.

This brings you to the main documentation for the full study. On the left, you will see a set of links that will take you to the documentation for particular waves of the data.

Clicking on the link for Year 9 (Wave 5) as an example, we see the following page of documentation for this survey.

Let’s look at the mother questionnaire and codebook. On page 5 of the questionnaire, you will see the following question:

In the corresponding codebook, we see the count of respondents who gave each answer:

Two things are worth noting here.

The question referred to in the questionnaire as A3B is called m5a3b in the codebook. This is because the prefix “m5” indicates that this question comes from the mother wave 5 interview.
Lots of people got coded -6 for “Skip.” Looking back at the questionnaire, we can see why they were skipped over this question: it was only asked of those for whom “PCG = NONPARENT AND RELATIONSHIP = FOSTER CARE.” For children not in foster care, this question would not be meaningful, so it wasn’t asked.

In general, the questionnaires are the best source for information about why certain respondents get skipped over questions. For more information on all the ways data can be missing, see our blog post on missing data.

Structure of the variable names

The general structure of the variable names is [prefix for questionnaire type][wave number][question number].

What are all the variable prefixes?

The most common prefixes are:

Prefix: Meaning
m: Mother
f: Father
h or hv: Home visit
p: Primary caregiver
k: Kid (interview with the child)
kind_: Kindergarten teacher
t: Teacher
ffcc_[something]: Child care surveys. For a full list of the [something] see this documentation.

Constructed variables: An additional prefix

Some variables have been constructed based on responses to several questions. These are often variables that are particularly relevant to the models many researchers want to estimate. These variables add the additional prefix c to the front of the variable name. For instance, cm1ethrace indicates constructed mother’s wave 1 race/ethnicity.

What are the wave numbers?

It’s easy to talk about the questionnaires by the rough child ages at which they were conducted. This is how the documentation website is organized. However, the variable names always refer to wave numbers, not child ages. It’s important not to get confused on this point. The table below summarizes the mapping between wave numbers and approximate child ages.

Wave number: Approximate child age
1: 0, often called “baseline”
2: 1
3: 3
4: 5
5: 9
6: 15

What are the question numbers?

Question numbers typically begin with a letter and a number, e.g., a3.

In questionnaires, questions are referred to by question number alone.
In codebooks, questions are referred to by a prefix and then a question number.
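The naming scheme described above can be sketched as a small parser. This is an illustration only: the regular expression, the function name, and the set of single-questionnaire prefixes it accepts (m, f, h, hv, p, k, t) are assumptions, and names that follow other patterns (such as the kind_ and ffcc_ variables) are simply reported as unparsed.

```python
import re

# Optional "c" (constructed), then a questionnaire prefix, a one-digit wave number,
# and the rest of the name as the question number. "hv" is listed before "h" so the
# longer prefix is tried first.
NAME_PATTERN = re.compile(r"^(c?)(hv|h|m|f|p|k|t)(\d)(.+)$")


def parse_variable_name(name):
    """Split a variable name into its parts, or return None if it does not fit the pattern."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    constructed, prefix, wave, question = match.groups()
    return {
        "constructed": constructed == "c",
        "prefix": prefix,
        "wave": int(wave),
        "question": question,
    }


mother_item = parse_variable_name("m5a3b")
constructed_item = parse_variable_name("cm1ethrace")
```

Here m5a3b splits into prefix m, wave 5, and question a3b, while cm1ethrace is flagged as a constructed mother variable from wave 1.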

How do I find a question I care about?

You might want to find a particular question. For instance, when modeling eviction or material hardship at age 15, you might want to include the same measures collected at age 9. If you ctrl+F or cmd+F for “evicted” in the mother or father codebook or questionnaire at age 9, you will find these variables. In this case, they are m5f23d and f5f23d.

Ian Lundberg

March 6, 2017

GPA

Uncategorized No comments

GPA measures academic achievement.

We want to know:

What helps disadvantaged children to beat the odds and succeed academically?
What derails children so that they perform unexpectedly poorly?

Survey question

How we cleaned the data

Our measure of GPA is self-reported by the child at approximately age 15. We marked as NA the GPAs of children who were not interviewed, reported no grade, refused to answer, did not know, or were homeschooled, for any of the four subjects. For children with valid answers, we averaged the responses for all four subjects, then subtracted this number from 5 to produce an estimate of child GPA ranging from 1 to 4. In our re-coded variable, a GPA of 4.0 indicates that the child reported straight As, while a GPA of 1.0 indicates that the child reported getting all grades of D or lower.
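The recoding described above can be sketched as follows. The raw coding used here (1 for an A through 4 for a D or lower) is inferred from the description, and treating anything outside 1 to 4 as invalid is an assumption; the function name is hypothetical.

```python
def recode_gpa(responses):
    """Return GPA on a 1-4 scale from four raw subject grades, or None (NA).

    Raw grades are assumed to be coded 1 (A) through 4 (D or lower). Any other
    value (a negative missing-data code, None, and so on) makes the result NA.
    """
    if len(responses) != 4 or any(r not in (1, 2, 3, 4) for r in responses):
        return None
    # Reverse the scale so that higher values mean better grades.
    return 5 - sum(responses) / 4


straight_as = recode_gpa([1, 1, 1, 1])   # 4.0
mixed = recode_gpa([1, 2, 2, 3])         # 3.0
refused_one = recode_gpa([1, 2, -1, 3])  # None, because one subject is missing
```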

Distribution in the training set

Scientific motivation

Helping kids “beat the odds” academically is a fundamental goal of education research; academic success can be the key to breaking the cycle of poverty. Free public education is often referred to as a great equalizer, yet children who grow up in disadvantaged families consistently underperform their more affluent peers on average.

However, the average is not the whole story. Some kids do well despite being expected to do poorly. In fact, the amount of unexplained variation in educational achievement is enormous: social science models typically have R-squared values of 0.2 or less [this is based on our informal experience with the literature, not a systematic search]. The poor predictive performance of social science models of educational attainment has long been known. In the now-classic 1972 book Inequality: A Reassessment of the Effect of Family and Schooling in America, Harvard social scientist Christopher Jencks argued that random chance played a larger role than measured family background characteristics in determining socioeconomic outcomes.

While social scientists have learned something about what helps children succeed academically in the decades since 1972, a huge proportion of the variance remains buried in the error term of regression models. Is this term truly random chance, or is there “dark matter” out there in the form of unmeasured but important variables that help some kids to beat the odds?

By submitting a model for GPA at age 15, you help us in our quest to find this dark matter. Based on our collaborative model combining all of the individual submissions, we will identify our best guess as a scientific community about how children are expected to perform at age 15. Then, we will identify a subset of children performing much better and worse than expected. We will interview these children to answer the question: what unmeasured variables are common to the kids who are beating the odds, which we do not observe among the children who are struggling unexpectedly?

When you participate, you help us target interviews at the children whose outcomes are least well explained by our measured variables. These children are best-positioned for exploratory qualitative research to uncover unmeasured but important factors. Interviews may help us learn how some kids beat the odds. These results may drive future deductive research to evaluate the causal effect of these unmeasured variables, and ultimately we hope that policymakers can intervene on the “dark matter” we find in order to improve the lives of other disadvantaged children in the future.

Ian Lundberg

March 6, 2017

Grit

Uncategorized No comments

Grit is a measure of passion and perseverance. It predicts success in many domains. The causes of grit remain unknown.

We want to know: What makes some kids unexpectedly grittier than others in adolescence?

Survey questions

The survey questions are adapted from the grit scale proposed by Duckworth, Peterson, Matthews, and Kelly (2007).

How we cleaned the data

Our measure of grit is based on the four questions above, as answered by the child at approximately age 15. These items were part of a longer battery of questions capturing a wider range of attitudes, emotions, and outlooks. Children who refused any of the four questions or didn’t know how to answer were coded as NA, as were children who did not complete the age 15 interview. For children with four valid answers, we averaged the answers and subtracted the result from 5. This created a continuous scale ranging from 1 to 4. The way we have recoded it, a high score on our variable indicates more grit.

Distribution in the training set

Scientific motivation

Do you keep working when the going gets tough? If so, you probably have a lot of grit.

University of Pennsylvania psychologist and MacArthur “Genius” award winner Angela Duckworth has found that grit predicts all kinds of measures of success: persistence through a military training program at West Point, advancement through the Scripps National Spelling Bee, and educational attainment, to name a few. Duckworth’s work has reached the general public through her TED talk and NY Times bestseller Grit: The Power of Passion and Perseverance.

While it is clear that grit predicts success, it is less clear what causes some people to be grittier than others. How can we help more disadvantaged children to exhibit grit?

A few researchers have begun to examine this question. In their book Coming of Age in the Other America, social scientists Stefanie DeLuca (Johns Hopkins University), Susan Clampet-Lundquist (St. Joseph’s University), and Kathryn Edin (Johns Hopkins University) argue that kids growing up in impoverished urban neighborhoods are often inspired to have grit when they develop passion for an “identity project”: a personal passion that gives them something to aspire toward beyond the challenges of the present day. This ethnographic work exemplifies how qualitative social science research may be able to uncover previously unmeasured sources of grit.

How much more could we learn if qualitative interviews were targeted at the kids best positioned to be informative about unmeasured sources of grit? By participating, you can help us build a community model for grit measured in adolescence. The combined submissions of all who participate will identify our common agreement about the amount of grit we expect to see in the Fragile Families respondents, given all of their childhood experiences from birth to age 9. By interviewing children who have much more or much less grit than we all expect, we will uncover unmeasured factors that predict grit. It is our hope that these unmeasured factors can inform future deductive evaluations and ultimately policy interventions to help kids break the cycle of poverty by developing grit.

Grit is an important predictor of success, but the causes of grit are largely unknown. Be part of the solution and help us target interviews toward those best positioned to show us these unmeasured sources of grit. Apply to participate, build a model, and upload your contribution.

Ian Lundberg

March 6, 2017

Material hardship

Uncategorized No comments

Material hardship is a measure of extreme poverty.

We want to know:

What helps families to unexpectedly escape extreme poverty?
What leads families to fall into extreme poverty unexpectedly?

Survey questions

How we cleaned the data

These questions were asked of the child’s primary caregiver when the child was approximately age 15. We marked as NA material hardship for children whose caregivers did not participate in the survey, didn’t know the answer to one or more questions, or refused one or more questions. Our material hardship measure is the proportion of these 11 questions for which the child’s caregiver answered “Yes.” Material hardship ranges from 0 to 1, with higher values indicating more material hardship.
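As a rough sketch of this calculation, assume each of the 11 answers has already been read in as True (“Yes”), False (“No”), or None (didn’t know or refused). That representation and the function name are assumptions made for illustration.

```python
def material_hardship(answers):
    """Proportion of the 11 hardship items answered "Yes", or None (NA).

    Each answer is assumed to be True ("Yes"), False ("No"), or None
    (didn't know / refused). One missing item makes the whole measure NA.
    """
    if len(answers) != 11 or any(a is None for a in answers):
        return None
    return sum(1 for a in answers if a) / 11


three_hardships = material_hardship([True] * 3 + [False] * 8)
one_refusal = material_hardship([True] * 3 + [None] + [False] * 7)
```

A caregiver who reports three of the eleven hardships gets a score of 3/11, while a single refused item makes the whole measure NA.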

Distribution in the training set

Scientific motivation

In his 1964 State of the Union Address, President Lyndon B. Johnson declared an “all-out war on human poverty and unemployment in these United States.” In the decades since, America has taken great strides toward this goal. However, severe deprivation remains a problem today. In $2 a Day: Living on Almost Nothing in America, Johns Hopkins sociologist Kathryn Edin and University of Michigan social work professor H. Luke Schaefer bring us into the lives of American families living in the nightmare of extreme poverty.

What can be done to reduce extreme poverty? By identifying families who unexpectedly escape extreme poverty, as well as those who unexpectedly fall into it, we hope to uncover unmeasured but important factors that affect severe deprivation.

Measuring extreme poverty is hard. The material hardship scale was originally proposed in a 1989 paper by Susan Mayer and Christopher Jencks, then social scientists at Northwestern University. Rather than focusing solely on respondents’ incomes, Mayer and Jencks asked respondents about particular needs that they were unable to meet. This scale proved fruitful and captured a dimension of poverty above and beyond what was captured by income alone. With minor modifications, the material hardship scale became a standard measure in the federal Survey of Income and Program Participation (SIPP), and it has been included in several waves of the Fragile Families Study.

By participating, you help us to identify the level of material hardship that is expected at age 15 for each of the families in the Fragile Families Study. By combining all of the submissions in one collaborative model, we will produce the best guess by the scientific community of the experiences we expect for families at age 15. Undoubtedly, some families will report much more or much less material hardship than we expect. By interviewing these families, we hope to discover unmeasured but important factors that are associated with sudden dives into material hardship or unexpected recoveries.

The results of these exploratory interviews can then inform future deductive social science research and help us propose policies that could help families to escape severe deprivation. You can help us to target these interviews at the families best positioned to help. Be a part of the solution: apply to participate, build a model, and upload your contribution.

Ian Lundberg

March 5, 2017

Eviction

Uncategorized No comments

Eviction is a traumatic experience in which families are forced from their homes for not paying the rent or mortgage.

We want to know: As children transition into adulthood, does eviction cause negative outcomes?

Survey question

When children were about 15 years old, each child’s primary caregiver was asked the following question:

How we cleaned the data

Those who did not participate in the age 15 interview, as well as those who refused (-1) or didn’t know (-2), were coded as NA. Those who responded “Yes” were coded 1, and those who responded “No” were coded 0. We additionally coded as 1 a small group of respondents who answered in a previous question that they were evicted in the past year, and thus were skipped over this question.
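These coding rules can be sketched as below. The codes -1 (refused) and -2 (don’t know) come from the description above, and -6 is the “Skip” code used elsewhere in the survey files. The values assumed for “Yes” (1) and “No” (2), the use of None for a caregiver who was not interviewed, and the function name are illustrative assumptions.

```python
YES, NO = 1, 2   # assumed raw codes for "Yes" and "No"
SKIP = -6        # "Skip" code used in the survey files


def code_eviction(response, evicted_past_year=False):
    """Return 1 (evicted), 0 (not evicted), or None (NA).

    response is the raw answer, or None if the caregiver was not interviewed.
    evicted_past_year marks respondents who reported an eviction in an earlier
    question and were therefore skipped over this one.
    """
    if response == SKIP and evicted_past_year:
        return 1
    if response == YES:
        return 1
    if response == NO:
        return 0
    # Refused (-1), don't know (-2), not interviewed, or any other code.
    return None


coded = [code_eviction(r) for r in (1, 2, -1, -2, None)]
skipped_but_evicted = code_eviction(-6, evicted_past_year=True)
```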

Distribution in the training set

Scientific motivation

In the New York Times bestseller Evicted: Poverty and Profit in the American City, Harvard sociologist and MacArthur “Genius” award winner Matthew Desmond describes fieldwork in which he spent several years living alongside tenants being evicted in low-income Milwaukee neighborhoods. Desmond helped tenants move their things into trucks, followed landlords into eviction court, and watched as children moved from school to school while their families searched for housing. Eviction literally uproots families from their homes, and it is most prevalent among the most disadvantaged urban families. Given Desmond’s qualitative account, it is plausible that eviction may have substantial negative effects on child outcomes in early adulthood.

Emerging evidence further suggests that eviction is sufficiently prevalent to warrant policy attention. Researchers at the Federal Reserve Bank of Atlanta have examined administrative records to find that 12.2 percent of rental households were evicted and forcibly displaced in 2015 in Fulton County, GA (Raymond et al. 2016). Likewise, the Milwaukee Area Renters Study found that 13 percent of private renters experienced a forced move during the 2 years referenced in a survey questionnaire (Desmond and Schollenberger 2015). If eviction creates disadvantage for children, it is sufficiently prevalent to have wide-reaching impacts.

However, untangling cause from selection is no simple task (see our blog post on causal inference and this interview with Matthew Desmond on the topic). It is easy to show that children who experience an eviction have worse outcomes later in life; it is hard to show that these outcomes are not caused by other factors that are correlated with eviction. In a quantitative study using propensity score matching methods on earlier waves of the Fragile Families and Child Wellbeing Study, Desmond and Kimbro (2015) find that eviction is associated with negative outcomes, net of obvious sources of selection bias.

We applaud the work of all the individual research teams that have placed eviction on the table as a scientific concept of interest. However, any individual research team can only adjust for a selected group of observed covariates, and results can be sensitive to the set chosen. We ask you to contribute a model for the probability that a child experiences an eviction between the age 9 and age 15 interviews of the Fragile Families and Child Wellbeing Study, given any set of the birth to age 9 characteristics you choose to include, and any statistical model you choose to employ. Together, we will produce a collaborative propensity score model that the entire scientific community can agree upon, which is not sensitive to researcher decisions. We will then interview a subset of children who are matched on the propensity score, to assess the plausibility of the conditional ignorability assumption required for causal inference (see our blog post on causal inference). If the interviews suggest that causal inference may be warranted, we will use these collaborative propensity scores to estimate the causal effect of eviction on child outcomes to be measured several years from now, when children are approximately 22 years old.

In summary, this research agenda will produce estimates of the effect of adolescent eviction on attainment during the transition to adulthood. These collaborative estimates will be robust to the decisions of individual researchers. The assumptions needed for causal inference will be validated in qualitative interviews. These steps will maximize the validity of causal inference in the absence of a randomized experiment.

To achieve these goals, we need your help. Apply to participate, build a model, and upload your contribution!

Ian Lundberg

March 5, 2017

Layoff

Uncategorized No comments

Being laid off is a sudden and often unexpected experience with potentially detrimental consequences for one’s family.

We want to know: When a caregiver is laid off, do adolescent children suffer collateral damage?

Survey question

When children were about 15 years old, each child’s primary caregiver was asked the following question:

How we cleaned the data

Those who did not participate in the age 15 interview, as well as those who refused (-1) or didn’t know (-2), were coded as NA. Those who have never worked or have not worked since the age 9 interview (in approximately the prior 6 years) were coded as NA; these respondents are not at risk for a layoff. Those who responded “Yes” were coded 1, and those who responded “No” were coded 0.

Distribution in the training set

Scientific motivation

A steady job can provide financial security to a family. However, this security can be upset by plant closures, downsizing, and other economic shifts that lead caregivers to lose their jobs. In addition, some caregivers may be fired but report in a survey that they have been laid off. In any case, layoff of a caregiver could create dramatic disadvantages for adolescents nearing the transition to adulthood.

Social scientists worry about layoffs because precarious work is on the rise. In Good Jobs, Bad Jobs, University of North Carolina sociologist Arne L. Kalleberg outlines economic shifts that have made steady employment harder to come by in the United States over the past several decades. Gone are the days when workers could count on a single job to carry them throughout their careers – job changes and unexpected unemployment are now commonplace.

Social scientists also worry about layoffs because they may negatively influence child achievement. Sociologists Jennie E. Brand (UCLA) and Juli Simon Thomas (Harvard) have shown in an article published in the American Journal of Sociology that maternal job displacement reduces a child’s chances of high school and college completion by 3 – 5 percentage points, with even larger effects among those unlikely to experience job displacement and those whose mothers experienced job displacement while the child was an adolescent. When caregivers lose their jobs, children suffer collateral damage.

However, causal conclusions always depend on modeling assumptions. The propensity score matching methods used in the paper cited above assume that the model for the probability of job displacement is correctly specified, and that there are no unmeasured variables that affect job displacement and also directly affect child outcomes. To learn more about these assumptions, see our blog post on causal inference.

The Fragile Families Study follows a particularly disadvantaged sample of urban children, for whom we would especially like to know the effect of maternal layoff on adult outcomes. By participating, you help us to produce a collaborative propensity score model that combines the best of all the individual submissions into a single metric that is robust to the modeling decisions of individual researchers. This model will also help us target interviews at the children best positioned to lend suggestive evidence about the plausibility of the untestable conditional ignorability assumption required for causal inference. If this assumption seems credible after interviews, we will use our collaborative propensity scores to estimate the causal effect of caregiver layoff on child outcomes in early adulthood, once those outcomes are measured several years from now.

By participating, you can be part of extending our body of knowledge to provide maximally robust causal evidence with observational data about the effect of caregiver layoffs on child outcomes in a disadvantaged urban sample. Results will inform policy changes about whether support for steady caregiver employment could help disadvantaged children.

Be a part of the solution. Apply to participate, build a model, and upload your contribution.

Ian Lundberg

March 5, 2017

Job training

Uncategorized No comments

Policymakers often propose programs to retrain the workforce to be able to contribute in a 21st century economy.

We want to know: Do job skills programs utilized by caregivers yield collateral benefits for disadvantaged children?

Survey question

When children were about 15 years old, each child’s primary caregiver was asked the following question:

How we cleaned the data

Distribution in the training set

Scientific motivation

One way to raise people’s standard of living is to raise their human capital: the skills that promote productive participation in the labor force. Human capital investments are perhaps more important now than ever before given rapid globalization and computerization of the economy. Does participation in job training programs designed to build computer, language, or other skills improve the well-being of families? When caregivers participate in these programs, do children benefit indirectly?

Social scientists have long been interested in policy interventions to promote employment. This research has also been closely tied to the development of statistical methods for causal inference with observational data. In the 1970s, the National Supported Work Demonstration (NSW) randomly assigned some disadvantaged, non-employed workers to a job training program that included guaranteed employment for a short period of time. Others were randomly assigned to a control condition. The treatment led to measurable increases in earnings in subsequent years, suggesting that job training might be useful.

University of Chicago economist Robert LaLonde saw a new use for these data. Given that experimental results provided the “true” causal effect of job training on earnings, LaLonde wanted to know whether econometric techniques that statistically adjust for selection bias could recover this “true” effect in a non-experimental setting. In general, these statistical adjustments failed to recapture the “true” effect, and LaLonde’s 1986 paper became highly cited as evidence of the extreme difficulty of drawing causal inferences from observational data.

However, the story did not end there. About the same time, a pair of statisticians developed a new method for identifying causal effects: propensity score matching. In an enormously influential 1983 paper, Paul R. Rosenbaum (then of the University of Wisconsin) and Donald B. Rubin (then of the University of Chicago) showed that the average causal effect of a binary treatment on an outcome could be identified by matching treated units with untreated units who had similar probabilities of treatment given observed pre-treatment characteristics. The Rosenbaum and Rubin theorem held only in a sufficiently large sample and only when one estimated the propensity score correctly without omitting any important variables that might affect the treatment and directly affect the outcome. Despite these limitations, the key idea stuck: under certain assumptions, one can use observational data to try to re-create the type of data one would get in a randomized experiment where background characteristics no longer determine treatment assignment.
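To give a feel for the matching idea, here is a minimal sketch that takes already-estimated propensity scores as given and pairs each treated unit with the untreated unit whose score is closest. Nearest-neighbor matching with replacement, the focus on the average effect among the treated, the function name, and the toy numbers are all assumptions chosen for illustration. This is not the exact procedure from the 1983 paper or the one the Challenge will use.

```python
def matched_effect_on_treated(scores, treated, outcomes):
    """Average outcome gap between each treated unit and its nearest-score control.

    scores: estimated probability of treatment for each unit (taken as given)
    treated: 1 if the unit received the treatment, 0 otherwise
    outcomes: observed outcome for each unit
    """
    treated_units = [i for i, t in enumerate(treated) if t]
    control_units = [i for i, t in enumerate(treated) if not t]
    if not treated_units or not control_units:
        raise ValueError("need at least one treated and one untreated unit")
    differences = []
    for i in treated_units:
        # Nearest-neighbor match on the propensity score, with replacement.
        j = min(control_units, key=lambda c: abs(scores[c] - scores[i]))
        differences.append(outcomes[i] - outcomes[j])
    return sum(differences) / len(differences)


# Toy data: two treated units and three untreated units.
scores = [0.8, 0.6, 0.75, 0.55, 0.2]
treated = [1, 1, 0, 0, 0]
outcomes = [10, 8, 7, 6, 1]
effect = matched_effect_on_treated(scores, treated, outcomes)
```

In the toy data the untreated unit with a score of 0.2 is never used as a match, which shows how matching sets aside comparison units that look unlike anyone who was treated.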

Empowered with propensity scores, two other statisticians reassessed LaLonde’s findings: could propensity score methods recover the experimental benchmark in the job training example? Rajeev H. Dehejia (then of Columbia University) and Sadek Wahba (then of Morgan Stanley) found that they could. In two highly-cited papers (paper 1 and paper 2), they demonstrated that propensity score methods came much closer to recovering the experimental truth than the econometric approaches used by LaLonde.

The saga of job training and causal inference has continued to the present day. For instance, a 2002 paper by economists Jeffrey Smith (then of the University of Maryland) and Petra Todd (University of Pennsylvania) demonstrated that propensity score methods can be highly sensitive to researcher decisions. Since then, numerous statisticians and social scientists have used the job training example to demonstrate the usefulness of new matching methods: entropy balancing (Hainmueller 2012), genetic matching (Diamond and Sekhon 2013), and the covariate balancing propensity score (Imai and Ratkovic 2014), to name a few.

Be part of the next step

Clearly there is a lot of interest in human capital formation through job training. There is also interest in methods to infer causal effects from observational data. How does the Fragile Families Challenge fit in?

A slightly different treatment

The LaLonde (1986) paper and subsequent studies focused on an intensive job training program that connected non-employed individuals with jobs. The “treatment” variable which you will predict is much milder: participation in any classes to improve job skills, such as computer training or literacy classes. Respondents who enroll in these classes are not necessarily non-employed.

A robust propensity score model

One piece of conventional wisdom about propensity score methods is that one should be careful about selecting the pretreatment variables to include in the model, and one must model their relationship to the treatment variable appropriately. This is where you can help! Together we will build a highly robust community model for the probability of job training. This community model will take all of our best ideas and create one product on which we can all agree.

Specifying models before outcomes occur

A second piece of conventional wisdom of propensity score modeling is that it allows one to conduct all modeling and matching before even looking at the outcome variable. In our case, the ultimate outcome variables are not yet measured: we will examine the effect of caregiver job training on child outcomes in early adulthood. These outcomes will be measured several years from now, long after we lock in our community propensity score model.

Evaluating assumptions

All covariate-adjustment methods to draw causal inferences from observational data rely on the assumption of conditional ignorability (for more about this assumption, see our blog post about causal inference). Through targeted interviews with caregivers, we can provide suggestive evidence as to whether the conditional ignorability assumption holds.

You can help

Be a part of the next step in observational causal inference to evaluate the effect of job training programs. Apply to participate, build a model, and upload your contribution.

Ian Lundberg

March 4, 2017

Blog posts

Uncategorized 3 comments

In addition to the general Fragile Families documentation, the following blog posts provide more details about the data and the scientific goals of the project.

Blog posts about outcomes
- GPA
- Grit
- Material hardship
- Eviction
- Layoff
- Job training
Apply to participate
Build a model

Getting started quickly

Helpful ideas

Participant-generated resources

Timeline of challenge

Progress report from Cos 424 pilot at Princeton
Progress prizes (May 10, 2017, 2pm Eastern time)
Final submission deadline (August 1, 2017, 2pm Eastern time)
Deadline to submit a manuscript to a special issue of Socius (October 1, 2017, 11:59pm Eastern time)

Weekly office hours

Scientific goals
- Discover unmeasured and important factors
- Prioritize issues for intervention: Causal inference
- Compare modeling approaches

Ian Lundberg

February 23, 2017

Weekly office hours

Uncategorized No comments

From 3:30-4:30pm Eastern Daylight Time every Wednesday, one of us will be at the computer to answer your questions. At those times, please video call us via Google Hangout at fragilefamilieschallenge@gmail.com.

For more immediate feedback from the full community of users, post on our discussion forum for the Fragile Families Challenge.

For concerns you do not wish to share with the entire community, you can also contact us privately.

Ian Lundberg

February 22, 2017

Discovering unmeasured factors

Uncategorized No comments

Beating the odds

Despite coming from disadvantaged backgrounds, some kids manage to “beat the odds” and achieve unexpectedly positive outcomes. Meanwhile, other kids who seem on track sometimes struggle unexpectedly. Policymakers would like to know what variables are associated with “beating the odds” since this could generate new theories about how to help future generations of disadvantaged children.

Once we combine all of the submissions to the Fragile Families Challenge into one collaborative guess for how children will be doing on each outcome at age 15, we will identify a small number of children doing much better than expected (“beating the odds”), and another set who are doing much worse than expected (“struggling unexpectedly”). By interviewing these sets of children, we will be well-positioned to learn which factors are associated with ending up in each group.
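One simple way to picture this selection step is sketched below. It assumes that submissions are combined by a plain average, that higher values of the outcome are better (as with GPA), and that children are ranked by the gap between observed and predicted outcomes. The combining rule, the function name, and the toy numbers are illustrative assumptions rather than the procedure we will actually use.

```python
def flag_unexpected(observed, submissions, n_flag=1):
    """Return (beating_the_odds, struggling) as lists of child indices.

    observed: actual outcomes, with None where the outcome is missing
    submissions: one list of predictions per participating team
    """
    n = len(observed)
    # Collaborative guess: here, simply the mean prediction across submissions.
    combined = [sum(s[i] for s in submissions) / len(submissions) for i in range(n)]
    # Gap between what happened and what was expected, for children with an outcome.
    gaps = sorted((observed[i] - combined[i], i) for i in range(n) if observed[i] is not None)
    struggling = [i for _, i in gaps[:n_flag]]
    beating = [i for _, i in gaps[-n_flag:]]
    return beating, struggling


observed_gpa = [3.8, 2.0, 3.0, 1.5]
team_predictions = [[2.5, 2.5, 3.0, 3.0], [2.7, 2.3, 3.0, 3.2]]
beating, struggling = flag_unexpected(observed_gpa, team_predictions)
```

In this toy example the first child does far better than the combined prediction and the last child does far worse, so those two would be the candidates for interviews.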

What we learn in these interviews will affect the questions asked in future waves of the Fragile Families Study, and possibly other studies like it. By combining quantitative models with inductive interviews, the Fragile Families Challenge offers a new way to improve surveys in the future and expand the range of social science theories. In the remainder of this blog post, we discuss current approaches to survey design and the potential contribution of the Fragile Families Challenge.

Deductive survey design: Evaluating theories

Social scientists often design surveys using deductive approaches based on theoretical perspectives. For instance, economists theorize about how one’s employment depends on the hypothetical wage offer (often called a “reservation wage”) one would have to be given before one would leave other unpaid options behind and opt into paid labor. Motivated by this theoretical perspective, Fragile Families and other surveys have incorporated questions like: “What would the hourly wage have to be in order for you to take a job?”

However, even the best theoretically-informed social science measures perform poorly at the task of predicting outcomes. R-squared, a measure of a model’s predictive validity, often ranges from 0.1 to 0.3 in published social science papers. Simply put, a huge portion of the variance in outcomes we care about is unexplained by the predictors social scientists have invented and put their faith in.

Inductive interviews: A source of new hypotheses

How can we be missing so much? Part of the problem might be that academics who propose these theoretical perspectives often spend their lives far from the context in which the data are actually collected. An alternative, inductive approach is to conduct open-ended interviews with interesting cases and allow the theory to emerge from the data. This approach is often used in ethnographic and other qualitative work, and points researchers toward alternative perspectives they never would have considered on their own.

Inductive approaches have their drawbacks: researchers might develop a theory that works well for some children, but does not generalize to other cases. Likewise, the unmeasured factors we discover will not necessarily be causal. However, inductive interviews will generate hypotheses that can be later evaluated using deductive approaches in new datasets, and finally evaluated with randomized controlled trials.

An ideal combination: Cycling between the two

To our knowledge, the Fragile Families Challenge is the first attempt to cycle between these two approaches. The study was designed with deductive approaches: researchers asked questions based on social science theories about the reproduction of disadvantage. However, we can use qualitative interviews to inductively learn new variables that ought to be collected. Finally, we will incorporate these variables in future waves of data collection to deductively evaluate theories generated in the interviews, using out-of-sample data.

By participating in the Fragile Families Challenge, you are part of a scientific endeavor to create the surveys of the future.

Ian Lundberg

February 18, 2017

Missing data

Uncategorized 1 comment

This blog post:

- discusses how missing data is coded in the Fragile Families study
- offers a brief theoretical introduction to the statistical challenges of missing data
- links to software that implements one solution: multiple imputation

Of course, you can use any strategy you want to deal with missing values: multiple imputation is just one strategy among many.

Missing data in the Fragile Families study

Missing data is a challenge in almost all social science research. It generally comes in two forms:

Item non-response: Respondents simply refuse to answer a survey question.
Survey non-response: Respondents cannot be located or refuse to answer any questions in an entire wave of the survey.

While the first problem is common in any dataset, the second is especially prevalent in panel studies like Fragile Families, in which the survey is composed of interviews conducted at various child ages over the course of 15 years.

While the survey documentation details the codes for each variable, a few global rules summarize the way missing values are coded in the data. The most common responses are bolded.

-9 Not in wave – Did not participate in survey/data collection component
-8 Out of range – Response not possible; rarely used
-7 Not applicable (also -10/-14) – Rarely used for survey questions
-6 Valid skip – Intentionally not asked question; question does not apply to respondent or response known based on prior information.
-5 Not asked “Invalid skip” – Respondent not asked question in the version of the survey they received.
-3 Missing – Data is missing due to some other reason; rarely used
-2 Don’t know – Respondent asked question; Responded “Don’t Know”.
-1 Refuse – Respondent asked question; Refused to answer question

When responses are coded -6, you should look at the survey questionnaire to determine the skip pattern. What did these respondents tell us in prior questions that caused the interviewer to skip this question? You can then decide the correct way to code these values given your modeling approach.

When responses are coded -9, you should be aware that many questions will be missing for this respondent because they missed an entire wave of the survey.

For most other categories, an algorithmic solution as described below may be reasonable.
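
As a minimal sketch of what an algorithmic first step could look like, the snippet below replaces the negative missing-data codes listed above with None, while flagging -6 (valid skip) and -9 (not in wave) separately so they can be handled by hand. The function name, the flag labels, and the choice of which codes to lump together are illustrative assumptions, not part of the study documentation.

```python
# Codes from the list above treated here as generic missingness.
# This grouping is an assumption; adjust it to your modeling approach.
GENERIC_MISSING = {-14, -10, -8, -7, -5, -3, -2, -1}
VALID_SKIP = -6     # look at the questionnaire's skip pattern before recoding
NOT_IN_WAVE = -9    # respondent missed the entire wave


def recode_missing(values):
    """Return (cleaned, flags) for one column of numeric survey responses.

    cleaned: values with every missing code replaced by None
    flags:   "observed", "missing", "valid_skip", or "not_in_wave" per value
    """
    cleaned, flags = [], []
    for v in values:
        if v == VALID_SKIP:
            cleaned.append(None)
            flags.append("valid_skip")
        elif v == NOT_IN_WAVE:
            cleaned.append(None)
            flags.append("not_in_wave")
        elif v in GENERIC_MISSING:
            cleaned.append(None)
            flags.append("missing")
        else:
            cleaned.append(v)
            flags.append("observed")
    return cleaned, flags


# Hypothetical responses to a single survey question.
column = [3, -9, 1, -6, -2, 4]
cleaned, flags = recode_missing(column)
print(cleaned)
print(flags)
```

Keeping the flags alongside the cleaned values makes it possible to treat valid skips differently from refusals later on, for example by filling them in from the earlier question that triggered the skip.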

Theoretical issues with missing data

Before analyzing data with missing values, researchers must make assumptions about how some data came to be missing. One of the most common assumptions is the assumption that data are missing at random. For this assumption to hold, the pattern of missingness must be a function of the other variables in the dataset, and not a function of any unobserved variables once those observed are taken into account.

For instance, suppose children born to unmarried parents are less likely to be interviewed at age 9 than those born to married parents. Since the parents’ marital status at birth is a variable observed in the dataset, it is possible to adjust statistically for this problem. Suppose, on the other hand, that some children miss the age 9 interview because they suddenly had to leave town to attend the funeral of their second cousin once removed. This variable is not in the dataset, so no statistical adjustment can fully account for this kind of missingness.

For a full theoretical treatment, we recommend

One solution: Imputation

Once we assume that data are missing at random, a valid approach to dealing with the missing data is imputation. This is a procedure whereby the researcher estimates the association between all of the variables in the model, then fills in (“imputes”) reasonable guesses for the values of the missing variables.

The simplest version of imputation is known as single imputation. One uses an algorithm to guess a value for each missing observation. This produces one complete dataset, which can be analyzed like any other. However, single imputation fails to account for our uncertainty about the true values of the missing cases.

Multiple imputation is a procedure that produces several data sets (often in the range of 5, 10, or 30), with slightly different imputed values for the missing observations in each data set. Differences across the datasets capture our uncertainty about the missing values. One can then estimate a model on each imputed dataset, then combine estimates across the imputed datasets using a procedure known as Rubin’s rules.
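
To make the combining step concrete, here is a minimal sketch of Rubin’s rules for a single coefficient. The pooled estimate is the average across imputed datasets, and the pooled variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The function name and the numbers are made up for illustration; in practice the packages listed in the next section do this for you.

```python
from statistics import mean, variance


def rubins_rules(estimates, variances):
    """Pool one coefficient across m imputed datasets.

    estimates: the coefficient estimated on each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up results from m = 5 imputed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.010, 0.011, 0.009, 0.010, 0.010]
pooled, total_var = rubins_rules(estimates, variances)
print(pooled, total_var ** 0.5)  # pooled estimate and its standard error
```

The between-imputation term is what single imputation leaves out: if the imputed datasets disagree with one another, the pooled standard error grows accordingly.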

Ideally, one would conduct multiple imputation on a dataset with all of the observed variables. However, this can become computationally intractable in a dataset like Fragile Families with thousands of variables. In practice, researchers often select the variables to be included in their model, restrict the data to only those variables, and then multiply impute missing values in this subset.

Implementing multiple imputation

There are many software packages to implement multiple imputation. A few are listed below.

In R, we recommend Amelia (package home, video introduction, vignette, documentation) or MICE (package home, introductory paper, documentation). Depending on your implementation, you may also need mitools (package home, vignette, documentation) or Zelig (website) to combine estimates from several imputed datasets.

In Stata, we recommend the mi set of functions as described in this tutorial.

In SPSS, we recommend this tutorial.

In SAS, we recommend this tutorial.

This set is by no means exhaustive. One curated list of software implementations is available here.

Ian Lundberg

February 18, 2017

Evaluating submissions

Uncategorized No comments

We will evaluate submissions based on predictive validity, measured in the held-out test data by mean squared error loss for continuous outcomes and Brier loss for binary outcomes.
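
For readers who want to score their own predictions locally, here is a minimal sketch of the two loss functions. Brier loss is the mean squared difference between a predicted probability and the 0/1 outcome, so it reuses the same calculation. The function names are illustrative, and skipping cases whose true value is missing (None) is an assumption about how unscored cases would be handled, not a description of the platform’s code.

```python
def mean_squared_error(truth, predicted):
    """Average squared difference, ignoring cases whose true value is None."""
    pairs = [(t, p) for t, p in zip(truth, predicted) if t is not None]
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)


def brier_loss(truth, predicted_prob):
    """Brier loss for a binary outcome: MSE between 0/1 truth and predicted probability."""
    return mean_squared_error(truth, predicted_prob)


# Made-up values for a continuous outcome (gpa) and a binary outcome (eviction).
gpa_truth = [3.0, 2.5, None]
gpa_pred = [2.8, 2.9, 3.1]
eviction_truth = [0, 1, 0]
eviction_prob = [0.1, 0.6, 0.2]

gpa_mse = mean_squared_error(gpa_truth, gpa_pred)
eviction_brier = brier_loss(eviction_truth, eviction_prob)
print(gpa_mse, eviction_brier)
```

Lower is better for both measures. Note that for binary outcomes the loss rewards well-calibrated probabilities, so predicting 0.1 for a rare event usually scores better than predicting a hard 0.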

A leaderboard will rank submissions according to these criteria, using a set of held-out data. After the challenge closes, we will produce a finalized ranking of submissions based on a separate set of withheld true outcome data.

Each of the 6 outcomes will be evaluated and ranked independently – feel free to focus on predicting one outcome well!

What does this mean for you?

You should produce a submission that performs well out of sample. Mean squared error is a function of both bias and variance. A linear regression model with lots of covariates is an unbiased predictor, but it might overfit the data and produce predictions that are highly sensitive to the sample used for training. Computer scientists often refer to this problem as the challenge of distinguishing the signal from the noise; you want to pick up on the signal in the training data without picking up on the noise.

An overly simple model will fail to pick up on meaningful signal. An overly complex model will pick up too much noise. Somewhere in the middle is a perfect balance – you can help us find it!
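
The two extremes can be illustrated with simulated data. In this sketch, which uses made-up data rather than anything from the Challenge, the outcome is a weak signal plus a lot of noise. One model always predicts the training mean (overly simple), and another copies the outcome of the nearest training case (overly complex, since it memorizes the training data). All names and parameters here are assumptions chosen for illustration.

```python
import random

random.seed(0)


def simulate(n):
    """Simulated data: a weak signal in x plus substantial noise."""
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [0.5 * x + random.gauss(0, 1) for x in xs]
    return xs, ys


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


train_x, train_y = simulate(200)
test_x, test_y = simulate(200)

train_mean = sum(train_y) / len(train_y)


def predict_mean(x):
    """Overly simple: ignore x and always predict the training mean."""
    return train_mean


def predict_nearest(x):
    """Overly complex: copy the outcome of the closest training case."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]


results = {}
for name, model in [("mean", predict_mean), ("nearest", predict_nearest)]:
    results[name] = (
        mse(train_y, [model(x) for x in train_x]),  # in-sample error
        mse(test_y, [model(x) for x in test_x]),    # out-of-sample error
    )
    print(name, results[name])
```

The nearest-case model fits the training data perfectly, but that perfect fit comes from memorizing noise, so its error on new data is far from zero. The leaderboard measures only the second number.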

Ian Lundberg

February 17, 2017

Causal inference

Uncategorized No comments

The Fragile Families Challenge presents a unique opportunity to probe the assumptions required for causal inference with observational data. This post introduces these assumptions and highlights the contribution of the Fragile Families Challenge to this scientific question.

Causal inference: The problem

Social scientists and policymakers often wish to use empirical data to infer the causal effect of a binary treatment D on an outcome Y. The causal effect for each respondent is the potential outcome that each observation would take under treatment (denoted Y(1)) minus the potential outcome that each observation would take under control (denoted Y(0)). However, we immediately run into the fundamental problem of causal inference: each observation is observed either under the treatment condition or under the control condition, never both, so one of the two potential outcomes is always unobserved.

The solution: Assumptions of ignorability

The gold standard for resolving this problem is a randomized experiment. By randomly assigning treatment, researchers can ensure that the potential outcomes are independent of treatment assignment, so that the average difference in outcomes between the two groups can only be attributable to treatment. This assumption is formally called ignorability.

Ignorability: {Y(0), Y(1)} ⊥ D

Because large-scale experiments are costly, social scientists frequently draw causal inferences from observational data based on a simplifying assumption of conditional ignorability.

Conditional ignorability: {Y(0), Y(1)} ⊥ D | X

Given a set of covariates X, conditional ignorability states that treatment assignment D is independent of the potential outcomes that would be realized under treatment Y(1) and control Y(0). In other words, two observations with the same set of covariates X but with different treatment statuses can be compared to estimate the causal effect of the treatment for these observations.
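
A small simulation can make the comparison concrete. In this sketch, a single binary covariate X raises both the chance of treatment D and the outcome Y, and the true effect of D is set to 1.0. Conditional ignorability holds by construction because X is the only confounder and it is observed. All numbers are made up for illustration.

```python
import random

random.seed(42)
TRUE_EFFECT = 1.0

# Simulate (x, d, y): x affects both treatment assignment and the outcome.
rows = []
for _ in range(100_000):
    x = 1 if random.random() < 0.5 else 0
    d = 1 if random.random() < 0.2 + 0.6 * x else 0
    y = TRUE_EFFECT * d + 2.0 * x + random.gauss(0, 1)
    rows.append((x, d, y))


def mean_y(data, d, x=None):
    """Mean outcome among observations with treatment d (and covariate x, if given)."""
    ys = [yi for (xi, di, yi) in data if di == d and (x is None or xi == x)]
    return sum(ys) / len(ys)


# Naive comparison: ignores X, so it mixes the effect of D with the effect of X.
naive = mean_y(rows, 1) - mean_y(rows, 0)

# Adjusted comparison: compare treated and control within each level of X,
# then average the two differences, weighting by the share of each level.
share_x1 = sum(xi for (xi, _, _) in rows) / len(rows)
adjusted = sum(
    weight * (mean_y(rows, 1, x) - mean_y(rows, 0, x))
    for x, weight in [(0, 1 - share_x1), (1, share_x1)]
)

print(round(naive, 2), round(adjusted, 2))
```

The naive difference is far from the true effect, while the within-X comparison recovers it. If X were unobserved, no amount of data would fix the naive estimate, which is why the assumption matters so much.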

Assessing the credibility of the ignorability assumption

Conditional ignorability is an enormous assumption, yet it is what the vast majority of social science findings rely on. By writing the problem in a Directed Acyclic Graph (DAG, Pearl 2000), we can make the assumption more transparent.

X represents pre-treatment confounders that affect both the treatment and the outcome. Though it is not the only way to do so, researchers often condition on X by estimating the probability of treatment given X, denoted P(T | X), where T is the treatment (written D above). Once we account for the differential probability of a treatment by the background covariates (through regression, matching, or some other method), we say we have blocked the noncausal backdoor paths connecting T and Y through X.

The key assumption in the left panel has to do with Ut. We assume that all unobserved variables that affect the treatment (Ut) have no effect on the outcome Y, except through T. This is depicted graphically by the dashed line from Ut to Y, which we must assume does not exist for causal inferences to be valid.

Researchers often argue that conditional ignorability is a reasonable assumption if the set of predictors included in X is extensive and detailed. The Fragile Families Challenge is an ideal setting in which to test the credibility of this assumption: we have a very detailed set of predictor variables X collected from birth through age 9, which occur temporally prior to treatments reported at age 15.

Nevertheless, the assumption of conditional ignorability is untestable. Interviews may provide some insight into the credibility of this assumption.

Goal of the Fragile Families Challenge: Targeted interviews

Through targeted interviews with particularly informative children, we might be able to learn something about the plausibility of the conditional ignorability assumption.

One of the binary variables in the Fragile Families Challenge is whether a child was evicted from his or her home. We will treat this variable as T. We want to know the causal effect of eviction on a child’s chance of graduating from high school (Y). In the Fragile Families Challenge, the set of observed covariates X is all 12,000+ predictor variables included in the Fragile Families Challenge data file.

Based on the ensemble model from the Fragile Families Challenge, we will identify 20 children who were evicted, and 20 similar children who had similar predicted probabilities of eviction but were not evicted. We will interview these children to find out why some were evicted and others were not.
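
One simple way to pair children on predicted probability is greedy nearest-neighbor matching, sketched below. This is only an illustration of the general idea; the actual selection procedure is not described here, and the function name, dictionary keys, and data are all hypothetical.

```python
def match_on_predicted_probability(children, n_pairs):
    """Pair each evicted child with the unused non-evicted child whose
    predicted probability of eviction is closest.

    children: list of dicts with hypothetical keys "id", "p_evict", "evicted"
    Returns a list of (evicted_id, comparison_id) tuples, at most n_pairs long.
    """
    evicted = [c for c in children if c["evicted"] == 1]
    available = [c for c in children if c["evicted"] == 0]
    pairs = []
    for child in evicted:
        if not available or len(pairs) == n_pairs:
            break
        best = min(available, key=lambda c: abs(c["p_evict"] - child["p_evict"]))
        available.remove(best)  # match without replacement
        pairs.append((child["id"], best["id"]))
    return pairs


# Made-up predicted probabilities for five children.
children = [
    {"id": 1, "p_evict": 0.30, "evicted": 1},
    {"id": 2, "p_evict": 0.28, "evicted": 0},
    {"id": 3, "p_evict": 0.10, "evicted": 1},
    {"id": 4, "p_evict": 0.12, "evicted": 0},
    {"id": 5, "p_evict": 0.05, "evicted": 0},
]
pairs = match_on_predicted_probability(children, n_pairs=20)
print(pairs)
```

Within each pair, the observed covariates predicted roughly the same risk of eviction, so whatever distinguishes the two children is a candidate unmeasured factor.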

Potential interviews in support of conditional ignorability:

Suppose we find that children were evicted because their landlords were ready to retire and wanted to get out of the housing market. Those who were not evicted had younger landlords. It might be plausible that the age of one’s landlord is an example of Ut: a variable that affects eviction but has no effect on high school graduation except through eviction. While this would not prove the conditional ignorability assumption, the assumption might seem reasonable in this case.

Potential interviews that discredit conditional ignorability:

Suppose instead that we find a different story. Gang activity increased in the neighborhoods of some families, escalating to the point that landlords decided to get out of the business and evict all of their tenants. Other families lived in neighborhoods with no gang activity, and they were not evicted. In addition to its effect on eviction, it is likely that gang activity would alter the chances of high school graduation in other ways, such as by making students feel unsafe at school. In this example, gang activity plays the role of Uty and would violate the assumption of conditional ignorability.

Summary

Because costs prohibit randomized experiments to evaluate all potential treatments of interest to social scientists, scholars frequently rely on the assumption of conditional ignorability to draw causal claims from observational data. This is a strong and untestable assumption. The Fragile Families Challenge is a setting in which the assumption may be plausible, due to the richness of the covariate set X, which includes over 12,000 pre-treatment variables chosen for their potentially important ramifications for child development.

By interviewing a targeted set of children chosen by ensemble predictions of the treatment variables, we will shed light on the credibility of the ignorability assumption.

Matt Salganik

February 10, 2017

Upload your contribution

Uncategorized No comments

This post will walk you through the steps to prepare your files for submission and upload them to the submission platform. The organizer of your group (i.e. your professor or TA) will provide a link to the submission platform.

1. Save your predictions as prediction.csv.

This file should be structured the same way as the “prediction.csv” file provided as part of your data bundle.

This file should have 4,242 rows: one for each child in the data, covering both the training cases and the held-out test cases.

Do I need to make predictions for both training and test sets?

We are asking you to make predictions for all 4,242 cases, which includes both the training cases from train.csv and the held-out test cases. We would prefer that you not simply copy these cases from train.csv to prediction.csv. Instead, please submit the predictions that come out of your model. This way, we can compare your performance on the training and test sets, to see whether those who fit closely to the training set perform more poorly on the test set (see our blog discussing overfitting). Your scores will be determined on the basis of test observations alone, so your predictions for the cases included in train.csv will not affect your score.

Are there NA cases? How will these be scored?

There are some observations that are truly missing: we do not have the true answer for these cases because respondents did not complete the interview or did not answer the question. This is true for both the training and the test sets. Your predictions for these cases will not affect your scores. We are asking you to make predictions for missing cases because it is possible that we will find those respondents sometime in the future and uncover the truth. It will be scientifically interesting to know how well the community model was able to predict these outcomes which even the survey staff did not know at the time of the Challenge.

This file should have 7 columns for the ID number and the 6 outcomes. They should be named:

challengeID, gpa, grit, materialHardship, eviction, layoff, jobTraining

The top of the file will look like this (numbers here are random). challengeID numbers can be in any order.
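
Before uploading, it can help to check the file yourself. The sketch below is one possible check, not the platform’s own validation: it confirms the seven column names, the row count, that each challengeID appears once, and that every prediction is numeric. The function name and the demonstration values are made up.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = ["challengeID", "gpa", "grit", "materialHardship",
                    "eviction", "layoff", "jobTraining"]


def validate_prediction_file(path, expected_rows=4242):
    """Return a list of problems found in a prediction file (empty if none)."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        rows = list(reader)

    if len(header) != len(EXPECTED_COLUMNS) or set(header) != set(EXPECTED_COLUMNS):
        return [f"unexpected columns: {header}"]

    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    ids = [row["challengeID"] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate challengeID values")
    for row in rows:
        for col in EXPECTED_COLUMNS[1:]:
            try:
                float(row[col])
            except (TypeError, ValueError):
                problems.append(f"non-numeric {col} for challengeID {row['challengeID']}")
    return problems


# Demonstration on a tiny made-up file (3 rows instead of 4,242).
demo_rows = [
    [1, 2.9, 3.4, 0.10, 0.06, 0.21, 0.23],
    [2, 3.1, 3.5, 0.12, 0.05, 0.20, 0.25],
    [3, 2.7, 3.3, 0.15, 0.08, 0.19, 0.22],
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "prediction.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(EXPECTED_COLUMNS)
        writer.writerows(demo_rows)
    demo_problems = validate_prediction_file(demo_path, expected_rows=3)

print(demo_problems)
```

For a real submission you would call `validate_prediction_file("prediction.csv")` with the default of 4,242 rows and fix anything it reports before zipping.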

2. Save your code.

3. Create a narrative explanation of your study. This should be saved in a file called “narrative” and can be a text file, PDF, or Word document.

At the top of this narrative explanation, tell us the names of everyone on the team that produced the submission, or your name if you worked alone, in the format:

Homer Simpson,
homer@gmail.com

Marge Simpson,
msimpson@gmail.com

Then, tell us about how you developed the submission. This might include your process for preparing the data for analysis, methods you used in the analysis, how you chose the submission you settled on, things you learned, etc.

4. Zip all the files together in one folder.

It is important that the files be zipped in a folder with no sub-directories. Instructions are different for Mac and Windows.

On Mac, highlight all of the individual files.

Right click and choose “Compress 3 items”.

On Windows, highlight all of the individual files.

Right click and choose
Send to -> Compressed (zipped) folder
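
If you would rather script this step, the sketch below builds a zip file in which every file sits at the top level, which is the property that matters here. It is an alternative to the point-and-click instructions, not something the platform requires. The file names `code.py`, `narrative.txt`, and `submission.zip` are placeholders.

```python
import os
import tempfile
import zipfile


def zip_flat(file_paths, zip_path):
    """Zip files so that each one sits at the top level of the archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in file_paths:
            # arcname drops the directory part, so no sub-directories are created.
            zf.write(path, arcname=os.path.basename(path))


# Demonstration with placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name in ["prediction.csv", "code.py", "narrative.txt"]:
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write("placeholder\n")
        paths.append(path)

    zip_path = os.path.join(tmp, "submission.zip")
    zip_flat(paths, zip_path)

    with zipfile.ZipFile(zip_path) as zf:
        archived_names = zf.namelist()

print(archived_names)
```

Listing the archive contents, as the last step does, is a quick way to confirm that none of the names contain a folder prefix.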

5. Upload the zipped folder to the submission site. The link to this will be provided to you by the organizers (i.e. your professor or TA) of your specific instance of the Fragile Families Challenge.

Click the “Participate” tab at the top, then the “Submit / View Results” tab on the left. Click the “Submit” button to upload your submission.

6. Wait for the platform to evaluate your submission.

Click “Refresh status” next to your latest submission to view its updated status and see results when they are ready. If successful, you will automatically be placed on the leaderboard when evaluation finishes.

Matt Salganik

February 10, 2017

Build a model

Uncategorized No comments

Take our data and build models for the 6 child outcomes at age 15. Your model might draw on social science theories about variables that affect the outcomes. It might be a black-box machine learning algorithm that is hard to interpret but performs well. Perhaps your model is some entirely new combination no one has ever seen before!

The power of the Fragile Families Challenge comes from the heterogeneity of quality individual models we receive. By working together, we will harness the best of a range of modeling approaches. Be creative and show us how well your model can perform!

There are missing values. What do I do?

See our blog post on missing data.

What if I have several ideas?

You can try them all and then choose the best one! Our submission platform allows you to upload up to 10 submissions per day. Submissions will instantly be scored, and your most recent submission will be placed on the leaderboard. If you have several ideas, we suggest you upload them each individually and then upload a final submission based on the results of the individual submissions.

What if I don’t have time to make 6 models?

You can make predictions for whichever outcome interests you. To upload a submission with the appropriate file size, make a simple numeric guess for the rest of the outcomes. For instance, you might develop a careful model for grit, and then guess the mean of the training values for all of the remaining five outcomes. This would still allow you to upload 6 sets of predictions to the scoring function.
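
As a sketch of this approach, the function below starts from rows that already hold a placeholder guess for every outcome (such as the training mean) and overwrites a single outcome with your model’s predictions. The function name and all values are made up for illustration.

```python
def fill_one_outcome(skeleton_rows, outcome, predictions):
    """Overwrite one outcome column and leave the other guesses untouched.

    skeleton_rows: list of dicts, one per child, with a guess for every outcome
    outcome:       name of the outcome you modeled, e.g. "grit"
    predictions:   dict mapping challengeID to your predicted value
    """
    filled = []
    for row in skeleton_rows:
        new_row = dict(row)  # copy so the skeleton itself is not modified
        if row["challengeID"] in predictions:
            new_row[outcome] = predictions[row["challengeID"]]
        filled.append(new_row)
    return filled


# Two made-up skeleton rows holding placeholder guesses for all six outcomes.
skeleton = [
    {"challengeID": 1, "gpa": 2.9, "grit": 3.4, "materialHardship": 0.10,
     "eviction": 0.06, "layoff": 0.21, "jobTraining": 0.23},
    {"challengeID": 2, "gpa": 2.9, "grit": 3.4, "materialHardship": 0.10,
     "eviction": 0.06, "layoff": 0.21, "jobTraining": 0.23},
]
grit_predictions = {1: 3.8, 2: 3.1}  # hypothetical output of a grit model

filled = fill_one_outcome(skeleton, "grit", grit_predictions)
print(filled[0]["grit"], filled[0]["gpa"])
```

Only the grit column changes; the other five columns keep their placeholder values, so the file still has the full set of 6 predictions per child.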

Matt Salganik

February 10, 2017

Apply to participate

Uncategorized No comments

The Fragile Families Challenge is now closed. We are no longer accepting applications!

What will happen after I apply?

We will review your application and be in touch by e-mail. This will likely take 2-3 business days. If we invite you to participate, you will be asked to sign a data protection agreement. Ultimately, each participant will be given a zipped folder which consolidates all of the relevant pieces of the larger Fragile Families and Child Wellbeing Study in three .csv files.

background.csv contains 4,242 rows (one per child) and 12,943 columns:

challengeID: A unique numeric identifier for each child.
12,942 background variables asked from birth to age 9, which you may use in building your model.

train.csv contains 2,121 rows (one per child in the training set) and 7 columns:

challengeID: A unique numeric identifier for each child.
Six outcome variables (each variable name links to a blog post about that variable)
1. Continuous variables: grit, gpa, materialHardship
2. Binary variables: eviction, layoff, jobTraining

prediction.csv contains 4,242 rows and 7 columns:

challengeID: A unique numeric identifier for each child.
Six outcome variables, as in train.csv. These are filled with the mean value in the training set. This file is provided as a skeleton for your submission; you will submit a file in exactly this form but with your predictions for all 4,242 children included.
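
A quick way to confirm that the bundle unpacked correctly is to compare each file’s dimensions with the numbers above. The sketch below does this with Python’s standard library; the function names are illustrative, and the demonstration uses a tiny made-up file rather than the real data.

```python
import csv
import os
import tempfile

# Shapes stated above as (rows, columns), not counting the header row.
EXPECTED_SHAPES = {
    "background.csv": (4242, 12943),
    "train.csv": (2121, 7),
    "prediction.csv": (4242, 7),
}


def csv_shape(path):
    """Return (number of data rows, number of columns) for a CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return n_rows, len(header)


def check_bundle(folder):
    """Map each expected file name to True if its shape matches, else False."""
    return {
        name: csv_shape(os.path.join(folder, name)) == shape
        for name, shape in EXPECTED_SHAPES.items()
    }


# Demonstration of csv_shape on a tiny made-up file with 2 rows and 7 columns.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["challengeID", "gpa", "grit", "materialHardship",
                         "eviction", "layoff", "jobTraining"])
        writer.writerow([1, 3.0, 3.5, 0.1, 0, 0, 1])
        writer.writerow([2, 2.5, 3.0, 0.2, 0, 1, 0])
    demo_shape = csv_shape(demo_path)

print(demo_shape)
```

With the real data you would call `check_bundle` on the unzipped folder and expect every value to be True.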

Understanding the background variables

To use the data, it may be useful to know something about what each variable (column) represents. Full documentation is available here, but this blog post distills the key points.

Waves and child ages

The background variables were collected in 5 waves.

Wave 1: Collected in the hospital at the child’s birth.
Wave 2: Collected at approximately child age 1
Wave 3: Collected at approximately child age 3
Wave 4: Collected at approximately child age 5
Wave 5: Collected at approximately child age 9

Note that wave numbers are not the same as child ages. The variable names and survey documentation are organized by wave number.

Variable naming conventions

Predictor variables are identified by a prefix and a question number. Prefixes indicate the survey in which a question was collected. This is useful because the documentation is organized by survey. For instance the variable m1a4 refers to the mother interview in wave 1, question a4.

The prefix c in front of any variable indicates variables constructed from other responses. For instance, cm4b_age is constructed from the mother wave 4 interview, and captures the child’s age (baby’s age).
m1, m2, m3, m4, m5: Questions asked of the child’s mother in wave 1 through wave 5.
f1,...,f5: Questions asked of the child's father in wave 1 through wave 5
hv3, hv4, hv5: Questions asked in the home visit in waves 3, 4, and 5.
p5: Questions asked of the primary caregiver in wave 5.
k5: Questions asked of the child (kid) in wave 5
ffcc: Questions asked in various child care provider surveys in wave 3
kind: Questions asked of the kindergarten teacher
t5: Questions asked of the teacher in wave 5.
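
The naming conventions above can be turned into a small parser, which is handy for grouping the thousands of columns by survey and wave. The sketch below is a rough reading of the rules in this post, not an official tool: the regular expression is deliberately loose (for example, it would accept a prefix like p3 that is not listed above), and names that do not fit the pattern return None.

```python
import re

SOURCES = {
    "m": "mother",
    "f": "father",
    "hv": "home visit",
    "p": "primary caregiver",
    "k": "child",
    "t": "teacher",
    "ffcc": "child care provider",
    "kind": "kindergarten teacher",
}

# Optional "c" (constructed), then either a survey without a wave digit
# (ffcc, kind) or a respondent code followed by a wave number from 1 to 5.
PATTERN = re.compile(r"^(c)?(?:(ffcc|kind)|(hv|[mfpkt])([1-5]))(.*)$")


def parse_variable_name(name):
    """Split a variable name into its parts, or return None if it does not fit."""
    match = PATTERN.match(name)
    if match is None:
        return None
    constructed, survey, respondent, wave, question = match.groups()
    if survey is not None:
        source = SOURCES[survey]
        wave_number = 3 if survey == "ffcc" else None  # ffcc surveys were in wave 3
    else:
        source = SOURCES[respondent]
        wave_number = int(wave)
    return {
        "constructed": constructed is not None,
        "source": source,
        "wave": wave_number,
        "question": question,
    }


print(parse_variable_name("m1a4"))
print(parse_variable_name("cm4b_age"))
print(parse_variable_name("challengeID"))
```

Remember that the wave number is not the child’s age: a variable parsed as wave 5, for example, was collected at approximately age 9.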

Ready to work with the data?

See our posts on building a model and working with missing data.