Author matt-salganik

Upcoming event: The Future of Big Data and Survey Methods

Uncategorized No comments

We are excited to have a chance to discuss the Fragile Families Challenge as part of a panel at the University of Michigan, Institute for Social Research. The title of the panel is: The Future of Big Data and Survey Methods. Please join us at the event. More information is below.

New Data Science methods and mass collaborations pose both exciting opportunities and important challenges for social science research. This panel will explore the relationship between these new approaches and traditional survey methodology. Can they coexist, or even enrich one another? Matthew Salganik is one of the lead organizers of the Fragile Families Challenge, which uses data science approaches such as predictive modeling, mass collaboration, and ensemble techniques. Jeremy Freese is co-PI of the General Social Survey and of a project on collaborative research in the social sciences. Colter Mitchell has conducted innovative work combining biological data and methods with Fragile Families and other survey data sets.

Sponsored by the Computational Social Science Rackham Interdisciplinary Workshop and the Population Studies Center’s Freedman Fund.

Friday, 10/6/2017, 3:10 PM
Location: 1430 ISR-Thompson

Correction to prize winners


When newspapers have to correct a published article, they issue a correction that notes the errors in the prior version and how they have been corrected. Following this logic, this blog post explains a correction we have made to the prize winners blog.

At the close of the Challenge, one team (MDRC) mistakenly believed that the submission deadline, listed as 6pm UTC on Codalab, was 6pm Eastern Time. After the close of the Challenge at 2pm, they were unable to upload their submission. They emailed us very soon after the 2pm deadline indicating that they had misunderstood. Our Board of Advisors reviewed the case carefully and decided to accept this submission. We made this decision before we opened the holdout data.

When we actually evaluated the submissions with the holdout data, we downloaded all final submissions from Codalab and neglected to add the e-mailed MDRC submission to the set. The team noticed they were not on the final scores page and emailed us to ask. A week after opening the holdout set, we added their submission to the set, re-evaluated all scores, and discovered that this team had achieved the best score in eviction and job training, two prizes we had already awarded to other teams.

In consultation with our Board of Advisors, we decided to do three things.

First, we updated the final prize winners to recognize MDRC.

Second, we recognized that this was an unusual situation. Other teams had rushed to the 2pm deadline and might have scored better with a few extra hours of work. For this reason, we decided to create a new category: special honorary prizes. If MDRC won for an outcome, the second-place team (i.e. the team that was in first place at the close of the Challenge at 2pm) would be awarded a special honorary prize.

Third, we updated the prize winners figure and score ranks to include MDRC along with all submissions previously included.

All prize winners (final, progress, innovation, foundational, and special honorary) are invited to an all-expense-paid trip to Princeton University to present their findings at the scientific workshop.

Prize winners


The Fragile Families Challenge received over 3,000 submissions from more than 150 teams between the pilot launch on March 3, 2017, and the close on August 1, 2017. Each team’s final submission score on the holdout set is provided at this link. In this blog post, we are excited to announce the prize winners!

Final prizes

We are awarding prizes to the top-scoring submissions for each outcome, as measured by mean-squared error. The winners are:

  • GPA: sy (MIT Media Lab, Human Dynamics Group: Abdullah Almaatouq, Eaman Jahani, Daniel Rigobon, Yoshihiko Suhara, Khaled Al-Ghoneim, Abdulla Alhajri, Abdulaziz Alghunaim, Alfredo Morales-Guzman)
  • Grit: sy (MIT Media Lab, Human Dynamics Group: Abdullah Almaatouq, Eaman Jahani, Daniel Rigobon, Yoshihiko Suhara, Khaled Al-Ghoneim, Abdulla Alhajri, Abdulaziz Alghunaim, Alfredo Morales-Guzman)
  • Material hardship: haixiaow (Diana Stanescu, Erik H. Wang, and Soichiro Yamauchi; Ph.D. students, Department of Politics, Princeton University)
  • Eviction: MDRC (Kristin Porter, Richard Hendra, Tejomay Gadgil, Sarah Schell, and Meghan McCormick)
  • Layoff: Pentlandians (MIT Media Lab, Human Dynamics Group: Abdullah Almaatouq, Eaman Jahani, Daniel Rigobon, Yoshihiko Suhara, Khaled Al-Ghoneim, Abdulla Alhajri, Abdulaziz Alghunaim, Alfredo Morales-Guzman)
  • Job training: MDRC (Kristin Porter, Richard Hendra, Tejomay Gadgil, Sarah Schell, and Meghan McCormick)

Progress prizes

As promised, we are also awarding progress prizes to the top-scoring submissions for each outcome among submissions made by May 10, 2017 at 2pm Eastern Time. The teams with the best submissions as of this deadline are:

  • GPA: ovarol (Onur Varol, postdoctoral researcher at the Center for Complex Network Research, Northeastern University Networks Science Institute)
  • Grit: rap (Derek Aguiar, Postdoctoral Researcher, and Ji-Sung Kim, Undergraduate Student, Department of Computer Science, Princeton, NJ)
  • Material hardship: ADSgrp5
  • Eviction: kouyang (Karen Ouyang and Julia Wang, Princeton Class of 2017)
  • Layoff: the_Brit (Professor Stephen McKay, School of Social & Political Sciences, University of Lincoln, UK)
  • Job training: nmandell (Noah Mandell, Ph.D. candidate in plasma physics at Princeton University)

Foundational award

Greg Gunderson (ggunderson) produced machine-readable metadata that turned out to be very helpful for many participants. You can read more about the machine-readable metadata in our blog post on the topic. In addition to being useful to participants, this contribution was also inspirational for the Fragile Families team. They saw what Greg did and wanted to build on it. A team of about 8 people is now working to standardize aspects of the dataset and make more metadata available. Because Greg provided a useful tool for other participants, open-sourced all aspects of the tool, and inspired important changes that will make the larger Fragile Families project better, we are awarding him the foundational award.

Innovation awards

The Board of Advisers of the Fragile Families Challenge would also like to recognize several teams for particularly innovative contributions to the Challenge. For these prizes, we only considered teams that were not already recognized for one of the awards above. Originally, we planned to offer two prizes: “most novel approach using ideas from social science” and “most novel approach using ideas from data science.” Unfortunately, this proved very hard to judge because many of the best submissions combined data science and social science.

Therefore, after much deliberation and debate, we have decided to award two prizes for innovation. These submissions each involved teams of people working collaboratively. Each team thought carefully about the raw data and cleaned variables manually to provide useful inputs to the algorithm, much as a social scientist typically would. Each team then implemented well-developed machine learning approaches to yield predictive models.

We are recognizing the following teams:

  • bjgoode (Brian J. Goode, Virginia Tech, acknowledging Dichelle Dyson and Samantha Dorn)
  • carnegien (Nicole Carnegie, Montana State University, and Jennifer Hill and James Wu, New York University)

We are encouraging these teams to prepare blog posts and manuscripts to explain their approaches more fully. To be clear, however, there were many, many innovative submissions, and we think that a lot of creative ideas were embedded in code and hard to extract from the short narrative explanations. We hope that all of you will get to read about these contributions and more in the special issue of Socius.

Special honorary prizes

As explained in our correction blog post, our Board of Advisors decided to accept a submission that arrived shortly after the deadline, because of confusing statements on our websites about the hour at which the Challenge closed. This team (mdrc) had the best score for two outcomes (eviction and job training) and was awarded the final prize for each of these outcomes. Because we recognize that this was an unusual situation, we are awarding special honorary prizes to the second-place teams for each of these outcomes.

  • Eviction: kouyang (Karen Ouyang and Julia Wang, Princeton Class of 2017)
  • Job training: malte (Malte Moeser, Ph.D. student, Department of Computer Science, Princeton University)


Thank you again to everyone that participated. We look forward to more exciting results to come in the next steps of the Fragile Families Challenge, and we hope you will join us for the scientific workshop (register here) at Princeton University on November 16-17!

Fragile Families Challenge, next steps


Stage one of the Fragile Families Challenge, the predictive modeling stage, ended today at 2pm ET.  We are grateful to everyone who participated.  This is not, however, the end of the Fragile Families Challenge.  In fact, there are many important and exciting things to come.

We are looking forward to all of the next steps in the Fragile Families Challenge.

Fragile Families Challenge Scientific Workshop, Nov 16 & 17


We are happy to announce the Fragile Families Challenge Scientific Workshop will take place November 16th and 17th (Thursday and Friday) at Princeton University.  The workshop is open to everyone interested in the Challenge, and we will be livestreaming it for people who are not able to travel to Princeton (livestream link, no registration required).

On Thursday, we will meet in Palmer House (map). On Friday, we will meet in Wallace Hall 300 (map).

The schedule will be:

  • Thursday: Workshop of submissions to the special issue of Socius and breakout projects. All are welcome regardless of whether you have written a paper.
  • Friday: Presentations by prize winners. All are welcome to attend, regardless of whether you have won a prize.

If you plan to join us, please complete the registration form.

Thursday, November 16

The first day of the Fragile Families Challenge Scientific Workshop will be devoted to: 1) workshopping papers submitted to the Special Issue of Socius on the Fragile Families Challenge and 2) working on breakout projects.

Workshopping papers

Before the workshop, each participant will be sent 3 papers to read. Participants are expected to read these papers before arriving at the workshop. This will ensure that we have a lively and focused discussion. If you are unable to read the papers ahead of time, please let us know and plan to arrive at lunch time.

Then at the workshop, each paper will be discussed for 45 minutes in a series of parallel roundtable sessions. There will be no presentations by the author because everyone will have read the paper ahead of time. Instead, each session will begin with very brief comments by a pre-assigned moderator, and then there will be a group discussion, which will be facilitated by the moderator. We expect these to be lively, fascinating, and generative discussions.

  • “Black Box Models and Sociological Explanations: Predicting GPA Using Neural Networks” by Thomas Davidson
  • “Humans in the Loop: Priors and Missingness on the Road to Prediction” by Connor Gilroy, Anna Filippova, Ridhi Kashyap, Antje Kirchner, Allison Morgan, Kivan Polimis, Adaner Usmani, and Tong Wang
  • “Privacy, ethics, and high-dimensional social science data: A case study of the Fragile Families Challenge” by Ian Lundberg, Arvind Narayanan, Karen E.C. Levy, and Matthew J. Salganik
  • “Making the analysis of complex survey data more efficient, reliable, and enjoyable: A case study from the Fragile Families Challenge” by Alexander T. Kindel, Kristin Catena, Tom Hartshorne, Kate Jaeger, Dawn Koffman, Sara S. McLanahan, Maya Phillips, Shiva Rouhani, and Matthew J. Salganik
  • “Modeling and Decision Making with Social Systems: Lessons Learned from the Fragile Families Challenge” by Brian Goode, Debanjan Datta, and Naren Ramakrishnan
  • “The Pentlandians ensemble: Winning models for GPA, grit, and layoff in the Fragile Families Challenge” by Daniel Rigobon, Eaman Jahani, Yoshihiko Suhara, Khaled AlGhoneim, Abdulaziz Alghunaim, Alex Pentland, and Abdullah Almaatouq
  • “Predicting material hardship using machine learning” by Erik H. Wang, Diana Stanescu, and Soichiro Yamauchi
  • “The challenges of data science from social science: Using social science knowledge in the Fragile Families Challenge” by Stephen McKay
  • “Predictive features of children GPA in Fragile Families” by Naijia Liu, Hamidreza Omidvar, and Jinjin Zhao
  • “Variable selection and parameter tuning for BART modeling in the Fragile Families Challenge” by James Wu and Nicole Carnegie

Breakout activities

With so many amazing people all in one place, we also wanted to leave time for ideas that you propose, either ahead of time or as a result of the workshopping of the papers. Any participant can propose an idea, and then people can choose which one they want to work on. We’re also going to propose the following projects:

  • Code walkthrough for the new Fragile Families metadata API (led by Maya Phillips)
  • Demo and testing for the new metadata website (led by Alex Kindel)
  • Testing and improving the Docker container that ensures reproducibility of the special issue (led by David Liu)
  • Assessing test-retest reliability of concept tags (led by Kristin Catena)
  • Digitizing question and answer texts (led by Tom Hartshorne)

Schedule for Thursday

  • 08:30 – 09:00 Breakfast
  • 09:00 – 09:15 Intro
  • 09:15 – 10:00 Round 1 of papers
  • 10:00 – 10:15 Break
  • 10:15 – 11:00 Round 2 of papers
  • 11:00 – 11:15 Break
  • 11:15 – 12:00 Round 3 of papers
  • 12:00 – 1:00 Lunch
  • 1:00 – 1:30 Discussion of project ideas
  • 1:30 – 5:00 Breakout activities
  • 5:00 – 6:00 Break
  • 6:00 – ??? Dinner

Friday, November 17

The second day of the Fragile Families Challenge Scientific Workshop will be devoted to presentations from the organizers and prize winners.

  • 8:30 – 9:00. Breakfast
  • 9:00 – 9:45. Welcome, Overview of the Fragile Families Challenge
    • Matthew J. Salganik, Professor of Sociology, Princeton University
    • Sara S. McLanahan, William S. Tod Professor of Sociology and Public Affairs at Princeton University and Principal Investigator of the Fragile Families and Child Wellbeing Study
  • 9:45 – 10:00. Break
  • 10:00 – 11:15. Presentations from progress prize winners and discussion
    • Onur Varol, Postdoctoral Researcher, Center for Complex Network Research, Networks Science Institute, Northeastern University
    • Julia Wang, Princeton University
    • Stephen McKay, School of Social & Political Sciences, University of Lincoln, UK
  • 11:15 – 11:30. Break
  • 11:30 – 12:00. Presentation from foundational prize winner and discussion
    • Gregory Gundersen, PhD Student in Computer Science, Princeton University
  • 12:00 – 1:00. Lunch
  • 1:00 – 2:00. Presentations from innovation prize winners and discussion
    • Nicole Carnegie, Assistant Professor of Statistics, Montana State University
    • Brian J. Goode, Research Scientist, Discovery Analytics Center, Virginia Tech
  • 2:00 – 2:30. Break
  • 2:30 – 4:00. Presentations from final prize winners and discussion
    • Kristin Porter, Senior Associate, MDRC
    • Diana Stanescu, Erik H. Wang, and Soichiro Yamauchi, Ph.D. students, Department of Politics, Princeton University
    • Abdullah Almaatouq (MIT), Eaman Jahani (MIT), Daniel E. Rigobon (MIT), Yoshihiko Suhara (Recruit Institute of Technology and MIT)
  • 4:00 – 4:30. Break
  • 4:30 – 5:00. What’s next
    • Matthew J. Salganik, Professor of Sociology, Princeton University
    • Sara S. McLanahan, William S. Tod Professor of Sociology and Public Affairs at Princeton University and Principal Investigator of the Fragile Families and Child Wellbeing Study

We will update this page as we have more information. If you have any questions about the Scientific Workshop, please email us.

Getting scores on holdout data


As described in an earlier blog post, there will be a special issue of Socius devoted to the Fragile Families Challenge. We think that the articles in this special issue would benefit from reporting their scores on both the leaderboard data and the holdout data. However, we don’t want to release the holdout data on August 1 because that could lead to non-transparent reporting of results. Therefore, beginning on August 1, we will do a controlled release of the scores on the holdout data. Here’s how it will work:

  • All models for the special issue must be submitted by August 1.
  • Between August 1 and October 16, you can complete a web form requesting scores on the holdout data for a list of the models. We will send you those scores.
  • You must report all the scores you requested in your manuscript or the supporting online material. We are requiring you to report all the scores that you request in order to prevent selective reporting of especially good results.

We realize that this procedure is a bit cumbersome, but we think that this extra step is worthwhile in order to ensure the most transparent reporting possible of results.

Submit your request for scores here.

Event at the American Sociological Association Meeting


We are happy to announce that there will be a Fragile Families Challenge event Sunday, August 13 at 2pm at the American Sociological Association Annual Meeting in Montreal. We will gather at the Fragile Families and Child Wellbeing Study table in the Exhibit Hall (220c). We are the booth in the back right (booth 925). This will be a great chance to meet other participants, share experiences, and learn more about the next stages of the mass collaboration and the Fragile Families study more generally. See you in Montreal!

A Data Pipeline for the Fragile Families Challenge


Guest blog post by Anna Filippova, Connor Gilroy, and Antje Kirchner

In this post, we discuss the challenges of preparing the Fragile Families data for modeling, as well as the rationales for the methods we chose to address them. Our code is open source, and we hope other Challenge participants find it a helpful starting point.

If you want to dive straight into the code, start with the vignette here.

Data processing

The people who collect and maintain the Fragile Families data have years of expertise in understanding the data set. As participants in the Fragile Families Challenge, we had to use simplifying heuristics to get a grasp on the data quickly, and to transform as much of it as possible into a form suitable for modeling.

A critical step is to identify the different variable types, or levels of measurement. This matters because most statistical modeling software transforms categorical covariates into a series of k – 1 binary variables, while leaving continuous variables untransformed. Because categorical variables are stored as integers, with associated strings as labels, a researcher could just use those integers directly in a model instead—but there is no guarantee that they would be substantively meaningful. For interpretation, and potentially for predictive performance, accounting for variable type is important.

This seems like a straightforward problem. After all, it is typically clear whether a given variable is categorical or continuous from the description in the codebook. With a handful of variables, classifying them manually is a trivial task, but this is impossible with over 12,000 variables. An automated solution that works well for the majority of variables is to leverage properties of the Stata labels, using haven, to convert each variable into the appropriate R class—factor for categorical variables, numeric for continuous. We previously released the results of this work as metadata, and here we put it to use.
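The payoff of classifying variables correctly shows up when building a design matrix: categorical columns expand into k – 1 binary indicators while continuous columns pass through. The authors did this in R via haven; the sketch below is a Python/pandas analogue with made-up column names, not actual Fragile Families variables.

```python
import pandas as pd

# Toy frame mixing one categorical and one continuous column
# (column names are illustrative, not real Fragile Families variables).
df = pd.DataFrame({
    "marital_status": ["married", "cohabiting", "single", "married"],
    "household_income": [42000.0, 31000.0, 18000.0, 55000.0],
})

# Mark the categorical column explicitly, then expand it into
# k - 1 binary indicators; the continuous column is left untouched.
df["marital_status"] = df["marital_status"].astype("category")
design = pd.get_dummies(df, columns=["marital_status"], drop_first=True)
```

With `drop_first=True`, the first category (here "cohabiting") becomes the reference level, so three categories yield two indicator columns.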

A second problem similarly arises from the large number of variables in the Fragile Families data.  While some machine learning models can deal with many more parameters than observations (p >> n), or with high amounts of collinearity among covariates, most imputation and modeling methods run faster and more successfully with fewer covariates. Particularly when demonstrating or experimenting with different modeling approaches, it’s best to start out with a smaller set of variables. If the constructed variables represent expert researchers’ best attempts to summarize, consolidate, and standardize survey responses across waves, then those variables make a logical starting point. Fortunately, most of these variables can be identified with a simple regular expression.
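Selecting the constructed variables by name can be sketched as below. The pattern shown is an illustrative guess based on the naming convention visible later in this post (e.g. cm1hhinc for a constructed, mother, wave-1 variable), not the exact expression the authors used.

```python
import re

# Constructed Fragile Families variables carry a "c" prefix before the
# respondent code and wave number (e.g. cm1hhinc = constructed, mother,
# wave 1, household income). This pattern is a hypothetical example.
CONSTRUCTED = re.compile(r"^c[mfhk]\d")

columns = ["cm1hhinc", "cf1hhinc", "m1a4", "f2b10", "cm2hhinc"]
constructed = [c for c in columns if CONSTRUCTED.match(c)]
```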

Finally, to prepare for imputation, Stata-style missing values (labelled negative numbers) need to be converted to R-style NAs.
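This last conversion step might look like the following in pandas (the specific missing-value codes shown are examples; the Fragile Families codebooks define the full set of negative labels):

```python
import numpy as np
import pandas as pd

# Stata-style missing codes are labelled negative integers.
# The values below are toy data; real codes include labels such as
# "refuse", "don't know", and "not in wave".
df = pd.DataFrame({
    "cm1hhinc": [42000.0, -9.0, 31000.0, -3.0],
    "cf1hhinc": [40000.0, 28000.0, -6.0, 52000.0],
})

# Replace every negative value with NaN to prepare for imputation.
df = df.mask(df < 0, np.nan)
```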

Missing data

Data may be missing in a (panel) study for many reasons, including respondent’s unwillingness to answer a question, a don’t know response, skip logic (for questions that do not apply to a given respondent), and panel attrition (for example, due to locating difficulties for families). Additional missing data might be due to data entry errors and—particularly relevant for the challenge—anonymization to protect sensitive information of members of a particularly vulnerable population.

What makes missing data such a challenge for computational approaches? Many statistical algorithms operate on complete data, often obtained through listwise deletion of cases. This effectively assumes that instances are missing completely at random. The Fragile Families data are not missing completely at random; moreover, the sheer amount of missingness would leave few cases remaining after listwise deletion. We would expect a naive approach to missingness to significantly reduce the predictive power of any statistical model.

Therefore, a better approach is to impute the missing data, that is, make a reasonable guess about what the missing values could have been. However, current approaches to data imputation have some limitations in the context of the Fragile Families data:

  • Standard packages like Amelia perform multiple imputation from a multivariate normal distribution, hence they are unable to work on the full set of 12,000 covariates with only 4,000 observations. This approach is also computationally intensive, taking several hours to run even when using a regularizing prior, a subset of variables, and running individual imputations in parallel.
  • Another promising approach would be to use Full Information Maximum Likelihood estimation. FIML estimation models sparse data without the need for imputation, thus offering better performance. However, no open-source implementation for predictive modeling with FIML exists at present.
  • We could also use the existing structure of the data to make logical edits. For instance, if we know a mother’s age in one wave, we can extrapolate this to subsequent waves if those values are missing. Carrying this idea a step further, we can make simple model-based inferences; if, for example, a father’s age is missing entirely, we can impute this from the distribution of differences between mother’s and father’s ages. This process, however, requires treating each variable individually.

To address some of these issues, our approach to missing data considers each variable in the dataset in isolation (for example cm1hhinc, mother’s reported household income at wave 1), and attempts to automatically identify other variables in the dataset that may be strongly associated with it (such as cm2hhinc, mother’s reported household income at wave 2, and cf1hhinc, father’s reported household income at wave 1). Assembling a set of 3 to 5 such associations per variable allows us to construct a simple multiple-regression model to predict the possible value of the missing data for each column (variable) of interest.
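The authors' FFCRegressionImputation repository is the canonical implementation of this idea; the sketch below is a simplified Python analogue (OLS only, toy data) of the column-by-column strategy just described.

```python
import numpy as np
import pandas as pd

def regression_impute(df: pd.DataFrame, target: str, k: int = 3) -> pd.Series:
    """Fill missing values of `target` using its k most-correlated columns."""
    # Rank the other columns by absolute correlation with the target.
    corr = df.corr()[target].drop(target).abs().nlargest(k)
    predictors = list(corr.index)

    # Fit ordinary least squares on rows where everything is observed.
    train = df.dropna(subset=[target] + predictors)
    X = np.column_stack([np.ones(len(train)), train[predictors].to_numpy()])
    beta, *_ = np.linalg.lstsq(X, train[target].to_numpy(), rcond=None)

    # Mean-fill predictor gaps so every missing row gets a prediction.
    filled = df[predictors].fillna(df[predictors].mean())
    missing = df[target].isna()
    X_miss = np.column_stack(
        [np.ones(int(missing.sum())), filled.loc[missing].to_numpy()]
    )
    out = df[target].copy()
    out[missing] = X_miss @ beta
    return out
```

Note the mean-filling of predictor gaps: to fill in missing data, one paradoxically needs to first fill in missing data, a point the authors return to below.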

Our approach draws on two forms of multiple-regression models, a simple linear ordinary-least squares regression, and a linear regression with lasso penalization. To evaluate their performance, we compare our approach to two alternative forms of imputation: a naive mean-based imputation, and imputation using the Amelia package. Holding constant the method we use to make predictions and the variables used, our regression-based approach outperforms mean imputation on the 3 categorical outcome variables: Eviction, Layoff, and Job Training. The Lasso imputation also outperforms Amelia on these variables, but the unpenalized regression imputation has mixed effects. Interestingly, mean imputation performs the best for GPA and Grit, and we saw a similar performance on Material Hardship using mean imputation, Amelia, and linear regression, but Lasso was significantly worse than the former approaches. Overall, even simple mean imputation performed better than using Amelia on this dataset.
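A generic way to run comparisons like the one above is to mask values that are actually observed, impute them, and score the reconstruction. This sketch is not the authors' exact evaluation, just an illustration of the masking idea on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fully observed toy target column plus one correlated helper column.
truth = rng.normal(size=200)
helper = truth + rng.normal(scale=0.3, size=200)

# Hide 20% of the target, then impute it two ways.
mask = rng.random(200) < 0.2
observed = truth.copy()
observed[mask] = np.nan

# Naive mean-based imputation.
mean_imputed = np.where(mask, np.nanmean(observed), observed)

# Simple regression on the helper column, fit on visible rows only.
slope, intercept = np.polyfit(helper[~mask], observed[~mask], 1)
reg_imputed = np.where(mask, slope * helper + intercept, observed)

def mse(est):
    # Score only the entries that were hidden.
    return float(np.mean((est[mask] - truth[mask]) ** 2))
```

Because the helper column carries real signal here, the regression imputation should score a lower MSE than the mean imputation on the masked entries.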

The approach we used comes with a number of assumptions:

  1. We assume that the best predictors of any given variable already exist in the Fragile Families dataset, and do not need significant processing. This is not an unreasonable assumption, as many variables in the dataset are collected across different waves, thus there may be predictable relationships between each wave.
  2. Our tests above assume a linear relationship between predictor variables and the variable we impute, although our code has an option to also take into account polynomial effects (the ‘degree’ option available when using method=’lasso’).
  3. To get complete predictions for all 4000 cases using the regression models, we needed to first impute means of the covariates used for the imputation. In other words, in order to fill in missing data, we paradoxically needed to first fill in missing data. FIML is one solution to this challenge, and we hope to see this make its way into predictive modelling approaches in languages like R or Python.

Our pipeline

We modularized our work into two separate repositories, following the division of labor described above.

For general data processing, ffc-data-processing, which

  1. Works from the background.dta Stata file to extract covariate information.
  2. Provides helper functions for relatively fast data transformation.

For missing data imputation, FFCRegressionImputation, which

  1. Prepares the raw background.csv data and performs a logical imputation of age-related variables as we describe above.
  2. Constructs a (correlation) matrix of strengths of relationships between a set of variables.
  3. Uses the matrix to perform a regression-based prediction to impute the likely value of a missing entry.

For a technical overview of how these two bodies of code integrate with each other, check out the integration vignette. The vignette is an RMarkdown file which can be run as-is or freely modified.

The code in the vignette subsets to constructed variables, identifies those variables as either categorical or continuous, and then only imputes missing values for the continuous variables, using regression-based imputation. We chose to restrict the variables imputed for illustrative purposes, and to improve the runtime of the vignette. Users of the code can and should employ some sort of imputation strategy—regression-based or otherwise—for the categorical variables before incorporating the covariates into a predictive model.


What seemed at the beginning to be a straightforward precursor to building predictive models turned out to have complexities and challenges of its own!

From our collaboration with others, it emerged that researchers from different fields perceive data problems very differently. A problem that might not seem important to a machine-learning researcher might strike a survey methodologist as critical to address. This kind of cross-disciplinary communication about expectations and challenges was productive and eye-opening.

In addition, the three of us came into this project with very different skillsets. We settled on R as a lingua franca, but drew on a much broader set of tools and techniques to tackle the problems posed by the Fragile Families Challenge. We would encourage researchers to explore all the programming tools at their disposal, from Stata to Python and beyond.

Finally, linking everyone’s efforts together into a single working pipeline that can be run end-to-end was a significant step by itself. Even with close communication, it took a great deal of creativity as well as clarity about desired inputs and outputs.

We hope that other participants in the Fragile Families Challenge find our tools and recommendations useful. We look forward to seeing how you can build on them!

Metadata about variables


We are happy to announce that Challenge participant Connor Gilroy, a Ph.D. student in Sociology at the University of Washington, has created a new resource that should make working with the Challenge data more efficient. More specifically, he created a csv file that identifies each variable in the Challenge data file as either categorical, continuous, or unknown. Connor has also open sourced the code that he used to create the csv file. We’ve had many requests for such a file, and Connor is happy to share his work with everyone! If you want to check and improve Connor’s work, please consult the official Fragile Families and Child Wellbeing Study documentation.

Connor’s resource is part of a tradition during the Challenge whereby people have open sourced resources to make the Challenge easier for others.

If you have something that you’d like to open source, please let us know.

Finally, Connor’s work was part of a larger team project at the Summer Institute in Computational Social Science to build a full data processing pipeline for the Fragile Families Challenge. Stay tuned for that blog post on Tuesday, July 18!

Call for papers, special issue of Socius about the Fragile Families Challenge


Socius Call for Papers
Special issue on the Fragile Families Challenge
Guest editors: Matthew J. Salganik and Sara McLanahan

Socius, an open access journal published by the American Sociological Association, will publish a special issue on the predictive modeling phase of the Fragile Families Challenge. All participants in the Fragile Families Challenge are invited to submit a manuscript to this special issue.

A strong manuscript for the special issue will describe the process of creating a submission to the Challenge and will describe what was learned during that process. For example, a strong manuscript will describe the different approaches that were considered for data preprocessing, variable selection, missing data, model selection, and any other steps involved in creating the final submission to the Challenge. Further, a strong manuscript will also describe how the authors decided among the many possible approaches. Finally, some manuscripts may seek to draw more general lessons about social inequality, families, the common task method, social science, data science, or computational social science. Manuscripts should be written in a style that is accessible to a general scientific audience.

The editors of the special issue may also consider other types of manuscripts that are consistent with the scientific goals of the Fragile Families Challenge. If you are considering submitting a manuscript different from what is described above, please contact the editors of the special issue before submitting.

All papers will be peer reviewed, and publication is not guaranteed. However, there is no limit on the number of articles that will be accepted in the special issue. All published papers must abide by the terms and conditions of the Fragile Families Challenge, and must be accompanied by open source code and a data file containing predictions.

Submissions for the special issue must be received through the Socius online submission platform by Monday, October 16, 2017 at 11:59pm ET (extended from the original deadline of Sunday, October 1, 2017). If you have any questions about the special issue, please email


  • Do I need to describe an approach to predicting all six outcome variables in order to submit to the special issue?
  • No. We will happily consider papers that focus on one specific outcome variable.

  • Do I need to have a low mean-squared error in order for my paper to be published?
  • No. Predictive performance in the held-out dataset is only part of what we will consider. For example, a paper that clearly shows that many common strategies were not very effective would be considered a valuable contribution.

  • What if I can’t afford the Article Processing Charge?
  • Socius, like most open access journals, has an Article Processing Charge. This charge is required to keep Socius running, and it is in line with the charges at other open access journals. However, we strongly believe that the Article Processing Charge should not be a barrier to scientific participation. Therefore, the Fragile Families Challenge project will pay the Article Processing Charge for all accepted articles submitted by everyone except for tenure-track (or equivalent) faculty working in universities in OECD countries. In other words, we will cover the Article Processing Charge for undergraduates, graduate students, post-docs, and people working outside of universities. Further, we will pay the Article Processing Charge for all tenure-track (or equivalent) faculty working in universities outside the OECD.

    If for any reason you think that the Article Processing Charge may be a barrier to your participation, please send us an email and we will try to find a solution:

  • How will you decide what manuscripts to accept for publication?
  • Articles in Socius are judged by four criteria: Accuracy, Novelty, Interest, and Presentation. In the case of this special issue, these criteria will be judged by the editors of the special issue, with feedback from reviewers and the editors of Socius. For the purposes of this special issue, here is how these criteria will be interpreted:

    • Accuracy: The key question is whether this analysis was conducted appropriately and accurately. Were the techniques used in the manuscript performed and interpreted correctly? Do the claims in the manuscript match the evidence provided?
    • Novelty: The key question is whether the manuscript will be novel to some social scientists or some data scientists. Because projects like the Fragile Families Challenge are not yet common, we expect that most submitted manuscripts will be somewhat novel.
    • Interest: The key question for the editors is whether the manuscript will be interesting to some social scientists or some data scientists. Will some people want to read this paper? Does it advance understanding of the Fragile Families Challenge and related intellectual domains?
    • Presentation: The key question is whether this manuscript communicates effectively to a diverse audience of social scientists and data scientists. We will also assess whether the figures and tables are presented clearly and whether the manuscript makes appropriate use of the opportunity for supporting online information. Because these manuscripts will be short, we expect that the supporting online information will play a key role.

  • Who is the audience for these papers?
  • All papers should be written for a general scientific audience that will include both social scientists and data scientists (broadly defined). In other words, when writing your paper you should imagine an audience similar to the audience at journals such as Science and Proceedings of the National Academy of Sciences (PNAS). We would recommend reading some articles from these journals to get a sense of this style. Manuscripts that use excessive jargon from a specific field will be asked to make revisions.

    Manuscripts should follow the length guidelines of a Report published in Science: 2,500 words, with up to 4 figures or tables. Additional materials should be included in supporting online materials. We will consider articles that deviate from these guidelines in some situations. Other aspects of the manuscript format will follow standard Socius rules.

  • Should we describe the Fragile Families Challenge in our paper?
  • No. There is no need to describe the Challenge in your paper. The special issue will have an introductory article describing the Challenge and data. You should assume that your readers will already have this background information.

  • Will the articles go through peer review?
  • Absolutely. All manuscripts will be reviewed by at least two people. Possible reviewers include: members of the board of the Fragile Families Challenge, qualified participants in the Challenge, members of the general reviewer pool at Socius, and other qualified researchers.

  • What are the requirements for the open source code?
  • The code must take the Fragile Families Challenge data files as an input and produce (1) all the figures and tables in your manuscript and supporting online materials and (2) your final predictions. The code can be written in any language (e.g., R, Stata, Python). The code should be released under the MIT license, but we will consider other permissive licenses in special situations.

  • How long will the review process take?
  • We don’t know exactly, but we are excited to get these results into the scientific literature quickly. Therefore, we will move as fast as we can while maintaining the quality standards of the Fragile Families Challenge and Socius.

  • Will I have access to the holdout data when writing my paper? (added July 20, 2017)
  • No, but we will allow you to request scores for your models on the holdout data, as described in this blog post.

  • Will I have access to the Challenge data when writing my paper? (added July 27, 2017)
  • Yes. If you plan to submit to the Special Issue, you can continue to use the Challenge data until the Special Issue is published. If you are not submitting to the Special Issue, then you should delete the Challenge data file on August 1. Finally, participants who want to continue to do non-Challenge related research with the Fragile Families and Child Wellbeing Study can, at any time, apply for access to the core Fragile Families data by following the instructions here:

  • What if I want to submit to the special issue but I can’t exactly reproduce my submission to the Challenge? (added September 23, 2017)
  • Everyone in the Challenge was supposed to upload their code. But there are several reasons why participants might not be able to use their code to reproduce their submission exactly, such as forgetting to set a random seed or changes to the packages that were used in the submission. (If you are interested, here are some general tips for promoting reproducibility: Sandve et al. (2013), “Ten Simple Rules for Reproducible Computational Research.”)

    If there is a tension between making your paper reproducible and making it match the submission to the Challenge exactly, you should opt to make your paper reproducible. If the code and predictions that you submit with your paper don’t exactly match what you submitted to the Challenge, you should include a note in your supporting online material explaining these differences and why they occurred. If this note will require additional information from us—such as the score of your reproducible results in the leaderboard data—we will provide it to you. We are happy to help you with these issues on a case-by-case basis.

  • I have another question, how can I ask it?
  • Send us an email:

getting started workshop, Princeton and livestream

Uncategorized No comments

We will be hosting a getting started workshop at Princeton on Friday, June 23rd from 10:30am to 4pm. This workshop will also be livestreamed at this link so even if you can’t make it to Princeton you can still participate.

During the workshop we will

  • Provide a 45 minute introduction to the Challenge and the data (slides)
  • Provide food and a friendly collaborative environment
  • Work together to produce your first submission

In addition to people just getting started, we think the workshop will be helpful for people who have already been working on the Challenge and who want to improve their submission. We will be there to answer questions both in person and through Google Hangouts during the entire event.


  • When: Friday, June 23rd from 10:30am to 4pm ET
  • Where: Julis Romo Rabinowitz Building, Room 399 and streaming here
  • RSVP: If you have not already applied to the Challenge, please mention the getting started workshop in your application. If you have already applied, please let us know that you plan to attend. We are going to provide lunch for all participants, and we need to know how much food to order.
  • This getting started workshop will be a part of the Summer Institute for Computational Social Science.

Using .dta files in R

Uncategorized No comments
featured image

We’ve just released the Fragile Families Challenge data in .dta format, which means the files now include metadata that was not available in the .csv files that we initially released. The .dta format is native to Stata, and you might prefer to use R. So, in this post, I’ll give some pointers to getting up and running with the .dta file in R. If you have questions—and suggestions—please feel free to post them at the bottom of this post.

There are many ways to read .dta files into R. In this post I’ll use haven because it is part of the tidyverse.

Here’s how you can read in the .dta files (and I’ll read in the .csv file too so that we can compare them):

library(haven)  # for read_dta
library(readr)  # for read_csv

ffc.stata <- read_dta(file = "background.dta")
ffc.csv <- read_csv(file = "background.csv")

Once you start working with ffc.stata, one thing you will notice is that many columns are of type labelled, which is not common in R. To convert labelled columns to factors, use as_factor (not as.factor). For more on labelled and as_factor, see the documentation of haven.

Another thing you will notice is that some of the missing data codes from the Stata file don’t get converted to NA. For example, consider the variable "m1b9b11" for the person with challengeID 1104. This is a missing value that should be NA. It gets parsed correctly in the csv file but not the Stata file:

is.na(ffc.stata[(ffc.stata$challengeid == 1104), "m1b9b11"])
is.na(ffc.csv[(ffc.csv$challengeID == 1104), "m1b9b11"])

If you have questions—and suggestions—about working with .dta files in R, please feel free to post them below.


  • The read_dta function in haven is a wrapper around the ReadStat C library.
  • The read.dta function in the foreign library was popular in the past, but that function is now frozen and will not support anything after Stata 12.
  • Another way to read .dta files into R is the readstata13 package, which, despite what the name suggests, can read Stata 13 and Stata 14 files.

Machine-Readable Fragile Families Codebook

Uncategorized No comments

The Fragile Families and Child Wellbeing study has been running for more than 15 years. As such, it has produced an incredibly rich and complex set of documentation and codebooks. Much of this documentation was designed to be “human readable,” but, over the course of the Fragile Families Challenge, we have had several requests for a more “machine-readable” version of the documentation. Therefore, we are happy to announce that Greg Gundersen, a student in Princeton’s COS 424 (Barbara Engelhardt’s undergraduate machine learning class), has created a machine-readable version of the Fragile Families codebook in the form of a web API. We believe that this new form of documentation will make it possible for researchers to work with the data in unexpected and exciting ways.

There are three ways that you can interact with the documentation through this API.

First, you can search for words inside the question description field. For example, imagine that you are looking for all the questions that include the word “evicted”. You can find them by visiting this URL:

Just put your search term after the “q” in the above URL.

The second main way that you can interact with the new documentation is by looking up the question data associated with a variable name. For example, want to know what “cm2relf” is? Just visit:

Finally, if you just want all of the questionnaire data, visit this URL:

A main benefit of a web API is that researchers can now interact with the codebooks programmatically through URLs. For example, here is a snippet of Python 2 code that fetches the data for question “cm2mint”:

>>> import urllib2
>>> import json
>>> response = urllib2.urlopen('')
>>> data = json.load(response)
>>> data
[{u'source file': u'', u'code': u'cm2mint', u'description': u'Constructed - Was mother interviewed at 1-year follow-up?', u'missing': u'0/4898', u'label': u'YESNO8_mw2', u'range': [0, 1], u'unique values': 2, u'units': u'1', u'type': u'numeric (byte)'}]

We are very grateful to Greg for creating this new documentation and sharing it with everyone.


  • Greg has open sourced all his code, so you can help us improve the codebook. For example, someone could write a nice front-end so that you can do more than just interact via the URL.
  • The machine-readable documentation should include the following fields: description, source file, type, label, range, units, unique values, missing. If you develop code that can parse some of the missing fields, please let us know, and we can integrate your work into the API.
  • The machine-readable documentation includes all the documentation that was in text files (e.g., It does not include documentation that was in pdf format (e.g.,
  • When you visit these urls, what gets returned is in JSON format, and different browsers render this JSON differently.
  • If there is a discrepancy between the machine-readable codebook and the traditional codebook, please let us know.
  • To deploy this service we used Flask, which is an open source project. Thank you to the Flask community.

Progress prizes

Uncategorized 11 comments
featured image

We were glad to receive many submissions in time for the progress prizes! As described below, we have downloaded these submissions and look forward to evaluating them and determining the best submissions at the end of the Challenge.

We are excited to announce that progress prizes will be given based on the best-performing models as of Wednesday, May 10, 2017 at 2pm Eastern time. We will not announce the winners, however, until after the Challenge is complete.

Here’s how it will work. On May 10, 2017 at 2pm Eastern time, we will download all the submissions on the leaderboard. However, we will not calculate which submission has the lowest error on the held-out test data until after the Challenge is complete. The reason for this delay is that we don’t want to reveal any information at all about the held-out test data until after the Challenge is over.

From the submissions that we have received by May 10, 2017 at 2pm Eastern Time, we will pick the ones that have the lowest mean-squared error on the held-out test data for each of the six outcome variables. In other words, there will be one prize for the submission that performs best for grit, and there will be another prize for the submission that performs best for grade point average, and so on.

All prize winners will be invited to participate in the post-Challenge scientific workshop at Princeton University, and we will cover all travel expenses for invited participants. If the prize-winning submission is created by a team, we will cover all travel expenses for one representative from that team.

We look forward to seeing the submissions.

upload your contribution

Uncategorized No comments
featured image

This post will walk you through the steps to prepare your files for submission and upload them to our submission platform.

1. Save your predictions as prediction.csv.

This file should be structured the same way as the “prediction.csv” file provided as part of your data bundle.

This file should have 4,242 rows: one for each observation in the test set.

We are asking you to make predictions for all 4,242 cases, which includes both the training cases from train.csv and the held-out test cases. We would prefer that you not simply copy these cases from train.csv to prediction.csv. Instead, please submit the predictions that come out of your model. This way, we can compare your performance on the training and test sets, to see whether those who fit closely to the training set perform more poorly on the test set (see our blog discussing overfitting). Your scores will be determined on the basis of test observations alone, so your predictions for the cases included in train.csv will not affect your score.
There are some observations that are truly missing: we do not have the true answer for these cases because respondents did not complete the interview or did not answer the question. This is true for both the training and the test sets. Your predictions for these cases will not affect your scores. We are asking you to make predictions for missing cases because it is possible that we will find those respondents sometime in the future and uncover the truth. It will be scientifically interesting to know how well the community model was able to predict these outcomes which even the survey staff did not know at the time of the Challenge.

This file should have 7 columns for the ID number and the 6 outcomes. They should be named:

challengeID, gpa, grit, materialHardship, eviction, layoff, jobTraining

The top of the file will look like this (numbers here are random). challengeID numbers can be in any order.
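As a hedged sketch of this format (Python, standard library only; the numbers are placeholders, and a real submission needs one row for each of the 4,242 children), the file can be written like this:

```python
import csv

# Required columns for prediction.csv: the ID plus the six outcomes.
header = ["challengeID", "gpa", "grit", "materialHardship",
          "eviction", "layoff", "jobTraining"]

# Placeholder predictions for three children; challengeID values
# may appear in any order.
rows = [
    [1, 2.9, 3.4, 0.10, 0.05, 0.20, 0.30],
    [2, 3.1, 3.2, 0.30, 0.10, 0.15, 0.25],
    [3, 2.7, 3.6, 0.20, 0.08, 0.25, 0.35],
]

with open("prediction.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)   # one header row
    writer.writerows(rows)    # then one row per child
```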


2. Save your code.

3. Create a narrative explanation of your study. This should be saved in a file called “narrative” and can be a text file, PDF, or Word document.

At the top of this narrative explanation, tell us the names of everyone on the team that produced the submission, or your name if you worked alone, in the format:

Homer Simpson,

Marge Simpson,

Then, tell us about how you developed the submission. This might include your process for preparing the data for analysis, methods you used in the analysis, how you chose the submission you settled on, things you learned, etc.

4. Zip all the files together in one folder.

It is important that the files be zipped in a folder with no sub-directories. Instructions are different for Mac and Windows.

On Mac, highlight all of the individual files.

Right click and choose “Compress 3 items”.

On Windows, highlight all of the individual files.

Right click and choose
Send to -> Compressed (zipped) folder

5. Upload the zipped folder to the submission site.

Click the “Participate” tab at the top, then the “Submit / View Results” tab on the left. Click the “Submit” button to upload your submission.

6. Wait for the platform to evaluate your submission.

Click “Refresh status” next to your latest submission to view its updated status and see results when they are ready. If successful, you will automatically be placed on the leaderboard when evaluation finishes.

build a model

Uncategorized No comments
featured image

Take our data and build models for the 6 child outcomes at age 15. Your model might draw on social science theories about variables that affect the outcomes. It might be a black-box machine learning algorithm that is hard to interpret but performs well. Perhaps your model is some entirely new combination no one has ever seen before!

The power of the Fragile Families Challenge comes from the heterogeneity of the high-quality individual models we receive. By working together, we will harness the best of a range of modeling approaches. Be creative and show us how well your model can perform!

There are missing values. What do I do?

See our blog post on missing data.

What if I have several ideas?

You can try them all and then choose the best one! Our submission platform allows you to upload up to 10 submissions per day. Submissions will instantly be scored, and your most recent submission will be placed on the leaderboard. If you have several ideas, we suggest you upload them each individually and then upload a final submission based on the results of the individual submissions.

What if I don’t have time to make 6 models?

You can make predictions for whichever outcome interests you. To upload a submission with the appropriate file size, make a simple numeric guess for the rest of the outcomes. For instance, you might develop a careful model for grit, and then guess the mean of the training values for each of the remaining five outcomes. This would still allow you to upload 6 sets of predictions to the scoring function.
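As a minimal sketch of that fallback (Python; the observed values below are invented for illustration, not real Challenge data):

```python
import statistics

# Suppose you modeled grit carefully but not the other five outcomes.
# A simple filler for an unmodeled outcome is the mean of its observed
# training values (made-up numbers here).
observed_gpa_in_train = [2.5, 3.0, 3.5, 2.0]
fill_value = statistics.mean(observed_gpa_in_train)

# Use the same fill value as the prediction for every child on that outcome.
predictions = {challenge_id: fill_value for challenge_id in [1, 2, 3]}
```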

Apply to participate

Uncategorized No comments

The Fragile Families Challenge is now closed. We are no longer accepting applications!

What will happen after I apply?

We will review your application and be in touch by e-mail. This will likely take 2-3 business days. If we invite you to participate, you will be asked to sign a data protection agreement. Ultimately, each participant will be given a zipped folder which consolidates all of the relevant pieces of the larger Fragile Families and Child Wellbeing Study in three .csv files.

background.csv contains 4,242 rows (one per child) and 12,943 columns:

  • challengeID: A unique numeric identifier for each child.
  • 12,942 background variables asked from birth to age 9, which you may use in building your model.

train.csv contains 2,121 rows (one per child in the training set) and 7 columns:

  • challengeID: A unique numeric identifier for each child.
  • Six outcome variables (each variable name links to a blog post about that variable)
    1. Continuous variables: grit, gpa, materialHardship
    2. Binary variables: eviction, layoff, jobTraining

prediction.csv contains 4,242 rows and 7 columns:

  • challengeID: A unique numeric identifier for each child.
  • Six outcome variables, as in train.csv. These are filled with the mean value in the training set. This file is provided as a skeleton for your submission; you will submit a file in exactly this form but with your predictions for all 4,242 children included.
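If you want to confirm that your download matches these descriptions, here is a small sanity-check sketch (Python, standard library only; assumes the three files are in your working directory):

```python
import csv

def csv_shape(path):
    """Return (number of data rows, number of columns) for a csv file."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    return len(rows) - 1, len(rows[0])  # subtract 1 for the header row

# Per the documentation above, you would expect:
#   csv_shape("background.csv") == (4242, 12943)
#   csv_shape("train.csv")      == (2121, 7)
#   csv_shape("prediction.csv") == (4242, 7)
```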

Understanding the background variables

To use the data, it may be useful to know something about what each variable (column) represents. Full documentation is available here, but this blog post distills the key points.

Waves and child ages

The background variables were collected in 5 waves.

  • Wave 1: Collected in the hospital at the child’s birth.
  • Wave 2: Collected at approximately child age 1
  • Wave 3: Collected at approximately child age 3
  • Wave 4: Collected at approximately child age 5
  • Wave 5: Collected at approximately child age 9

Note that wave numbers are not the same as child ages. The variable names and survey documentation are organized by wave number.

Variable naming conventions

Predictor variables are identified by a prefix and a question number. The prefix identifies the survey in which a question was asked, which is useful because the documentation is organized by survey. For instance, the variable m1a4 refers to the mother interview in wave 1, question a4.

  1. The prefix c in front of any variable indicates variables constructed from other responses. For instance, cm4b_age is constructed from the mother wave 4 interview, and captures the child’s age (baby’s age).
  2. m1, m2, m3, m4, m5: Questions asked of the child’s mother in waves 1 through 5.
  3. f1, ..., f5: Questions asked of the child’s father in waves 1 through 5.
  4. hv3, hv4, hv5: Questions asked in the home visit in waves 3, 4, and 5.
  5. p5: Questions asked of the primary caregiver in wave 5.
  6. k5: Questions asked of the child (kid) in wave 5.
  7. ffcc: Questions asked in various child care provider surveys in wave 3.
  8. kind: Questions asked of the kindergarten teacher.
  9. t5: Questions asked of the teacher in wave 5.
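The conventions above are regular enough to parse programmatically. Here is an illustrative helper (an assumption of ours, not part of the Challenge materials) that splits a variable name into its constructed flag, respondent prefix, wave, and question number:

```python
import re

# Longer prefixes (ffcc, kind, hv) must come before their one-letter
# cousins in the alternation, or "kind" would match as k + "ind".
_PREFIX = re.compile(r"^(c?)(ffcc|kind|hv|m|f|p|k|t)(\d*)(.*)$")

def parse_variable(name):
    """Split e.g. 'm1a4' into (constructed?, prefix, wave, question)."""
    m = _PREFIX.match(name)
    if m is None:
        return None
    constructed, prefix, wave, question = m.groups()
    return (constructed == "c", prefix, wave, question)
```

For example, parse_variable("m1a4") identifies a mother wave-1 question, while parse_variable("cm4b_age") flags a variable constructed from the mother wave-4 interview.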

Ready to work with the data?

See our posts on building a model and working with missing data.