Month June 2017

Month June 2017

Call for papers, special issue of Socius about the Fragile Families Challenge

Uncategorized No comments

Socius Call for Papers
Special issue on the Fragile Families Challenge
Guest editors: Matthew J. Salganik and Sara McLanahan

Socius, an open access journal published by the American Sociological Association, will publish a special issue on the predictive modeling phase of the Fragile Families Challenge. All participants in the Fragile Families Challenge are invited to submit a manuscript to this special issue.

A strong manuscript for the special issue will describe the process of creating a submission to the Challenge and will describe what was learned during that process. For example, a strong manuscript will describe the different approaches that were considered for data preprocessing, variable selection, missing data, model selection, and any other steps involved in creating the final submission to the Challenge. Further, a strong manuscript will also describe how the authors decided among the many possible approaches. Finally, some manuscripts may seek to draw more general lessons about social inequality, families, the common task method, social science, data science, or computational social science. Manuscript should be written in a style that is accessible to a general scientific audience.

The editors of the special issue may also consider other types of manuscripts that are consistent with the scientific goals of the Fragile Families Challenge. If you are considering submission a manuscript different from what is described above, please contact the editors of the special issue at before submitting your manuscript.

All papers will be peer reviewed, and publication is not guaranteed. However, there is no limit on the number of articles that will be accepted in the special issue. All published papers must abide by the terms and conditions of the Fragile Families Challenge, and must be accompanied by open source code and a data file containing predictions.

Submissions for the special issue must be received through the Socius online submission platform by Sunday, October 1, 2017 Monday, October 16 at 11:59pm ET. If you have any questions about the special issue, please email


  • Do I need to describe an approach to predicting all six outcome variables in order to submit to the special issue?
  • No. We will happily consider papers that focus on one specific outcome variable.

  • Do I need to have a low mean-squared error in order for my paper to be published?
  • No. Predictive performance in the held-out dataset is only part of what we will consider. For example, a paper that clearly shows that many common strategies were not very effective would be considered a valuable contribution.

  • What if I can’t afford the Article Processing Charge?
  • Socius, like most open access journals, has an Article Processing Charge. This charge is required to keep Socius running, and it is in line with the charges at other open access journals. However, we strongly believe that the Article Processing Charge should not be a barrier to scientific participation. Therefore, the Fragile Families Challenge project will pay the Article Processing Charge for all accepted articles submitted by everyone except for tenure-track (or equivalent) faculty working in universities in OECD countries. In other words, we will cover the Article Processing Charge for undergraduates, graduate students, post-docs, and people working outside of universities. Further, we will pay the Article Processing Charge for all tenure-track (or equivalent) faculty working in universities outside the OECD.

    If for any reason you think that the Article Processing Charge may be a barrier to your participation, please send us an email and we will try to find a solution:

  • How will you decide what manuscripts to accept for publication?
  • Articles in Socius are judged by four criteria: Accuracy, Novelty, Interest, and Presentation. In the case of this special issue, these criteria will be judged by the editors of the special issue, with feedback from reviewers and the editors of Socius. For the purposes of this special issue, here is how these criteria will be interpreted:

    • Accuracy: The key question is whether this analysis was conducted appropriately and accurately. Were the techniques used in the manuscript performed and interpreted correctly? Do the claims in the manuscript match the evidence provided?
    • Novelty: The key question is whether the manuscript will be novel to some social scientists or some data scientists. Because projects like the Fragile Families Challenge are not yet common, we expect that most submitted manuscripts will be somewhat novel.
    • Interest: The key question for the editors is whether the manuscript will be interesting to some social scientists or some data scientists. Will some people want to read this paper? Does it advance understanding of the Fragile Families Challenge and related intellectual domains?
    • Presentation: The key question is whether this manuscript communicates effectively to a diverse audience of social scientist and data scientists. We will also assess whether the figures and tables are presented clearly and whether the manuscript makes appropriate use of the opportunity for supporting online information. Because these manuscripts will be short, we expect that the supporting online information will play a key role.

  • Who is the audience for these papers?
  • All papers should be written for a general scientific audience that will include both social scientists and data scientists (broadly defined). In other words, when writing your paper you should imagine an audience similar to the audience at journals such as Science and Proceedings of the National Academies of Sciences (PNAS). We would recommend reading some articles from these journals to get a sense of this style. Manuscripts that use excessive jargon from a specific field will be asked to make revisions.

    Manuscripts should follow the length guidelines of a Report published in Science: 2,500 words, with up to 4 figures or tables. Additional materials should be included in supporting online materials. We will consider articles that deviate from these guidelines in some situations. Other aspects of the manuscript format will follow standard Socius rules.

  • Should we describe the Fragile Families Challenge in our paper?
  • No. There is no need to describe the Challenge in your paper. The special issue will have an introductory article describing the Challenge and data. You should assume that your readers will already have this background information.

  • Will the articles go through peer review?
  • Absolutely. All manuscripts will be reviewed by at least two people. Possible reviewers include: members of the board of the Fragile Families Challenge, qualified participants in the Challenge, members of the general reviewer pool at Socius, and other qualified researchers.

  • What are the requirements for the open source code?
  • The code must take the Fragile Families Challenge data files as an input and produce (1) all the figures and tables in your manuscript and supporting online materials and (2) your final predictions. The code can be written in any language (e.g., R, stata, Python). The code should be released under the MIT license, but we will consider other permissive licenses in special situations.

  • How long will the review process take?
  • We don’t know exactly, but we are excited about having these results in the scientific literature as quickly as possible. Therefore, we will work as quickly as possible while maintaining the quality standards of the Fragile Families Challenge and Socius.

  • Will I have access to the holdout data when writing my paper? (added July 20, 2017)
  • No, but we will allow you to request scores for your models on the holdout as described in this blog post.

  • Will I have access to the Challenge data when writing my paper? (added July 27, 2017)
  • Yes. If you will submit to the Special Issue you can continue to use the Challenge data until the Special Issue is published. If you are not submitting to the Special Issue, then you should delete the Challenge data file on August 1. Finally, participants who want to continue to do non-Challenge related research with the Fragile Families and Child Wellbeing Study can, at any time, apply for access to the core Fragile Families data by following the instructions here:

  • What if I want to submit to the special issue but I can’t exactly reproduce my submission to the Challenge? (added September 23, 2017)
  • Everyone in the Challenge was supposed to uploaded their code. But there are several reasons why they might not be able to use their code to reproduce their submission such as forgetting to set their seed or changes to the packages that were used in the submission (if you are interested, here are some general tips for promoting reproducibility Sandve et al (2013) “Ten Simple Rules for Reproducible Computational Research.”)

    If there is a tension between making your paper reproducible and making it match the submission to the Challenge exactly, you should opt to make your paper reproducible. If the code and predictions that you submit with your paper don’t exactly match what you submitted to the Challenge, you should include a note in your supporting online material explaining these differences and why they occurred. If this note will require addition information from us—such as the score of your reproducible results in the leaderboard data—we will provide it to you. We are happy to help you with these issues on a case-by-case basis.

  • I have another question, how can I ask it?
  • Send us an email:

Helpful idea: Read prior research

Uncategorized No comments
featured image

Not an expert in child development, poverty, or family sociology? Participants often wonder how they can contribute if they have no prior knowledge of these fields. Luckily, there are a few resources to bring you up to speed quickly!

Fact sheet

The Fragile Families and Child Wellbeing Study (FFCWS) Fact Sheet can quickly introduce the key findings from the broader FFCWS. For instance, the study discovered that “single” parenthood is a bit of a misnomer; about half of the unmarried parents in the sample were actually living together when the child was born! Yet many of these couples subsequently separated.

Research briefs

Looking for mored detailed information on a particular subfield? The Fragile Families Research Briefs provide accessible summaries of cutting edge research using the data.

Publication collection

Want to know how social scientists are using the data right now? The Fragile Families publication collection lists hundreds of published articles and working papers using the Fragile Families and Child Wellbeing Study. If you want to see how social scientists have used the data and get ideas for variables you may want to include in your models, the publication collection is a good place to start.

Other publications

A more exhaustive list of published resources is available here.

Helpful ideas series

This is the first in a series of blog posts with helpful ideas to help you build better models – look for more to come soon! For email notifications when we make new posts, subscribe in the box at the top right of this page.

Getting started quickly in the Fragile Families Challenge

Uncategorized No comments
featured image

Want to build your first submission to the Fragile Families Challenge in an hour? In this post, we’ll tell you the trick to getting started quickly: the constructed variables.

If you’ve never worked with the Fragile Families data before it can seem daunting. The background file contains 12,943 variables (columns) for 4,242 children (rows), but 56% of the cells in this matrix are missing! Participants often begin by trying to read all the documentation, clean all of the variables, and impute reasonable values for the missing cells. This quickly becomes demoralizing. What else can you do?

Our overall recommendation is to begin with the constructed variables. These 600 variables were “constructed” by the Fragile Families research staff in order to help future researchers, and they were constructed based on multiple reports in order to reduce missing data. For example, the variable cm1relf consolidates the key information from 5 questions asked of the mother about her relationship with the father at the birth of the child. The constructed variables are a great place to start because they:

  • represent constructs social scientists believe to be important
  • have very little missing data
  • are easy to identify because they begin with the letter c (i.e. cm1ethrace is constructed wave 1 mother’s ethnicity and race)
    • There are a small number of exceptions to this convention. For instance, the variable t5tint is a constructed variable indicating whether the teacher was interviewed in wave 5. However, the vast majority of constructed variables begin with c.
    • When we say that constructed variables have little missing data, this statement is restricted to constructed variables that have some data all. In other words, there are some constructed variables are all NA in the Challenge file (e.g., cm1tdiff).

These constructed variables are more fully documented on p. 13-20 of the general study documentation. Further, they are also summarized in this participant-generated open-source dictionary.

A good strategy to get started quickly is to pick some constructed variables, build a very simple model, and get yourself on the leaderboard! You can always build up from there. Participants often begin with cm1ethrace, cf1ethrace, cm1edu, cf1edu, and cm1relf.

Even if you start with the constructed variables, you will be frustrated by missing data. As summarized in our blog post, there is no perfect solution to this problem. We recommend the following workflow:

  1. Start with a small fraction of the total variables. Focus on imputing the missing values for this subset, rather than for all variables in the entire file.
  2. Decide how to address informative missing values (i.e. -6, valid skip). For categorical variables, you might treat valid skips as their own category.
  3. Impute remaining missing values with mean or median imputation. We know that mean or median imputation aren’t great, but they are a reasonable starting point, and you can move to model-based imputation later.
  4. Fit models on your imputed dataset.

Constructed variables – data dictionary

Uncategorized No comments
featured image

We are happy to announce that Challenge participants Aarshay Jain, Bindia Kalra, and Keerti Agrawal at Columbia University have created a new resource that should make working the Challenge data more efficient. More specifically, they created an alternative data dictionary for the constructed variables (FFC_Data_Dictionary.xlsx). They have made it available open-source here.

Their dictionary:

  • Summarizes constructed variable prefixes and suffixes
  • Categorizes questions by the respondent to and subject of the question
  • Provides examples of questions from a variety of substantive categories

As discussed in our blog post on getting started quickly, the constructed variables are a good place to start when choosing variables to include in your model. These variables are summarized on p. 13-20 of the general study documentation.

The official Fragile Families and Child Wellbeing Study site is still the authoritative source of documentation, but we hope this open source contribution helps you more quickly understand the variables available and how to find them.

The open-source movement is exciting because it unlocks the power of what we can do by collaboration. Much like a Wikipedia page benefits when hundreds of people view it and think about improvements they could make, so too will the open-source resources for the Fragile Families Challenge shine if others get involved when they think of possible improvements. If you think you can make this data dictionary better, please jump in, open-source your new version, and let us know so we can publicize it! In fact, Aarshay, Bindia, and Keerti would love to see these kind of improvements. Likewise, we welcome any other open-source contributions that you think might make the Challenge better.

Many thanks to Aarshay, Bindia, and Keerti for making it easier for others to use the data!

getting started workshop, Princeton and livestream

Uncategorized No comments

We will be hosting a getting started workshop at Princeton on Friday, June 23rd from 10:30am to 4pm. This workshop will also be livestreamed at this link so even if you can’t make it to Princeton you can still participate.

During the workshop we will

  • Provide a 45 minute introduction to the Challenge and the data (slides)
  • Provide food and a friendly collaborative environment
  • Work together to produce your first submission

In addition to people just getting started, we think the workshop will be helpful for people who have already been working on the Challenge and who want to improve their submission. We will be there to answer questions both in person and through Google Hangouts during the entire event.


  • When: Friday, June 23rd from 10:30 to 4pm ET
  • Where: Julis Romo Rabinowitz Building, Room 399 and streaming here
  • RSVP: If you have not already applied to the Challenge, please mention the getting started workshop in your application. If you have already applied, please let us know that you plan to attend ( We are going to provide lunch for all participants, and we need to know how much food to order.
  • This getting started workshop will be a part of the Summer Institute for Computational Social Science.

Getting started with Stata

Uncategorized No comments
featured image

This post summarizes how to work on the Fragile Families Challenge data in Stata.

We only cover the basics here. For more detailed example code, see our open-source repository, thanks to Jeremy Freese.

How do I import the data?

Before loading the data, you may need to increase the number of variables Stata will hold.
set maxvar 13000

Then, change your working directory to the place where the file is located, using
cd your_directory.

Load the training outcomes
import delimited train.csv, clear case(preserve) numericcols(_all)
Two options there are critical:

  • The case(preserve) option ensures that the case of variable names is preserved. Omitting this option will produce errors in your submission since capitalization in variable names is required (i.e. challengeID), but Stata’s default makes all variable names lower case.
  • The numericcols(_all) option ensures that the outcomes are read as numeric,
    rather than as character strings.

Merge the background variables to that file using the challengeID identifier.
merge 1:1 challengeID using background.dta

  • You will see that 2,121 observations were in both datasets. These are the training observations for which we are providing the age 15 outcomes.
  • You will also see that 2,121 observations were only in the using file, since the background variables but not the outcomes are available for these cases. These are the test cases on which your predictions will be evaluated.

If you have an older version of Stata, you may not be able to open the .dta file with metadata. You can still load the background file from the .csv format. To do that, you should first load the .csv file and save it in a .dta format you can use. Then, follow the instructions above.
import delimited background.csv, clear case(preserve)
save background.dta, replace

Again, note the important case(preserve) option!

How do I make predictions?

If your model is linear or logistic regression, then you can use the predict function.
regress gpa your_predictors
predict pred_gpa, replace

Then the variable gpa_pred has your predictions for GPA. You can do this for all 6 outcomes.

How do I export my submission?

This section assumes your predicted values are named pred_gpa, pred_grit, etc. First, select only the identifier and the predictions.
keep challengeID pred_*
Then, rename all your predictions to not have the prefix pred_
local outcomes gpa grit materialHardship eviction layoff jobTraining
foreach outcome of local outcomes {
rename pred_`outcome' `outcome'

Finally, export the prediction file as a .csv.
export delimited using prediction.csv, replace
Finally, bundle this with your code and narrative description as described in the blog post on uploading your contribution!