Author akindel

Author akindel

Improving metadata infrastructure for complex surveys

Uncategorized No comments

Anyone who uses survey data for research purposes knows how important metadata is for developing an understanding of a dataset’s structure and meaning. One of the big things we learned from organizing the Challenge is that machine learning methods place an extraordinary demand on metadata. Using 10k variables in a single model requires new ways of reading and using metadata to accomplish necessary data preparation tasks, and many of these tasks are not easily accomplished using the metadata infrastructure that is most commonly available in the social sciences (e.g. PDF codebooks).

To summarize how we’ve tried to improve these resources and what we learned as we undertook our redesign, we wrote a paper that will appear in a forthcoming special issue of Socius about the Fragile Families Challenge. We provide a link to the paper (on SocArXiv) as well as its abstract below; any comments or questions are most welcome!

Improving metadata infrastructure for complex surveys: 
Insights from the Fragile Families Challenge
Abstract: Researchers rely on metadata systems to prepare data for analysis. As the complexity of datasets increases and the breadth of data analysis practices grow, existing metadata systems can limit the efficiency and quality of data preparation. This article describes the redesign of a metadata system supporting the Fragile Families and Child Wellbeing Study based on the experiences of participants in the Fragile Families Challenge. We demonstrate how treating metadata as data—that is, releasing comprehensive information about variables in a format amenable to both automated and manual processing—can make the task of data preparation less arduous and less error-prone for all types of data analysis. We hope that our work will facilitate new applications of machine learning methods to longitudinal surveys and inspire research on data preparation in the social sciences. We have open-sourced the tools we created so that others can use and improve them.

Understanding your score on the holdout set

Uncategorized No comments

We were excited to release the holdout scores and announce prize winners for the Fragile Families Challenge. Our guess is that some people were pleasantly surprised by their scores and that some people were disappointed. In this post, we provide more information about how we constructed the training, leaderboard, and holdout sets, and some advice for thinking about your score. Also, if you plan to submit to the special issue of Socius—and you should—you can request scores for more than just your final submission.

Constructing the training, leaderboard, and holdout set

In order to understand your score, it is helpful to know a bit more about how we constructed the training, leaderboard, and holdout sets. We split the data into three sets: 4/8 training, 1/8 leaderboard, and 3/8 holdout.

In order to make each dataset as similar as possible, we selected them using a form of systematic sampling. We sorted observations by city of birth, mother’s relationship with the father at birth (cm1relf), mother’s race (cm1ethrace), whether at least 1 outcome is available at age 15, and the outcomes at age 15 (in this order): eviction, layoff of a caregiver, job training of a caregiver, GPA, grit, and material hardship. Once observations were sorted, we moved down the list in groups of 8 observations at a time and, for each group, randomly selected 4 observations to be in the training set, 1 to be in the leaderboard set, and 3 to be in the holdout set. This systematic sampling helped reduce the chance of us getting a “bad draw” whereby the datasets would differ substantially due to random chance.

All three datasets—training, leaderboard, holdout—include cases for which no age 15 outcomes have been collected yet. We decided to include these cases because data might be collected from them in the future and for some methodological research it might be interesting to compare predictions even if the truth is not known.

For the cases with no outcome data in the leaderboard set—but not the training and holdout sets—we added random imputed outcome data. We did this by randomly sampling outcomes with replacement from the observed outcomes in the leaderboard set. For example, the leaderboard included 304 observed cases for GPA and 226 missing cases imputed by random sampling with replacement from the observed cases.

Randomly imputing outcome data is a bit unusual. Our main reason for setting up the leaderboard this way was to develop a method for assessing model overfitting without opening the holdout set. In scientific challenges like the Fragile Families Challenge, participants can continuously improve their leaderboard scores over time, providing the appearance of real progress in constructing a good model. But, when assessed with the holdout set, that progress turns out to be an illusion: the final score is much worse than expected. This scenario is what happens when participants overfit their models to the leaderboard data. Because of this property, the leaderboard is a bad measure of progress: it misleads participants about the quality of their models. So, when calculating leaderboard score we used both real outcome data and randomly imputed outcome data. The imputed subset is randomly drawn, which means that score improvement on those observations over time is a clear indicator of overfitting to the leaderboard set. By disaggregating leaderboard scores into a real data score and an imputed data score behind the scenes, we were able to model how well participant submissions would generalize without looking at the holdout set.

If you would like to learn more about the problems with leaderboard scores, Moritz Hardt, a member of our Board of Advisors, has a paper on this problem:

Interpreting your score on the holdout set

You might be pleasantly surprised that your score on the holdout set was better than your score on the leaderboard set, or you might be disappointed that it was worse. Here are a few reasons that they might be different:

Overfitting the leaderboard: One reason that your performance might be worse on the holdout set is overfitting to the leaderboard. If you made many submissions and your submissions seemed to be improving, this might not actually be real progress. We had an earlier post on how the leaderboard can be misleading. Now that the first stage of the Challenge is complete, when you request scores on the holdout set, we will send you a graph of your performance over time on the real outcome data in the leaderboard as well as your performance on the imputed outcome data in the leaderboard. If your performance seems to be improving over time on the imputed outcome data, that is a sign that you have been overfitting to the leaderboard.

For example, consider these two case where the red line shows performance on the imputed leaderboard data and the blue line shows performance on the real leaderboard data.

In the first case, the performance on the real data improved (remember lower is better) and performance on the imputed data did not improve. In the second case, however, performance on the imputed data improves quite a bit, while performance on the real data remains relatively static. In this case, we suspect overfitting to the leaderboard, and we suspect that this person will perform worse on the holdout set than the leaderboard set.

Noisy leaderboard signal: One reason that your score might be better on the holdout set is that the leaderboard set included the randomly imputed outcome data. Your predictions for these cases were probably not very good, and there are no randomly imputed outcome cases in the holdout set (cases with no outcome data in the holdout set are ignored).

Random variation: One reason that your score on the holdout set could be higher or lower is that there are not a huge number of cases in the leaderboard set (530 people) or the holdout set (1,591 people). Also, of these roughly one third are missing outcome data. With this few cases, you should expect that your score will vary some from dataset to dataset.


We hope that this background information about the construction of the training, leaderboard, and holdout sets helps you understand your score. If you have any more questions, please let us know.

Using .dta files in Python

Uncategorized 4 comments
featured image

To make data cleaning easier, we’ve released a version of the background variables file in .dta format, generated by Stata. In addition to the table of background data, this file contains metadata on the types of each column, as well as a short label describing the survey questions that correspond to each column. Our hope is that this version of the data file will make it easier for participants to select and interpret variables in their predictive models. If you have any questions or suggestions, please let us know in the comments below!

Working with .dta files in Python

The primary way to work with a .dta file in Python is to use the read_stata() function in pandas, as follows:

import pandas as pd

df_path = "/Users/user/FFC/data/background.dta"

df = None
with open(df_path, "r") as f:
    df = pd.read_stata(f)
    print df.head()

This creates a pandas.DataFrame object that contains the background variables. By default, pandas will automatically retain the data types as defined in the .dta file.


  • Documentation for pandas.read_stata() is available here.
  • The read_stata() function accepts either a file path or a read buffer (as above).
  • The .dta file is generated by Stata 14. There are some compatibility issues with pandas and Stata 14 .dta files due to changes to field size limits from earlier versions of Stata. In particular, any UTF-8 decoding errors you run into are likely due to this issue. Please let us know if you run into any trouble working with the file in pandas!

Progress report: COS424 @ Princeton

Uncategorized 1 comment
featured image

As we near the midpoint of the Challenge, we are excited to report on the progress of our first cluster of participants: student teams in COS424, the machine learning fundamentals course at Princeton. You can find some schematic analyses of their performance over time, modeling strategies, and more here. Some of the students have open-sourced their code for all participants to use and learn from; you can find that code here.

Thanks to all the COS424 students for their awesome contributions!

Code Validation

Uncategorized No comments
featured image

As part of the challenge, we’re interested in understanding and learning from the strategies participants are using to predict outcomes in the Fragile Families data. One major goal of the challenge is to learn how these strategies evolve and develop over time. We think that a more systematic understanding of how social scientists and data scientists think with data has the potential to better inform how statistical analysis is done. To do this analysis, we use the code and narrative analysis included with each submission.

Recently, we updated the code that evaluates predictions to ensure that groups don’t forget to include their code in their submissions.

What does this mean for me?

  • Make sure that your directory contains all of the code you used to generate your predictions.
  • It’s not a problem if the code is in multiple/un-executable scripts. When we look over code submissions, we don’t execute the code.
  • If you run into an error when you submit your predictions that says you’ve forgotten your code, but your submission does actually contain the code you’ve been using, let us know as soon as possible!