Month May 2017

Month May 2017

Stata .dta file with metadata

Uncategorized No comments
featured image

In response to many requests from Challenge participants, we are now able to provide a .dta file in Stata 14 format. This file contains metadata which we hope will help participants to find variables of interest more easily.

Contents of the .dta file

If you have been working with our background.csv file and the codebooks available at fragilefamilies.princeton.edu, then this .dta file provides the same information you already had, but in a new format.

  • Each variable has an associated label which contains a truncated version of the survey question text.
  • For each categorical variable, the text meaning of each numeric level of that variable is recorded with a value label.

You are welcome to build models from the .csv file or from the .dta file.

Distribution of the .dta file

All new applicants to the Challenge will receive a zipped folder containing both background.csv and background.dta.

Anyone who received the data on or before May 24, 2017 may send an email to fragilefamilieschallenge@gmail.com to request a new version of the data file.

Using the .dta file

Stata users can easily load the .dta file, which is in Stata format.

We have prepared a blog post about using the .dta file in R and about using the .dta file in Python to facilitate use of the file in these other software packages.

We hope the metadata in this file enables everyone to build better models more easily!

Using .dta files in R

Uncategorized 2 comments
featured image

We’ve just released the Fragile Families Challenge data in .dta format, which means the files now include metadata that was not available in the .csv files that we initially released. The .dta format is native to Stata, and you might prefer to use R. So, in this post, I’ll give some pointers to getting up and running with the .dta file in R. If you have questions—and suggestions—please feel free to post them at the bottom of this post.

There are many ways to read .dta files into R. In this post I’ll use haven because it is part of the tidyverse.

Here’s how you can read in the .dta files (and I’ll read in the .csv file too so that we can compare them):

library(tidyverse)
library(haven)
ffc.stata <- read_dta(file = "background.dta")
ffc.csv <- read_csv(file = "background.csv")

One you start working with ffc.stata, one thing you will notice is that many columns are of type labelled, which is not common in R. To convert labelled to factors, use as_factor (not as.factor). For more on labelled and as_factors, see the documentation of haven.

Another thing you will notice is that some of the missing data codes from the Stata file don’t get converted to NA. For example, consider the variable "m1b9b11" for the person with challengeID 1104. This is a missing value that should be NA. This gets parsed correctly in the csv files but not the Stata file.

is.na(ffc.stata[(ffc.stata$challengeid==1104), "m1b9b11"])
is.na(ffc.csv[(ffc.csv$challengeID==1104), "m1b9b11"])

If you have questions---and suggestions---about working with .dta files in R, please feel free to post them below.

Notes:

  • The read_dta function in haven is a wrapper around the ReadStat C library.
  • The read.dta function in the foreign library was popular in the past, but that function is now frozen and will not support anything after Stata 12.
  • Another way to read .dta files into R is the readstata13 package, which, despite what the name suggests, can read Stata 13 and Stata 14 files.

Using .dta files in Python

Uncategorized 4 comments
featured image

To make data cleaning easier, we’ve released a version of the background variables file in .dta format, generated by Stata. In addition to the table of background data, this file contains metadata on the types of each column, as well as a short label describing the survey questions that correspond to each column. Our hope is that this version of the data file will make it easier for participants to select and interpret variables in their predictive models. If you have any questions or suggestions, please let us know in the comments below!

Working with .dta files in Python

The primary way to work with a .dta file in Python is to use the read_stata() function in pandas, as follows:

import pandas as pd

df_path = "/Users/user/FFC/data/background.dta"

df = None
with open(df_path, "r") as f:
    df = pd.read_stata(f)
    print df.head()

This creates a pandas.DataFrame object that contains the background variables. By default, pandas will automatically retain the data types as defined in the .dta file.

Notes

  • Documentation for pandas.read_stata() is available here.
  • The read_stata() function accepts either a file path or a read buffer (as above).
  • The .dta file is generated by Stata 14. There are some compatibility issues with pandas and Stata 14 .dta files due to changes to field size limits from earlier versions of Stata. In particular, any UTF-8 decoding errors you run into are likely due to this issue. Please let us know if you run into any trouble working with the file in pandas!

Machine-Readable Fragile Families Codebook

Uncategorized No comments

The Fragile Families and Child Wellbeing study has been running for more than 15 years. As such, it has produced an incredibly rich and complex set of documentation and codebooks. Much of this documentation was designed to be “human readable,” but, over the course of the Fragile Families Challenge, we have had several requests for a more “machine-readable” version of the documentation. Therefore, we are happy to announce that Greg Gundersen, a student in Princeton’s COS 424 (Barbara Engelhardt’s undergraduate machine learning class), has created a machine-readable version of the Fragile Families codebook in the form of a web API. We believe that this new form of documentation will make it possible for researchers to work with the data in unexpected and exciting ways.

There are three ways that you can interact with the documentation through this API.

First, you can search for words inside of question description field. For example, imagine that you are looking for all the questions that include the word “evicted”. You can find them by visiting this URL:
https://codalab.fragilefamilieschallenge.org/f/api/codebook/?q=evicted

Just put your search term after the “q” in the above URL.

The second main way that you can interact with the new documentation is by looking up the question data associated with a variable name. For example, want to know what is “cm2relf”? Just visit:
https://codalab.fragilefamilieschallenge.org/f/api/codebook/cm2relf

Finally, if you just want all of the questionnaire data, visit this URL:
https://codalab.fragilefamilieschallenge.org/f/api/codebook/

A main benefit of a web API is that researchers can now interact with the codebooks programmatically through URLs. For example, here is a snippet of Python 2 code that fetches the data for question “cm2mint'”:

>>> import urllib2
>>> import json
>>> response = urllib2.urlopen('https://codalab.fragilefamilieschallenge.org/f/api/codebook/cm2mint')
>>> data = json.load(response)
>>> data
[{u'source file': u'http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_mom_cb1.txt', u'code': u'cm2mint', u'description': u'Constructed - Was mother interviewed at 1-year follow-up?', u'missing': u'0/4898', u'label': u'YESNO8_mw2', u'range': [0, 1], u'unique values': 2, u'units': u'1', u'type': u'numeric (byte)'}]

We are very grateful to Greg for creating this new documentation and sharing it with everyone.

Notes:

  • Greg has open sourced all his code, so you can help us improve the codebook. For example, someone could write a nice front-end so that you can do more than just interact via the url.
  • The machine-readable documentation should include the following fields: description, source file, type, label, range, units, unique values, missing. If you develop code that can parse some of the missing fields, please let us know, and we can integrate your work into API.
  • The machine-readable documentation includes all the documentation that was in text files (e.g., http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_dad_cb5.txt). It does not include documentation that was in pdf format (e.g., http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_hv_cb5.pdf).
  • When you visit these urls, what gets returned is in JSON format, and different browsers render this JSON differently.
  • If there is a discrepancy between the machine-readable codebook and the traditional codebook, please let us know.
  • To deploy this service we used Flask, which is an open source project. Thank you to the Flask community.

Final submission deadline

Uncategorized No comments
featured image

The final submission deadline for the Fragile Families Challenge will be
2pm Eastern Daylight Time on Tuesday, August 1, 2017.

While it is tempting to stay open indefinitely to continue collecting high-quality submissions, closing is important so that we can conduct the targeted interviews within a reasonable timespan after the original interview, and so that the Fragile Families and Child Wellbeing Study can make the full data available to researchers.

How much should I trust the leaderboard?

Uncategorized No comments
featured image

The leaderboard on the Fragile Families Challenge submission site is often the first thing participants focus on. It is therefore important to understand!

Why do we like the leaderboard?

The leaderboard:

  • shows rankings in real-time, motivating better submissions
  • demonstrates that models that predict well in the training data do not necessarily perform well in an out-of-sample test
  • makes the Challenge more fun!

Understanding the data split

However, the leaderboard is only a small portion of the overall data. In fact, the observations (rows) in the data are split into:

  • 4/8 training data
  • 1/8 leaderboard data
  • 3/8 test data

As discussed in our blog post on evaluating submissions, final evaluation will be done on a separate set of held-out test data – the 3/8 portion referenced above. This means all awards (including the progress prizes) will be conducted on the test data, not the leaderboard. Likewise, our follow-up interviews will focus on the test set observations that were not used for training. Separation between the leaderboard and test sets is important; the leaderboard set isn’t truly held out since everyone receives repeated feedback from this set throughout the challenge!

Implications for strategy

What does this mean for your ideal strategy? How can you best make use of the leaderboard?

  • The leaderboard gives an instant snapshot of your out-of-sample performance. This can be useful in evaluating your model, much as splitting your own training set can be helpful.
  • However, over-fitting to the leaderboard will only hurt your score in the final test set evaluation
  • Leaderboard scores are noisy measures of generalization error because they are based on a small sample. So, even as a measure of generalization error, the leaderboard should be interpreted cautiously!

In summary, we expect some models to perform better in the final evaluation than the leaderboard suggests, due to random noise. Likewise, some models will look good on the leaderboard but perform poorly in the final evaluation because they got lucky in the leaderboard. Some submissions may even under-perform in the final evaluation because they made too many modeling adjustments to fit closely to idiosyncrasies of the leaderboard!

Your final evaluation will not be based on the leaderboard, so you are best advised to use it cautiously as one (noisy) bit of information about your generalization error.