Our Blog

Our Blog

Apply to participate

Uncategorized No comments

The Fragile Families Challenge is now closed. We are no longer accepting applications!


What will happen after I apply?

We will review your application and be in touch by e-mail. This will likely take 2-3 business days. If we invite you to participate, you will be asked to sign a data protection agreement. Ultimately, each participant will be given a zipped folder which consolidates all of the relevant pieces of the larger Fragile Families and Child Wellbeing Study in three .csv files.

background.csv contains 4,242 rows (one per child) and 12,943 columns:

  • challengeID: A unique numeric identifier for each child.
  • 12,942 background variables asked from birth to age 9, which you may use in building your model.

train.csv contains 2,121 rows (one per child in the training set) and 7 columns:

  • challengeID: A unique numeric identifier for each child.
  • Six outcome variables (each variable name links to a blog post about that variable)
    1. Continuous variables: grit, gpa, materialHardship
    2. Binary variables: eviction, layoff, jobTraining

prediction.csv contains 4,242 rows and 7 columns:

  • challengeID: A unique numeric identifier for each child.
  • Six outcome variables, as in train.csv. These are filled with the mean value in the training set. This file is provided as a skeleton for your submission; you will submit a file in exactly this form but with your predictions for all 4,242 children included.

Understanding the background variables

To use the data, it may be useful to know something about what each variable (column) represents. Full documentation is available here, but this blog post distills the key points.

Waves and child ages

The background variables were collected in 5 waves.

  • Wave 1: Collected in the hospital at the child’s birth.
  • Wave 2: Collected at approximately child age 1
  • Wave 3: Collected at approximately child age 3
  • Wave 4: Collected at approximately child age 5
  • Wave 5: Collected at approximately child age 9

Note that wave numbers are not the same as child ages. The variable names and survey documentation are organized by wave number.

Variable naming conventions

Predictor variables are identified by a prefix and a question number. Prefixes the survey in which a question was collected. This is useful because the documentation is organized by survey. For instance the variable m1a4 refers to the mother interview in wave 1, question  a4.

  1. The prefix c in front of any variable indicates variables constructed from other responses. For instance, cm4b_age is constructed from the mother wave 4 interview, and captures the child’s age (baby’s age).
  2. m1, m2, m3, m4, m5: Questions asked of the child’s mother in wave 1 through wave 5.
  3. f1,...,f5: Questions asked of the child's father in wave 1 through wave 5
  4. hv3, hv4, hv5: Questions asked in the home visit in waves 3, 4, and 5.
  5. p5: Questions asked of the primary caregiver in wave 5.
  6. k5: Questions asked of the child (kid) in wave 5
  7. ffcc: Questions asked in various child care provider surveys in wave 3
  8. kind: Questions asked of the kindergarten teacher
  9. t5: Questions asked of the teacher in wave 5.

Ready to work with the data?

See our posts on building a model and working with missing data.

About Matt Salganik

Matthew Salganik is a Professor of Sociology at Princeton University. He is also the author of Bit by Bit: Social Research in the Digital Age (http://www.bitbybitbook.com). You can learn more about his research at http://www.princeton.edu/~mjs3.

Add your comment