Author ilundberg

Helpful idea: Compare to the baseline

Uncategorized No comments

Participants often ask us if their scores on the leaderboard are “good”. One way to answer that question is with a comparison to the baseline model.

In the course of discussing how a very simple model could beat a more complex model, this post will also discuss the concept of overfitting to the training data and how this could harm predictive performance.

What is the baseline model?

We have introduced a baseline model to the leaderboard, with the username “baseline.” Our baseline prediction file simply takes the mean of each outcome in the training data, and predicts that mean value for all observations. We provided this file as “prediction.csv” in the original data folder sent to all participants.
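
For readers who want to reproduce this kind of file, here is a minimal Python sketch of the idea. It assumes the training outcomes use the column names from train.csv (challengeID plus the six outcomes) and that missing outcomes appear as NA or blank; the function names and toy rows are illustrative, not the code we used to build prediction.csv.

```python
import csv
import io
from statistics import mean

OUTCOMES = ["gpa", "grit", "materialHardship", "eviction", "layoff", "jobTraining"]

def baseline_means(train_rows):
    """Mean of each outcome over the training rows, skipping missing values."""
    means = {}
    for outcome in OUTCOMES:
        # Assumption: missing outcomes are stored as "NA" or an empty string.
        values = [float(row[outcome]) for row in train_rows
                  if row.get(outcome) not in (None, "", "NA")]
        means[outcome] = mean(values)
    return means

def baseline_predictions(challenge_ids, means):
    """Predict the same training mean for every observation."""
    return [{"challengeID": cid, **means} for cid in challenge_ids]

# Toy stand-in for train.csv (made-up numbers, three rows only).
toy_train = list(csv.DictReader(io.StringIO(
    "challengeID,gpa,grit,materialHardship,eviction,layoff,jobTraining\n"
    "1,3.0,3.5,0.0,0,0,1\n"
    "2,NA,3.0,0.2,0,1,0\n"
    "3,2.0,NA,0.1,1,NA,0\n"
)))
toy_means = baseline_means(toy_train)
toy_predictions = baseline_predictions([1, 2, 3, 4], toy_means)
```

Every row of toy_predictions carries the same six numbers, which is all the baseline does.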

How is the baseline model performing?

As of the writing of this post (12:30pm EDT on 15 July 2017), the baseline model ranks as follows, with 1 being the best score:

  • 70 / 170 unique scores for GPA
  • 37 / 128 for grit
  • 60 / 99 for material hardship
  • 37 / 96 for eviction
  • 32 / 85 for layoff
  • 30 / 87 for job training

In all cases except for material hardship, the baseline model is in the top half of scores!

A quick way to evaluate the performance of your model is to see the extent to which it improves over the baseline score.
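
One way to put a number on that improvement is sketched below in Python. It assumes that scores are mean squared errors (lower is better) and that you have set aside some training rows with known outcomes to check against; the function names and toy numbers are illustrative.

```python
def mean_squared_error(truth, predictions):
    """Average squared error over rows where the true outcome is observed."""
    pairs = [(t, p) for t, p in zip(truth, predictions) if t is not None]
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)

def improvement_over_baseline(truth, model_predictions, baseline_value):
    """Share of the baseline's squared error that the model removes.

    Positive values mean the model beats the baseline; negative values
    mean that predicting the training mean for everyone would do better.
    """
    baseline_mse = mean_squared_error(truth, [baseline_value] * len(truth))
    model_mse = mean_squared_error(truth, model_predictions)
    return 1 - model_mse / baseline_mse

# Toy held-out outcomes (None marks a missing outcome) and predictions.
holdout_truth = [2.0, 3.0, 4.0, None]
model_preds = [2.5, 3.0, 3.5, 3.0]
training_mean = 3.0
gain = improvement_over_baseline(holdout_truth, model_preds, training_mean)
```

In this toy example the model removes three quarters of the baseline's squared error, so gain is 0.75.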

How can the baseline do so well?

How can a model with no predictors outperform a model with predictors? One source of this conundrum is the problem of overfitting.

As the complexity of a model increases, the model becomes more able to fit the idiosyncrasies of the training data. If these idiosyncrasies represent something true about the world, then the more complex fit might also create better predictions in the test data.

However, at some point, a complex model will begin to pick up random noise in the training data. This will reduce prediction error in the training sample, but can make predictions worse in the test sample!


Note: Figure inspired by Figure 7.1 in The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, which provides a more thorough overview of the problem of overfitting and the bias-variance tradeoff.

How can this be? A classical result in statistics shows that the mean squared prediction error can be decomposed into the bias squared plus the variance (plus irreducible noise that no model can remove). Thus, even if additional predictors reduce the bias in predictions, they can harm predictive performance if they substantially increase the variance of predictions by incorporating random noise.
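
For reference, the standard form of this decomposition is written out below. The notation is our assumption for illustration: the outcome is y = f(x) + ε with noise of mean zero and variance σ², f̂ is the fitted model, and the expectation is taken over training samples and over the noise in a new observation at a fixed point x.

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^{2}\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^{2}}_{\text{bias}^{2}}
  + \underbrace{\operatorname{Var}\bigl(\hat{f}(x)\bigr)}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{irreducible noise}}
```

Adding predictors typically shrinks the first term and inflates the second, while the third term does not depend on the model at all.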

What can be done?

We have been surprised at how a seemingly small number of variables can yield problems of overfitting in the Fragile Families Challenge. A few ways to combat this problem are:

  • Choose a small number of predictors carefully based on theory
  • Use a penalized regression approach such as LASSO or ridge regression.
    • For an intuitive introduction to these approaches, see Efron and Hastie Computer Age Statistical Inference [book site], sections 7.3 and 16.2.
    • The glmnet package in R [link] is an easy-to-use implementation of these methods. Many other software options are also available.
  • Use cross-validation to estimate your model’s generalization error within the training set. For an introduction, see chapter 12 of Efron and Hastie [book site]
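
To make the cross-validation suggestion concrete, here is a small Python sketch that compares the mean-only baseline to a one-predictor least-squares line using k-fold cross-validation. Everything in it (the function names, the simulated data, the choice of 5 folds) is an illustrative assumption; in practice you would substitute your own model and the Challenge training data.

```python
import random

def fit_mean(xs, ys):
    """Baseline: ignore the predictor and always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Least-squares line with a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: y_bar + slope * (x - x_bar)

def cross_validated_mse(xs, ys, fit, k=5, seed=0):
    """Estimate out-of-sample squared error using only the training rows."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on k-1 folds, then score on the fold that was left out.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        squared_errors.extend((ys[i] - predict(xs[i])) ** 2 for i in held_out)
    return sum(squared_errors) / len(squared_errors)

# Simulated data in which the predictor carries real signal.
rng = random.Random(1)
toy_x = [rng.gauss(0, 1) for _ in range(200)]
toy_y = [2.0 + 0.8 * x + rng.gauss(0, 0.5) for x in toy_x]
cv_mean = cross_validated_mse(toy_x, toy_y, fit_mean)
cv_line = cross_validated_mse(toy_x, toy_y, fit_line)
```

Here the line should beat the mean because the signal is strong by construction. With weak or noisy predictors the comparison can go the other way, which is exactly what this check is meant to reveal before you submit.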

But at minimum, compare yourself to the baseline to make sure you are doing better than a naive prediction of the mean!

Helpful idea: Read prior research

Not an expert in child development, poverty, or family sociology? Participants often wonder how they can contribute if they have no prior knowledge of these fields. Luckily, there are a few resources to bring you up to speed quickly!

Fact sheet

The Fragile Families and Child Wellbeing Study (FFCWS) Fact Sheet can quickly introduce the key findings from the broader FFCWS. For instance, the study discovered that “single” parenthood is a bit of a misnomer; about half of the unmarried parents in the sample were actually living together when the child was born! Yet many of these couples subsequently separated.

Research briefs

Looking for more detailed information on a particular subfield? The Fragile Families Research Briefs provide accessible summaries of cutting edge research using the data.

Publication collection

Want to know how social scientists are using the data right now? The Fragile Families publication collection lists hundreds of published articles and working papers using the Fragile Families and Child Wellbeing Study. If you want to see how social scientists have used the data and get ideas for variables you may want to include in your models, the publication collection is a good place to start.

Other publications

A more exhaustive list of published resources is available here.

Helpful ideas series

This is the first in a series of blog posts with ideas to help you build better models. Look for more to come soon! For email notifications when we make new posts, subscribe in the box at the top right of this page.

Getting started quickly in the Fragile Families Challenge

Want to build your first submission to the Fragile Families Challenge in an hour? In this post, we’ll tell you the trick to getting started quickly: the constructed variables.

If you’ve never worked with the Fragile Families data before, it can seem daunting. The background file contains 12,943 variables (columns) for 4,242 children (rows), but 56% of the cells in this matrix are missing! Participants often begin by trying to read all the documentation, clean all of the variables, and impute reasonable values for the missing cells. This quickly becomes demoralizing. What else can you do?

Our overall recommendation is to begin with the constructed variables. These 600 variables were “constructed” by the Fragile Families research staff in order to help future researchers, and they were constructed based on multiple reports in order to reduce missing data. For example, the variable cm1relf consolidates the key information from 5 questions asked of the mother about her relationship with the father at the birth of the child. The constructed variables are a great place to start because they:

  • represent constructs social scientists believe to be important
  • have very little missing data
  • are easy to identify because they begin with the letter c (e.g., cm1ethrace is constructed wave 1 mother’s ethnicity and race)
    • There are a small number of exceptions to this convention. For instance, the variable t5tint is a constructed variable indicating whether the teacher was interviewed in wave 5. However, the vast majority of constructed variables begin with c.
    • When we say that constructed variables have little missing data, this statement is restricted to constructed variables that have some data at all. In other words, some constructed variables are all NA in the Challenge file (e.g., cm1tdiff).
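
As a rough illustration of how you might shortlist constructed variables, here is a Python sketch that keeps columns starting with c and drops those that are mostly missing. It assumes rows are dictionaries keyed by column name (as csv.DictReader produces) and that blanks, NA, and all negative codes count as missing, which is a simplification. The name m1a4 in the toy data is a hypothetical non-constructed variable, and the values are made up.

```python
def is_missing(value):
    """Treat blanks, 'NA', and negative codes as missing (a simplifying assumption)."""
    if value is None:
        return True
    text = str(value).strip()
    if text in ("", "NA"):
        return True
    try:
        return float(text) < 0
    except ValueError:
        return False  # non-numeric text is kept as observed

def usable_constructed_variables(rows, max_missing_share=0.1):
    """Names starting with 'c' whose share of missing values is small enough."""
    # challengeID also starts with 'c', so it is excluded by name.
    candidates = [name for name in rows[0]
                  if name.startswith("c") and name != "challengeID"]
    keep = []
    for name in candidates:
        share = sum(is_missing(row[name]) for row in rows) / len(rows)
        if share <= max_missing_share:
            keep.append(name)
    return keep

# Toy rows; cm1tdiff is all NA, as described above.
toy_rows = [
    {"challengeID": "1", "cm1ethrace": "1", "cm1tdiff": "NA", "m1a4": "2", "cm1relf": "-3"},
    {"challengeID": "2", "cm1ethrace": "2", "cm1tdiff": "NA", "m1a4": "1", "cm1relf": "1"},
    {"challengeID": "3", "cm1ethrace": "3", "cm1tdiff": "NA", "m1a4": "-9", "cm1relf": "2"},
]
selected = usable_constructed_variables(toy_rows, max_missing_share=0.5)
```

A prefix filter like this will miss exceptions such as t5tint, so treat the result as a starting list rather than a complete one.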

These constructed variables are more fully documented on pp. 13-20 of the general study documentation. They are also summarized in this participant-generated open-source dictionary.

A good strategy to get started quickly is to pick some constructed variables, build a very simple model, and get yourself on the leaderboard! You can always build up from there. Participants often begin with cm1ethrace, cf1ethrace, cm1edu, cf1edu, and cm1relf.

Even if you start with the constructed variables, you will be frustrated by missing data. As summarized in our blog post, there is no perfect solution to this problem. We recommend the following workflow:

  1. Start with a small fraction of the total variables. Focus on imputing the missing values for this subset, rather than for all variables in the entire file.
  2. Decide how to address informative missing values (e.g., -6, valid skip). For categorical variables, you might treat valid skips as their own category.
  3. Impute remaining missing values with mean or median imputation. We know that mean or median imputation isn’t great, but it is a reasonable starting point, and you can move to model-based imputation later.
  4. Fit models on your imputed dataset.
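
Steps 2 and 3 of this workflow might look like the following Python sketch for a single numeric column. It is one possible variant under stated assumptions: valid skips (-6) are recorded in a separate indicator so that the information is not lost, every other negative code is treated as plain missing, and the function names are ours.

```python
from statistics import median

VALID_SKIP = -6  # the valid-skip code mentioned in step 2

def split_valid_skip(values):
    """Step 2: flag valid skips in their own column, then blank out all missing codes."""
    skipped = [1 if v == VALID_SKIP else 0 for v in values]
    # Assumption: None and any negative code mean the value is not observed.
    cleaned = [None if (v is None or v < 0) else v for v in values]
    return cleaned, skipped

def impute_median(values):
    """Step 3: replace missing entries with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Toy column with a valid skip (-6), another missing code (-9), and a blank.
raw = [3, -6, 1, None, -9, 2, 4]
cleaned, skipped_flag = split_valid_skip(raw)
imputed = impute_median(cleaned)
```

The model in step 4 can then use both imputed and skipped_flag as predictors. For a categorical variable you would instead keep the skip as its own level, as suggested in step 2.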

Constructed variables – data dictionary

We are happy to announce that Challenge participants Aarshay Jain, Bindia Kalra, and Keerti Agrawal at Columbia University have created a new resource that should make working with the Challenge data more efficient. More specifically, they created an alternative data dictionary for the constructed variables (FFC_Data_Dictionary.xlsx). They have made it available open-source here.

Their dictionary:

  • Summarizes constructed variable prefixes and suffixes
  • Categorizes questions by the respondent to and subject of the question
  • Provides examples of questions from a variety of substantive categories

As discussed in our blog post on getting started quickly, the constructed variables are a good place to start when choosing variables to include in your model. These variables are summarized on pp. 13-20 of the general study documentation.

The official Fragile Families and Child Wellbeing Study site is still the authoritative source of documentation, but we hope this open source contribution helps you more quickly understand the variables available and how to find them.

The open-source movement is exciting because it unlocks the power of what we can do by collaboration. Much like a Wikipedia page benefits when hundreds of people view it and think about improvements they could make, so too will the open-source resources for the Fragile Families Challenge shine if others get involved when they think of possible improvements. If you think you can make this data dictionary better, please jump in, open-source your new version, and let us know so we can publicize it! In fact, Aarshay, Bindia, and Keerti would love to see these kinds of improvements. Likewise, we welcome any other open-source contributions that you think might make the Challenge better.

Many thanks to Aarshay, Bindia, and Keerti for making it easier for others to use the data!

Getting started with Stata

This post summarizes how to work on the Fragile Families Challenge data in Stata.

We only cover the basics here. For more detailed example code, see our open-source repository, thanks to Jeremy Freese.

How do I import the data?

Before loading the data, you may need to increase the number of variables Stata will hold.
set maxvar 13000

Then, change your working directory to the place where the file is located, using
cd your_directory

Load the training outcomes
import delimited train.csv, clear case(preserve) numericcols(_all)
Two options there are critical:

  • The case(preserve) option ensures that the case of variable names is preserved. Omitting this option will produce errors in your submission since capitalization in variable names is required (e.g., challengeID), but Stata’s default makes all variable names lower case.
  • The numericcols(_all) option ensures that the outcomes are read as numeric, rather than as character strings.

Merge the background variables to that file using the challengeID identifier.
merge 1:1 challengeID using background.dta

  • You will see that 2,121 observations were in both datasets. These are the training observations for which we are providing the age 15 outcomes.
  • You will also see that 2,121 observations were only in the using file, since the background variables but not the outcomes are available for these cases. These are the test cases on which your predictions will be evaluated.

If you have an older version of Stata, you may not be able to open the .dta file with metadata. You can still load the background file from the .csv format. To do that, you should first load the .csv file and save it in a .dta format you can use. Then, follow the instructions above.
import delimited background.csv, clear case(preserve)
save background.dta, replace

Again, note the important case(preserve) option!

How do I make predictions?

If your model is linear or logistic regression, then you can use the predict command.
regress gpa your_predictors
predict pred_gpa

Then the variable pred_gpa has your predictions for GPA. You can do this for all 6 outcomes.

How do I export my submission?

This section assumes your predicted values are named pred_gpa, pred_grit, etc. First, select only the identifier and the predictions.
keep challengeID pred_*
Then, rename all your predictions to drop the prefix pred_.
local outcomes gpa grit materialHardship eviction layoff jobTraining
foreach outcome of local outcomes {
rename pred_`outcome' `outcome'
}

Finally, export the prediction file as a .csv.
export delimited using prediction.csv, replace
Then, bundle this file with your code and narrative description as described in the blog post on uploading your contribution!

Stata .dta file with metadata

In response to many requests from Challenge participants, we are now able to provide a .dta file in Stata 14 format. This file contains metadata which we hope will help participants to find variables of interest more easily.

Contents of the .dta file

If you have been working with our background.csv file and the codebooks available at fragilefamilies.princeton.edu, then this .dta file provides the same information you already had, but in a new format.

  • Each variable has an associated label which contains a truncated version of the survey question text.
  • For each categorical variable, the text meaning of each numeric level of that variable is recorded with a value label.

You are welcome to build models from the .csv file or from the .dta file.

Distribution of the .dta file

All new applicants to the Challenge will receive a zipped folder containing both background.csv and background.dta.

Anyone who received the data on or before May 24, 2017 may send an email to fragilefamilieschallenge@gmail.com to request a new version of the data file.

Using the .dta file

Stata users can easily load the .dta file, which is in Stata format.

We have prepared a blog post about using the .dta file in R and about using the .dta file in Python to facilitate use of the file in these other software packages.

We hope the metadata in this file enables everyone to build better models more easily!

Final submission deadline

The final submission deadline for the Fragile Families Challenge will be
2pm Eastern Daylight Time on Tuesday, August 1, 2017.

While it is tempting to stay open indefinitely to continue collecting high-quality submissions, closing is important so that we can conduct the targeted interviews within a reasonable timespan after the original interview, and so that the Fragile Families and Child Wellbeing Study can make the full data available to researchers.

How much should I trust the leaderboard?

The leaderboard on the Fragile Families Challenge submission site is often the first thing participants focus on. It is therefore important to understand!

Why do we like the leaderboard?

The leaderboard:

  • shows rankings in real-time, motivating better submissions
  • demonstrates that models that predict well in the training data do not necessarily perform well in an out-of-sample test
  • makes the Challenge more fun!

Understanding the data split

However, the leaderboard is based on only a small portion of the overall data. In fact, the observations (rows) in the data are split into:

  • 4/8 training data
  • 1/8 leaderboard data
  • 3/8 test data

As discussed in our blog post on evaluating submissions, final evaluation will be done on a separate set of held-out test data: the 3/8 portion referenced above. This means all awards (including the progress prizes) will be determined using the test data, not the leaderboard. Likewise, our follow-up interviews will focus on the test set observations that were not used for training. Separation between the leaderboard and test sets is important; the leaderboard set isn’t truly held out since everyone receives repeated feedback from this set throughout the challenge!

Implications for strategy

What does this mean for your ideal strategy? How can you best make use of the leaderboard?

  • The leaderboard gives an instant snapshot of your out-of-sample performance. This can be useful in evaluating your model, much as splitting your own training set can be helpful.
  • However, over-fitting to the leaderboard will only hurt your score in the final test set evaluation.
  • Leaderboard scores are noisy measures of generalization error because they are based on a small sample. So, even as a measure of generalization error, the leaderboard should be interpreted cautiously!
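
To get a feel for how much noisier a 1/8 sample is than a 3/8 sample, here is a small Python simulation. It is only a sketch under simplifying assumptions: prediction errors are drawn from a standard normal distribution, every row has an observed outcome, and the split sizes are computed from the 4,242 rows mentioned elsewhere on this blog. The function name is ours.

```python
import random
from statistics import stdev

def simulated_scores(sample_size, replications=300, seed=0):
    """Mean squared error of one fixed model on repeated random evaluation samples."""
    rng = random.Random(seed)
    scores = []
    for _ in range(replications):
        # Assumption: prediction errors are independent standard normal draws.
        errors = [rng.gauss(0, 1) for _ in range(sample_size)]
        scores.append(sum(e * e for e in errors) / sample_size)
    return scores

TOTAL_ROWS = 4242
leaderboard_n = TOTAL_ROWS // 8       # roughly 1/8 of the rows
test_n = 3 * TOTAL_ROWS // 8          # roughly 3/8 of the rows

# Spread of the score across hypothetical re-draws of each evaluation set.
leaderboard_spread = stdev(simulated_scores(leaderboard_n))
test_spread = stdev(simulated_scores(test_n))
```

The same model's score moves around more on the smaller leaderboard sample than on the larger test sample, so small differences in leaderboard rank may reflect nothing more than chance.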

In summary, we expect some models to perform better in the final evaluation than the leaderboard suggests, due to random noise. Likewise, some models will look good on the leaderboard but perform poorly in the final evaluation because they got lucky on the leaderboard. Some submissions may even under-perform in the final evaluation because they made too many modeling adjustments to fit closely to idiosyncrasies of the leaderboard!

Your final evaluation will not be based on the leaderboard, so you are best advised to use it cautiously as one (noisy) bit of information about your generalization error.

Getting started workshop at PAA

The Fragile Families Challenge is excited to host a getting started workshop at the Annual Meeting of the Population Association of America in Chicago!

We will

  • Present a few slides introducing the Challenge (SLIDES HERE)
  • Provide food and a friendly collaborative environment
  • Work together to produce your first submission!

When: 10am – 2pm, Thursday, April 27
Where: Hilton Chicago, Conference Room 4G (DIRECTIONS: Come to the 4th floor and we’re the room way down at the end.)
Who: You! Anyone involved in social science and/or data science can make an important contribution.
RSVP: Mention you’re coming to our PAA workshop when you apply to participate!

We hope to see you there!

Reading survey documentation

The Fragile Families survey documentation can be confusing. We’ve put together this blog post so you can find out what variables in the Challenge data file mean.

Using the Fragile Families website

The first place to go to find out what a given variable represents is the Fragile Families and Child Wellbeing Study website: http://www.fragilefamilies.princeton.edu/

Once there, click the “Data and Documentation” tab.

This brings you to the main documentation for the full study. On the left, you will see a set of links that will take you to the documentation for particular waves of the data.

Clicking on the link for Year 9 (Wave 5) as an example, we see the following page of documentation for this survey.

Let’s look at the mother questionnaire and codebook. On page 5 of the questionnaire, you will see the following question:

In the corresponding codebook, we see the count of respondents who gave each answer:

Two things are worth noting here.

  1. The question referred to in the questionnaire as A3B is called m5a3b in the codebook. This is because the prefix “m5” indicates that this question comes from the mother wave 5 interview.
  2. Lots of people got coded -6 for “Skip.” Looking back at the questionnaire, we can see why they were skipped over this question: it was only asked of those for whom “PCG = NONPARENT AND RELATIONSHIP = FOSTER CARE.” For children not in foster care, this question would not be meaningful, so it wasn’t asked.

In general, the questionnaires are the best source for information about why certain respondents get skipped over questions. For more information on all the ways data can be missing, see our blog post on missing data.

Structure of the variable names

The general structure of the variable names is [prefix for questionnaire type][wave number][question number].

What are all the variable prefixes?

The most common prefixes are:

| Prefix | Meaning |
| --- | --- |
| m | Mother |
| f | Father |
| h or hv | Home visit |
| p | Primary caregiver |
| k | Kid (interview with the child) |
| kind_ | Kindergarten teacher |
| t | Teacher |
| ffcc_[something] | Child care surveys. For a full list of the [something] see this documentation. |

Constructed variables: An additional prefix

Some variables have been constructed based on responses to several questions. These are often variables that are particularly relevant to the models many researchers want to estimate. These variables add the additional prefix c to the front of the variable name. For instance, cm1ethrace indicates constructed mother’s wave 1 race/ethnicity.

What are the wave numbers?

It’s easy to talk about the questionnaires by the rough child ages at which they were conducted. This is how the documentation website is organized. However, the variable names always refer to wave numbers, not child ages. It’s important not to get confused on this point. The table below summarizes the mapping between wave numbers and approximate child ages.

| Wave number | Approximate child age |
| --- | --- |
| 1 | 0, often called “baseline” |
| 2 | 1 |
| 3 | 3 |
| 4 | 5 |
| 5 | 9 |

What are the question numbers?

Question numbers typically begin with a letter and a number, e.g., a3.

  • In questionnaires, questions are referred to by question number alone.
  • In codebooks, questions are referred to by a prefix and then a question number.
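
The naming convention described in this post can be sketched as a small Python parser. This is an illustration, not an official tool: it handles only the single-letter prefixes and hv followed by a wave number from 1 to 5, it returns None for names that follow other patterns (such as kind_ and ffcc_ variables), and the names in it are our own.

```python
import re

RESPONDENTS = {
    "m": "mother", "f": "father", "h": "home visit", "hv": "home visit",
    "p": "primary caregiver", "k": "child", "t": "teacher",
}
WAVE_TO_AGE = {1: 0, 2: 1, 3: 3, 4: 5, 5: 9}  # approximate child age by wave

# Optional constructed prefix "c", respondent prefix, wave digit, question part.
NAME_PATTERN = re.compile(r"^(c?)(hv|m|f|h|p|k|t)([1-5])(.+)$")

def parse_variable_name(name):
    """Split a variable name into its parts, or return None if it does not fit."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    constructed, prefix, wave, question = match.groups()
    wave_number = int(wave)
    return {
        # Only the leading "c" is checked, so exceptions like t5tint read as False.
        "constructed": constructed == "c",
        "respondent": RESPONDENTS[prefix],
        "wave": wave_number,
        "approx_child_age": WAVE_TO_AGE[wave_number],
        "question": question,
    }

example = parse_variable_name("m5a3b")
constructed_example = parse_variable_name("cm1ethrace")
```

For m5a3b this gives the mother interview, wave 5 (about age 9), question a3b, matching the codebook example above.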

How do I find a question I care about?

You might want to find a particular question. For instance, when modeling eviction or material hardship at age 15, you might want to include the same measures collected at age 9. If you ctrl+F or cmd+F for “evicted” in the mother or father codebook or questionnaire at age 9, you will find these variables. In this case, they are m5f23d and f5f23d.

GPA

GPA measures academic achievement.

We want to know:

  • What helps disadvantaged children to beat the odds and succeed academically?
  • What derails children so that they perform unexpectedly poorly?

Survey question

How we cleaned the data

Our measure of GPA is self-reported by the child at approximately age 15. We marked as NA the GPAs of children who were not interviewed, reported no grade, refused to answer, did not know, or were homeschooled, for any of the four subjects. For children with valid answers, we averaged the responses for all four subjects, then subtracted this number from 5 to produce an estimate of child GPA ranging from 1 to 4. In our re-coded variable, a GPA of 4.0 indicates that the child reported straight As, while a GPA of 1.0 indicates that the child reported getting all grades of D or lower.
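
The recoding described above can be sketched in Python as follows. The raw coding (1 for an A through 4 for a D or lower) is inferred from the description rather than copied from the codebook, so treat it as an assumption; the function name is ours.

```python
def recode_gpa(subject_grades):
    """Turn four self-reported subject grades into a GPA between 1 and 4.

    Assumed raw coding: 1 = A, 2 = B, 3 = C, 4 = D or lower.
    Any other value (refused, don't know, no grade, ...) makes the GPA missing.
    """
    if len(subject_grades) != 4:
        return None
    if any(grade not in (1, 2, 3, 4) for grade in subject_grades):
        return None
    # Average the four subjects, then flip the scale so that higher is better.
    return 5 - sum(subject_grades) / 4

straight_as = recode_gpa([1, 1, 1, 1])
mixed = recode_gpa([1, 2, 2, 3])
missing = recode_gpa([1, -1, 2, 3])
```

Here straight_as is 4.0, mixed is 3.0, and missing is None because one subject has an invalid answer.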

Distribution in the training set

Scientific motivation

Helping kids “beat the odds” academically is a fundamental goal of education research; academic success can be the key to breaking the cycle of poverty. Free public education is often referred to as a great equalizer, yet children who grow up in disadvantaged families consistently underperform their more affluent peers on average.

However, the average is not the whole story. Some kids do well despite being expected to do poorly. In fact, the amount of unexplained variation in educational achievement is enormous: social science models typically have R-squared values of 0.2 or less. The poor predictive performance of social science models of educational attainment has long been known. In the now-classic 1972 book Inequality: A Reassessment of the Effect of Family and Schooling in America, Harvard social scientist Christopher Jencks argued that random chance played a larger role than measured family background characteristics in determining socioeconomic outcomes.

While social scientists have learned some about what helps children succeed academically in the decades since 1972, a huge proportion of the variance remains buried in the error term of regression models. Is this term truly random chance, or is there “dark matter” out there in the form of unmeasured but important variables that help some kids to beat the odds?

By submitting a model for GPA at age 15, you help us in our quest to find this dark matter. Based on our collaborative model combining all of the individual submissions, we will identify our best guess as a scientific community about how children are expected to perform at age 15. Then, we will identify a subset of children performing much better and worse than expected. We will interview these children to answer the question: what unmeasured variables are common to the kids who are beating the odds, which we do not observe among the children who are struggling unexpectedly?

When you participate, you help us target interviews at the children whose outcomes are least well explained by our measured variables. These children are best-positioned for exploratory qualitative research to uncover unmeasured but important factors. Interviews may help us learn how some kids beat the odds, these results may drive future deductive research to evaluate the causal effect of these unmeasured variables, and ultimately we hope that policymakers can intervene on the “dark matter” we find in order to improve the lives of other disadvantaged children in the future.

Grit

Grit is a measure of passion and perseverance. It predicts success in many domains. The causes of grit remain unknown.

We want to know: What makes some kids unexpectedly grittier than others in adolescence?

Survey questions

The survey questions are adapted from the grit scale proposed by Duckworth, Peterson, Matthews, and Kelly (2007).

How we cleaned the data

Our measure of grit is based on the four questions above, as answered by the child at approximately age 15. These items were part of a longer battery of questions capturing a wider range of attitudes, emotions, and outlooks. Children who refused any of the four questions or didn’t know how to answer were coded as NA, as were children who did not complete the age 15 interview. For children with four valid answers, we averaged the answers and subtracted the result from 5. This created a continuous scale ranging from 1 to 4. The way we have recoded it, a high score on our variable indicates more grit.

Distribution in the training set

Scientific motivation

Do you keep working when the going gets tough? If so, you probably have a lot of grit.

University of Pennsylvania psychologist and MacArthur “Genius” award winner Angela Duckworth has found that grit predicts all kinds of measures of success: persistence through a military training program at West Point, advancement through the Scripps National Spelling Bee, and educational attainment, to name a few. Duckworth’s work has reached the general public through her TED talk and NY Times bestseller Grit: The Power of Passion and Perseverance.

While it is clear that grit predicts success, it is less clear what causes some people to be grittier than others. How can we help more disadvantaged children to exhibit grit?

A few researchers have begun to examine this question. In their book Coming of Age in the Other America, social scientists Stefanie DeLuca (Johns Hopkins University), Susan Clampet-Lundquist (St. Joseph’s University), and Kathryn Edin (Johns Hopkins University) argue that kids growing up in impoverished urban neighborhoods are often inspired to have grit when they develop passion for an “identity project”: a personal passion that gives them something to aspire toward beyond the challenges of the present day. This ethnographic work exemplifies how qualitative social science research may be able to uncover previously unmeasured sources of grit.

How much more could we learn if qualitative interviews were targeted at the kids best positioned to be informative about unmeasured sources of grit? By participating, you can help us build a community model for grit measured in adolescence. The combined submissions of all who participate will identify our common agreement about the amount of grit we expect to see in the Fragile Families respondents, given all of their childhood experiences from birth to age 9. By interviewing children who have much more or much less grit than we all expect, we will uncover unmeasured factors that predict grit. It is our hope that these unmeasured factors can inform future deductive evaluations and ultimately policy interventions to help kids break the cycle of poverty by developing grit.

Grit is an important predictor of success, but the causes of grit are largely unknown. Be part of the solution and help us target interviews toward those best positioned to show us these unmeasured sources of grit. Apply to participate, build a model, and upload your contribution.

Material hardship

Material hardship is a measure of extreme poverty.

We want to know:

  • What helps families to unexpectedly escape extreme poverty?
  • What leads families to fall into extreme poverty unexpectedly?

Survey questions

How we cleaned the data

These questions were asked of the child’s primary caregiver when the child was approximately age 15. We marked as NA material hardship for children whose caregivers did not participate in the survey, didn’t know the answer to one or more questions, or refused one or more questions. Our material hardship measure is the proportion of these 11 questions for which the child’s caregiver answered “Yes.” Material hardship ranges from 0 to 1, with higher values indicating more material hardship.
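
As a sketch of this calculation in Python, assume each of the 11 answers has already been coded 1 for “Yes” and 0 for “No”, with anything else (such as None) standing for a refusal or a “don’t know”. That coding and the function name are our assumptions for illustration.

```python
def material_hardship(answers):
    """Proportion of the 11 hardship questions answered 'Yes'.

    Assumed coding: 1 = yes, 0 = no. Any other value marks a refusal or
    don't know, which makes the whole measure missing.
    """
    if len(answers) != 11:
        return None
    if any(answer not in (0, 1) for answer in answers):
        return None
    return sum(answers) / 11

no_hardship = material_hardship([0] * 11)
some_hardship = material_hardship([1, 1, 1] + [0] * 8)
unknown = material_hardship([1, None] + [0] * 9)
```

A caregiver who answers “Yes” to 3 of the 11 questions gets a score of 3/11, or about 0.27.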

Distribution in the training set

Scientific motivation

In his 1964 State of the Union Address, President Lyndon B. Johnson declared an “all-out war on human poverty and unemployment in these United States.” In the decades since, America has taken great strides toward this goal. However, severe deprivation remains a problem today. In $2 a Day: Living on Almost Nothing in America, Johns Hopkins sociologist Kathryn Edin and University of Michigan social work professor H. Luke Schaefer bring us into the lives of American families living in the nightmare of extreme poverty.

What can be done to reduce extreme poverty? By identifying families who unexpectedly escape extreme poverty, as well as those who unexpectedly fall into it, we hope to uncover unmeasured but important factors that affect severe deprivation.

Measuring extreme poverty is hard. The material hardship scale was originally proposed in a 1989 paper by Susan Mayer and Christopher Jencks, then social scientists at Northwestern University. Rather than focusing solely on respondents’ incomes, Mayer and Jencks asked respondents about particular needs that they were unable to meet. This scale proved fruitful and captured a dimension of poverty above and beyond what was captured by income alone. With minor modifications, the material hardship scale became a standard measure in the federal Survey of Income and Program Participation (SIPP), and it has been included in several waves of the Fragile Families Study.

By participating, you help us to identify the level of material hardship that is expected at age 15 for each of the families in the Fragile Families Study. By combining all of the submissions in one collaborative model, we will produce the best guess by the scientific community of the experiences we expect for families at age 15. Undoubtedly, some families will report much more or much less material hardship than we expect. By interviewing these families, we hope to discover unmeasured but important factors that are associated with sudden dives into material hardship or unexpected recoveries.

The results of these exploratory interviews can then inform future deductive social science research and help us propose policies that could help families to escape severe deprivation. You can help us to target these interviews at the families best positioned to help. Be a part of the solution: apply to participate, build a model, and upload your contribution.

Eviction

Eviction is a traumatic experience in which families are forced from their homes for not paying the rent or mortgage.

We want to know: As children transition into adulthood, does eviction cause negative outcomes?

Survey question

When children were about 15 years old, each child’s primary caregiver was asked the following question:

How we cleaned the data

Those who did not participate in the age 15 interview, as well as those who refused (-1) or didn’t know (-2), were coded as NA. Those who responded “Yes” were coded 1, and those who responded “No” were coded 0. We additionally coded as 1 a small group of respondents who answered in a previous question that they were evicted in the past year, and thus were skipped over this question.
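The coding rules above can be sketched as a small function. This is only an illustration, not the code we ran: the raw codes for “Yes” (1) and “No” (2) and the argument names are assumptions made for the example.

```python
from typing import Optional

# Assumed raw response codes for the illustration.
YES, NO = 1, 2


def code_eviction(raw: int, evicted_past_year: bool = False) -> Optional[int]:
    """Return 1 (evicted), 0 (not evicted), or None (NA)."""
    # Respondents who already reported an eviction in the past year were
    # skipped over this question, so they are coded as evicted.
    if evicted_past_year:
        return 1
    if raw == YES:
        return 1
    if raw == NO:
        return 0
    # Refused (-1), don't know (-2), and non-participants all become NA.
    return None
```

For example, `code_eviction(-2)` returns `None`, while `code_eviction(-6, evicted_past_year=True)` returns `1`.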

Distribution in the training set

Scientific motivation

In the New York Times bestseller Evicted: Poverty and Profit in the American City, Harvard sociologist and MacArthur “Genius” award winner Matthew Desmond describes fieldwork in which he spent several years living alongside tenants being evicted in low-income Milwaukee neighborhoods. Desmond helped tenants move their things into trucks, followed landlords into eviction court, and watched as children moved from school to school while their families searched for housing. Eviction literally uproots families from their homes, and it is most prevalent among the most disadvantaged urban families. Given Desmond’s qualitative account, it is plausible that eviction may have substantial negative effects on child outcomes in early adulthood.

Emerging evidence further suggests that eviction is sufficiently prevalent to warrant policy attention. Researchers at the Federal Reserve Bank of Atlanta have examined administrative records to find that 12.2 percent of rental households were evicted and forcibly displaced in 2015 in Fulton County, GA (Raymond et al. 2016). Likewise, the Milwaukee Area Renters Study found that 13 percent of private renters experienced a forced move during the 2 years referenced in a survey questionnaire (Desmond and Schollenberger 2015). If eviction creates disadvantage for children, it is sufficiently prevalent to have wide-reaching impacts.

However, untangling cause from selection is no simple task (see our blog post on causal inference and this interview with Matthew Desmond on the topic). It is easy to show that children who experience an eviction have worse outcomes later in life; it is hard to show that these outcomes are not caused by other factors that are correlated with eviction. In a quantitative study using propensity score matching methods on earlier waves of the Fragile Families and Child Wellbeing Study, Desmond and Kimbro (2015) find that eviction is associated with negative outcomes, net of obvious sources of selection bias.

We applaud the work of all the individual research teams that have placed eviction on the table as a scientific concept of interest. However, any individual research team can only adjust for a selected group of observed covariates, and results can be sensitive to the set chosen. We ask you to contribute a model for the probability that a child experiences an eviction between the age 9 and age 15 interviews of the Fragile Families and Child Wellbeing Study, given any set of the birth to age 9 characteristics you choose to include, and any statistical model you choose to employ. Together, we will produce a collaborative propensity score model that the entire scientific community can agree upon, which is not sensitive to researcher decisions. We will then interview a subset of children who are matched on the propensity score, to assess the plausibility of the conditional ignorability assumption required for causal inference (see our blog post on causal inference). If the interviews suggest that causal inference may be warranted, we will use these collaborative propensity scores to estimate the causal effect of eviction on child outcomes to be measured several years from now, when children are approximately 22 years old.

In summary, this research agenda will produce estimates of the effect of adolescent eviction on attainment during the transition to adulthood. These collaborative estimates will be robust to the decisions of individual researchers. The assumptions needed for causal inference will be validated in qualitative interviews. These steps will maximize the validity of causal inference in the absence of a randomized experiment.

To achieve these goals, we need your help. Apply to participate, build a model, and upload your contribution!

Layoff

Being laid off is a sudden and often unexpected experience with potentially detrimental consequences for one’s family.

We want to know: When a caregiver is laid off, do adolescent children suffer collateral damage?

Survey question

When children were about 15 years old, each child’s primary caregiver was asked the following question:

How we cleaned the data

Those who did not participate in the age 15 interview, as well as those who refused (-1) or didn’t know (-2), were coded as NA. Those who have never worked or have not worked since the age 9 interview (in approximately the prior 6 years) were coded as NA; these respondents are not at risk for a layoff. Those who responded “Yes” were coded 1, and those who responded “No” were coded 0.

Distribution in the training set

Scientific motivation

A steady job can provide financial security to a family. However, this security can be upset by plant closures, downsizing, and other economic shifts that lead caregivers to lose their jobs. In addition, some caregivers may be fired but report in a survey that they have been laid off. In any case, layoff of a caregiver could create dramatic disadvantages for adolescents nearing the transition to adulthood.

Social scientists worry about layoffs because precarious work is on the rise. In Good Jobs, Bad Jobs, University of North Carolina sociologist Arne L. Kalleberg outlines economic shifts that have made steady employment harder to come by in the United States over the past several decades. Gone are the days when workers could count on a single job to carry them throughout their careers – job changes and unexpected unemployment are now commonplace.

Social scientists also worry about layoffs because they may negatively influence child achievement. Sociologists Jennie E. Brand (UCLA) and Juli Simon Thomas (Harvard) have shown in an article published in the American Journal of Sociology that maternal job displacement reduces a child’s chances of high school and college completion by 3 – 5 percentage points, with even larger effects among those unlikely to experience job displacement and those whose mothers experienced job displacement while the child was an adolescent. When caregivers lose their jobs, children suffer collateral damage.

However, causal conclusions always depend on modeling assumptions. The propensity score matching methods used in the paper cited above assume that the model for the probability of job displacement is correctly specified, and that there are no unmeasured variables that affect job displacement and also directly affect child outcomes. To learn more on these assumptions, see our blog post on causal inference.

The Fragile Families Study follows a particularly disadvantaged sample of urban children, for whom we would especially like to know the effect of maternal layoff on adult outcomes. By participating, you help us to produce a collaborative propensity score model that combines the best of all the individual submissions into a single metric that is robust to the modeling decisions of individual researchers. This model will also help us target interviews at the children best positioned to lend suggestive evidence about the plausibility of the untestable conditional ignorability assumption required for causal inference. If this assumption seems credible after interviews, we will use our collaborative propensity scores to estimate the causal effect of caregiver layoff on child outcomes in early adulthood, once those outcomes are measured several years from now.

By participating, you can be part of extending our body of knowledge to provide maximally robust causal evidence with observational data about the effect of caregiver layoffs on child outcomes in a disadvantaged urban sample. Results will inform policy changes about whether support for steady caregiver employment could help disadvantaged children.

Be a part of the solution. Apply to participate, build a model, and upload your contribution.

Job training

Policymakers often propose programs to retrain the workforce to be able to contribute in a 21st century economy.

We want to know: Do job skills programs utilized by caregivers yield collateral benefits for disadvantaged children?

Survey question

When children were about 15 years old, each child’s primary caregiver was asked the following question:

How we cleaned the data

Those who did not participate in the age 15 interview, as well as those who refused (-1) or didn’t know (-2), were coded as NA. Those who responded “Yes” were coded 1, and those who responded “No” were coded 0.

Distribution in the training set

Scientific motivation

One way to raise people’s standard of living is to raise their human capital: the skills that promote productive participation in the labor force. Human capital investments are perhaps more important now than ever before given rapid globalization and computerization of the economy. Does participation in job training programs designed to build computer, language, or other skills improve the well-being of families? When caregivers participate in these programs, do children benefit indirectly?

Social scientists have long been interested in policy interventions to promote employment. This research has also been closely tied to the development of statistical methods for causal inference with observational data. In the 1970s, the National Supported Work Demonstration (NSW) randomly assigned some disadvantaged, non-employed workers to a job training program that included guaranteed employment for a short period of time. Others were randomly assigned to a control condition. The treatment led to measurable increases in earnings in subsequent years, suggesting that job training might be useful.

University of Chicago economist Robert LaLonde saw a new use for these data. Given that experimental results provided the “true” causal effect of job training on earnings, LaLonde wanted to know whether econometric techniques that statistically adjust for selection bias could recover this “true” effect in a non-experimental setting. In general, these statistical adjustments failed to recapture the “true” effect, and LaLonde’s 1986 paper became highly cited as evidence of the extreme difficulty of drawing causal inferences from observational data.

However, the story did not end there. About the same time, a pair of statisticians developed a new method for identifying causal effects: propensity score matching. In an enormously influential 1983 paper, Paul R. Rosenbaum (then of the University of Wisconsin) and Donald B. Rubin (then of the University of Chicago) showed that the average causal effect of a binary treatment on an outcome could be identified by matching treated units with untreated units who had similar probabilities of treatment given observed pre-treatment characteristics. The Rosenbaum and Rubin theorem held only in a sufficiently large sample and only when one estimated the propensity score correctly without omitting any important variables that might affect the treatment and directly affect the outcome. Despite these limitations, the key idea stuck: under certain assumptions, one can use observational data to try to re-create the type of data one would get in a randomized experiment where background characteristics no longer determine treatment assignment.

Empowered with propensity scores, two other statisticians reassessed LaLonde’s findings: could propensity score methods recover the experimental benchmark in the job training example? Rajeev H. Dehejia (then of Columbia University) and Sadek Wahba (then of Morgan Stanley) found that they could. In two highly-cited papers (paper 1 and paper 2), they demonstrated that propensity score methods came much closer to recovering the experimental truth than the econometric approaches used by LaLonde.

The saga of job training and causal inference has continued to the present day. For instance, a 2002 paper by economists Jeffrey Smith (then of the University of Maryland) and Petra Todd (University of Pennsylvania) demonstrated that propensity score methods can be highly sensitive to researcher decisions. Since then, numerous statisticians and social scientists have used the job training example to demonstrate the usefulness of new matching methods: entropy balancing (Hainmueller 2012), genetic matching (Diamond and Sekhon 2013), and the covariate balancing propensity score (Imai and Ratkovic 2014), to name a few.

Be part of the next step

Clearly there is a lot of interest in human capital formation through job training. There is also interest in methods to infer causal effects from observational data. How does the Fragile Families Challenge fit in?

A slightly different treatment

The LaLonde (1986) paper and subsequent studies focused on an intensive job training program that connected non-employed individuals with jobs. The “treatment” variable which you will predict is much milder: participation in any classes to improve job skills, such as computer training or literacy classes. Respondents who enroll in these classes are not necessarily non-employed.

A robust propensity score model

One piece of conventional wisdom about propensity score methods is that one should be careful about selecting the pretreatment variables to include in the model, and one must model their relationship to the treatment variable appropriately. This is where you can help! Together we will build a highly robust community model for the probability of job training. This community model will take all of our best ideas and create one product on which we can all agree.

Specifying models before outcomes occur

A second piece of conventional wisdom of propensity score modeling is that it allows one to conduct all modeling and matching before even looking at the outcome variable. In our case, the ultimate outcome variables are not yet measured: we will examine the effect of caregiver job training on child outcomes in early adulthood. These outcomes will be measured several years from now, long after we lock in our community propensity score model.

Evaluating assumptions

All covariate-adjustment methods to draw causal inferences from observational data rely on the assumption of conditional ignorability (for more about this assumption, see our blog post about causal inference). Through targeted interviews with caregivers, we can provide suggestive evidence as to whether the conditional ignorability assumption holds.

You can help

Be a part of the next step in observational causal inference to evaluate the effect of job training programs. Apply to participate, build a model, and upload your contribution.

Blog posts

In addition to the general Fragile Families documentation, the following blog posts provide more details about the data and the scientific goals of the project.

Weekly office hours

From 3:30-4:30pm Eastern Daylight Time every Wednesday, one of us will be at the computer to answer your questions. At those times, please video call us via Google Hangout at fragilefamilieschallenge@gmail.com.

For more immediate feedback from the full community of users, post on our discussion forum for the Fragile Families Challenge.

For concerns you do not wish to share with the entire community, you can also contact us privately.

Discovering unmeasured factors

Beating the odds

Despite coming from disadvantaged backgrounds, some kids manage to “beat the odds” and achieve unexpectedly positive outcomes. Meanwhile, other kids who seem on track sometimes struggle unexpectedly. Policymakers would like to know what variables are associated with “beating the odds” since this could generate new theories about how to help future generations of disadvantaged children.

Once we combine all of the submissions to the Fragile Families Challenge into one collaborative guess for how children will be doing on each outcome at age 15, we will identify a small number of children doing much better than expected (“beating the odds”), and another set who are doing much worse than expected (“struggling unexpectedly”). By interviewing these sets of children, we will be well-positioned to learn what factors were associated with who ended up in each group.
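The selection step described above can be sketched as a ranking of residuals: observed outcome minus the collaborative prediction. This is a rough illustration under assumptions of our own for the example (the function name, the dictionary inputs, and an outcome for which higher values are better), not the procedure we will necessarily follow.

```python
def flag_unexpected(observed, predicted, k=2):
    """Return (beating_the_odds, struggling) as lists of child ids.

    observed / predicted: dicts mapping a child id to an outcome value,
    where higher values mean better outcomes (e.g. GPA).
    k: number of children to flag in each group (use 2 * k <= number of
       children so the two groups cannot overlap).
    """
    residuals = {cid: observed[cid] - predicted[cid] for cid in observed}
    ranked = sorted(residuals, key=residuals.get)  # most negative first
    struggling = ranked[:k]                        # far below expectation
    beating = ranked[-k:][::-1]                    # far above expectation
    return beating, struggling


observed = {"a": 3.5, "b": 2.0, "c": 3.0, "d": 2.9}
predicted = {"a": 2.5, "b": 3.0, "c": 3.0, "d": 3.0}
beating, struggling = flag_unexpected(observed, predicted, k=1)
```

In this toy example, child "a" is flagged as beating the odds and child "b" as struggling unexpectedly.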

What we learn in these interviews will affect the questions asked in future waves of the Fragile Families Study, and possibly other studies like it. By combining quantitative models with inductive interviews, the Fragile Families Challenge offers a new way to improve surveys in the future and expand the range of social science theories. In the remainder of this blog, we discuss current approaches to survey design and the potential contribution of the Fragile Families Challenge.

Deductive survey design: Evaluating theories

Social scientists often design surveys using deductive approaches based on theoretical perspectives. For instance, economists theorize about how one’s employment depends on the hypothetical wage offer (often called a “reservation wage”) one would have to be given before one would leave other unpaid options behind and opt into paid labor. Motivated by this theoretical perspective, Fragile Families and other surveys have incorporated questions like: “What would the hourly wage have to be in order for you to take a job?”

However, even the best theoretically-informed social science measures perform poorly at the task of predicting outcomes. R-squared, a measure of a model’s predictive validity, often ranges from 0.1 to 0.3 in published social science papers. Simply put, a huge portion of the variance in outcomes we care about is unexplained by the predictors social scientists have invented and put their faith in.

Inductive interviews: A source of new hypotheses

How can we be missing so much? Part of the problem might be that academics who propose these theoretical perspectives often spend their lives far from the context in which the data are actually collected. An alternative, inductive approach is to conduct open-ended interviews with interesting cases and allow the theory to emerge from the data. This approach is often used in ethnographic and other qualitative work, and points researchers toward alternative perspectives they never would have considered on their own.

Inductive approaches have their drawbacks: researchers might develop a theory that works well for some children, but does not generalize to other cases. Likewise, the unmeasured factors we discover will not necessarily be causal. However, inductive interviews will generate hypotheses that can be later evaluated using deductive approaches in new datasets, and finally evaluated with randomized controlled trials.

An ideal combination: Cycling between the two

To our knowledge, the Fragile Families Challenge is the first attempt to cycle between these two approaches. The study was designed with deductive approaches: researchers asked questions based on social science theories about the reproduction of disadvantage. However, we can use qualitative interviews to inductively learn new variables that ought to be collected. Finally, we will incorporate these variables in future waves of data collection to deductively evaluate theories generated in the interviews, using out-of-sample data.

By participating in the Fragile Families Challenge, you are part of a scientific endeavor to create the surveys of the future.

Missing data

This blog post

  1. discusses how missing data is coded in the Fragile Families study
  2. offers a brief theoretical introduction to the statistical challenges of missing data
  3. links to software that implements one solution: multiple imputation

Of course, you can use any strategy you want to deal with missing values: multiple imputation is just one strategy among many.

Missing data in the Fragile Families study

Missing data is a challenge in almost all social science research. It generally comes in two forms:

  1. Item non-response: Respondents simply refuse to answer a survey question.
  2. Survey non-response: Respondents cannot be located or refuse to answer any questions in an entire wave of the survey.

While the first problem is common in any dataset, the second is especially prevalent in panel studies like Fragile Families, in which the survey is composed of interviews conducted at various child ages over the course of 15 years.

While the survey documentation details the codes for each variable, a few global rules summarize the way missing values are coded in the data. The most common responses are bolded.

  • -9 Not in wave – Did not participate in survey/data collection component
  • -8 Out of range – Response not possible; rarely used
  • -7 Not applicable (also -10/-14) – Rarely used for survey questions
  • -6 Valid skip – Intentionally not asked question; question does not apply to respondent or response known based on prior information.
  • -5 Not asked “Invalid skip” – Respondent not asked question in the version of the survey they received.
  • -3 Missing – Data is missing due to some other reason; rarely used
  • -2 Don’t know – Respondent asked question; Responded “Don’t Know”.
  • -1 Refuse – Respondent asked question; Refused to answer question

When responses are coded -6, you should look at the survey questionnaire to determine the skip pattern. What did these respondents tell us in prior questions that caused the interviewer to skip this question? You can then decide the correct way to code these values given your modeling approach.

When responses are coded -9, you should be aware that many questions will be missing for this respondent because they missed an entire wave of the survey.

For most other categories, an algorithmic solution as described below may be reasonable.
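As a starting point, the global codes listed above can be separated from substantive values before any modeling. The sketch below is one possible approach, not an official tool: the function name is hypothetical, and it assumes that the variable in question has no legitimate negative values, which you should check in the documentation for each variable.

```python
# Global missing-data codes, as listed in this post.
MISSING_CODES = {
    -9: "Not in wave",
    -8: "Out of range",
    -7: "Not applicable",
    -10: "Not applicable",
    -14: "Not applicable",
    -6: "Valid skip",
    -5: "Not asked",
    -3: "Missing",
    -2: "Don't know",
    -1: "Refuse",
}


def split_missing(value):
    """Return (clean_value, reason).

    clean_value is None when `value` is one of the global missing codes;
    reason is None when `value` is a substantive response.
    """
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    return value, None
```

Keeping the reason alongside the cleaned value makes it easy to treat -6 and -9 differently from the other codes, as suggested above.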

Theoretical issues with missing data

Before analyzing data with missing values, researchers must make assumptions about how some data came to be missing. One of the most common assumptions is the assumption that data are missing at random. For this assumption to hold, the pattern of missingness must be a function of the other variables in the dataset, and not a function of any unobserved variables once those observed are taken into account.

For instance, suppose children born to unmarried parents are less likely to be interviewed at age 9 than those born to married parents. Since the parents’ marital status at birth is a variable observed in the dataset, it is possible to adjust statistically for this problem. Suppose, on the other hand, that some children miss the age 9 interview because they suddenly had to leave town to attend the funeral of their second cousin once removed. This variable is not in the dataset, so no statistical adjustment can fully account for this kind of missingness.

For a full theoretical treatment, we recommend

One solution: Imputation

Once we assume that data are missing at random, a valid approach to dealing with the missing data is imputation. This is a procedure whereby the researcher estimates the association between all of the variables in the model, then fills in (“imputes”) reasonable guesses for the values of the missing variables.

The simplest version of imputation is known as single imputation. One would use an algorithm to guess a value for each missing observation. This produces one complete dataset, which can be analyzed like any other. However, single imputation fails to account for our uncertainty about the true values of the missing cases.

Multiple imputation is a procedure that produces several data sets (often in the range of 5, 10, or 30), with slightly different imputed values for the missing observations in each data set. Differences across the datasets capture our uncertainty about the missing values. One can then estimate a model on each imputed dataset, then combine estimates across the imputed datasets using a procedure known as Rubin’s rules.
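To make the combining step concrete, here is a minimal sketch of Rubin’s rules for a single coefficient. The function name and inputs are our own for the example; packages such as mitools handle this for you and also provide degrees of freedom, which this sketch omits.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least 2 imputations")
    pooled = mean(estimates)          # average of the point estimates
    within = mean(variances)          # average sampling variance
    between = variance(estimates)     # sample variance, divides by m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what single imputation leaves out: when the imputed datasets disagree, the total variance grows accordingly.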

Ideally, one would conduct multiple imputation on a dataset with all of the observed variables. However, this can become computationally intractable in a dataset like Fragile Families with thousands of variables. In practice, researchers often select the variables to be included in their model, restrict the data to only those variables, and then multiply impute missing values in this subset.

Implementing multiple imputation

There are many software packages to implement multiple imputation. A few are listed below.

In R, we recommend Amelia (package home, video introduction, vignette, documentation) or MICE (package home, introductory paper, documentation). Depending on your implementation, you may also need mitools (package home, vignette, documentation) or Zelig (website) to combine estimates from several imputed datasets.

In Stata, we recommend the mi set of functions as described in this tutorial.

In SPSS, we recommend this tutorial.

In SAS, we recommend this tutorial.

This set is by no means exhaustive. One curated list of software implementations is available here.

Evaluating submissions

We will evaluate submissions based on predictive validity, measured in the held-out test data by mean squared error loss for continuous outcomes and Brier loss for binary outcomes.
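For intuition, both loss functions can be written in a few lines. This is a sketch for checking your own predictions on data you hold out yourself; the function names are ours, and it is not the leaderboard’s scoring code.

```python
def mean_squared_error(truth, predictions):
    """Average squared difference between true and predicted values."""
    if len(truth) != len(predictions) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(truth, predictions)) / len(truth)


def brier_loss(truth, predicted_prob):
    """Mean squared error between 0/1 outcomes and predicted probabilities."""
    if any(t not in (0, 1) for t in truth):
        raise ValueError("binary outcomes must be coded 0 or 1")
    if any(not 0 <= p <= 1 for p in predicted_prob):
        raise ValueError("predictions must be probabilities in [0, 1]")
    return mean_squared_error(truth, predicted_prob)


gpa_loss = mean_squared_error([3.0, 2.5, 3.5], [3.0, 3.0, 3.0])
eviction_loss = brier_loss([1, 0], [0.8, 0.4])
```

Lower is better for both. Predicting 3.0 for everyone in the toy GPA example is the same idea as the baseline model: one value for all observations.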

A leaderboard will rank submissions according to these criteria, using a set of held-out data. After the challenge closes, we will produce a finalized ranking of submissions based on a separate set of withheld true outcome data.

Each of the 6 outcomes will be evaluated and ranked independently – feel free to focus on predicting one outcome well!

What does this mean for you?

You should produce a submission that performs well out of sample. Mean squared error is a function of both bias and variance. A linear regression model with lots of covariates is an unbiased predictor, but it might overfit the data and produce predictions that are highly sensitive to the sample used for training. Computer scientists often refer to this problem as the challenge of distinguishing the signal from the noise; you want to pick up on the signal in the training data without picking up on the noise.
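The statement that mean squared error depends on both bias and variance can be written out explicitly. The decomposition below is the standard one, under assumptions that we add for the illustration: the outcome is generated as y = f(x) + ε, with noise ε that has mean zero and variance σ², and the expectation is taken over training samples and noise.

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\left(\hat{f}(x)\right)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Adding covariates tends to shrink the first term and inflate the second, so a model with zero bias can still have a larger total error than a simpler, slightly biased one.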

An overly simple model will fail to pick up on meaningful signal. An overly complex model will pick up too much noise. Somewhere in the middle is a perfect balance – you can help us find it!

Causal inference

The Fragile Families Challenge presents a unique opportunity to probe the assumptions required for causal inference with observational data. This post introduces these assumptions and highlights the contribution of the Fragile Families Challenge to this scientific question.

 

Causal inference: The problem

Social scientists and policymakers often wish to use empirical data to infer the causal effect of a binary treatment D on an outcome Y. The causal effect for each respondent is the potential outcome that the respondent would take under treatment (denoted Y(1)) minus the potential outcome that the respondent would take under control (denoted Y(0)). However, we immediately run into the fundamental problem of causal inference: each respondent is observed either under the treatment condition or under the control condition, never both, so the individual causal effect is never directly observed.

The solution: Assumptions of ignorability

The gold standard for resolving this problem is a randomized experiment. By randomly assigning treatment, researchers can ensure that the potential outcomes are independent of treatment assignment, so that the average difference in outcomes between the two groups can only be attributable to treatment. This assumption is formally called ignorability.

Ignorability: {Y(0),Y(1)} ⊥ D

Because large-scale experiments are costly, social scientists frequently draw causal inferences from observational data based on a simplifying assumption of conditional ignorability.

Conditional ignorability: {Y(0),Y(1)} ⊥ D | X

Given a set of covariates X, conditional ignorability states that treatment assignment D is independent of the potential outcomes that would be realized under treatment Y(1) and control Y(0). In other words, two observations with the same set of covariates X but with different treatment statuses can be compared to estimate the causal effect of the treatment for these observations.
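The comparison described in the previous paragraph can be stated as an identification result. Besides conditional ignorability, it relies on two conditions that we add here as assumptions: the observed outcome equals the potential outcome under the treatment actually received, and both treatment statuses occur with positive probability at each value of X.

```latex
% Observed outcome: Y = D \, Y(1) + (1 - D) \, Y(0)
% Overlap:          0 < P(D = 1 \mid X = x) < 1
\mathbb{E}\left[Y(1) - Y(0) \mid X = x\right]
  = \mathbb{E}\left[Y \mid D = 1, X = x\right]
  - \mathbb{E}\left[Y \mid D = 0, X = x\right]
```

The left side involves potential outcomes that are never jointly observed; the right side involves only quantities that can be estimated from the data.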

 

Assessing the credibility of the ignorability assumption

Conditional ignorability is an enormous assumption, yet it is what the vast majority of social science findings rely on. By writing the problem in a Directed Acyclic Graph (DAG, Pearl 2000), we can make the assumption more transparent.

X represents pre-treatment confounders that affect both the treatment and the outcome. Here the treatment, written D above, is labeled T. Though it is not the only way to do so, researchers often condition on X by estimating the probability of treatment given X, denoted P(T | X). Once we account for the differential probability of a treatment by the background covariates (through regression, matching, or some other method), we say we have blocked the noncausal backdoor paths connecting T and Y through X.

The key assumption in the left panel has to do with Ut. We assume that all unobserved variables that affect the treatment (Ut) have no effect on the outcome Y, except through T. This is depicted graphically by the dashed line from Ut to Y, which we must assume does not exist for causal inferences to be valid.

Researchers often argue that conditional ignorability is a reasonable assumption if the set of predictors included in X is extensive and detailed. The Fragile Families Challenge is an ideal setting in which to test the credibility of this assumption: we have a very detailed set of predictor variables X collected from birth through age 9, which occur temporally prior to treatments reported at age 15.

Nevertheless, the assumption of conditional ignorability is untestable. Interviews may provide some insight into the credibility of this assumption.

 

Goal of the Fragile Families Challenge: Targeted interviews

Through targeted interviews with particularly informative children, we might be able to learn something about the plausibility of the conditional ignorability assumption.

One of the binary variables in the Fragile Families Challenge is whether a child was evicted from his or her home. We will treat this variable as T. We want to know the causal effect of eviction on a child’s chance of graduating from high school (Y). In the Fragile Families Challenge, the set of observed covariates X is all 12,000+ predictor variables included in the Fragile Families Challenge data file.

Based on the ensemble model from the Fragile Families Challenge, we will identify 20 children who were evicted, and 20 similar children who had similar predicted probabilities of eviction but were not evicted. We will interview these children to find out why they were evicted.
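One way such pairs could be chosen is greedy nearest-neighbor matching on the predicted probability. The sketch below is only an illustration of that idea: the function name, the caliper of 0.05, and the input format are assumptions for the example, and the actual selection procedure may differ.

```python
def match_pairs(scores, evicted, n_pairs=20, caliper=0.05):
    """Greedy 1:1 matching of evicted to non-evicted children.

    scores:  dict mapping child id -> predicted probability of eviction
    evicted: set of ids of children who were evicted
    caliper: largest allowed gap in predicted probability within a pair
    Returns up to n_pairs tuples (evicted_id, comparison_id, gap),
    closest pairs first; each child is used at most once.
    """
    comparison = [cid for cid in scores if cid not in evicted]
    candidates = sorted(
        (abs(scores[t] - scores[c]), t, c)
        for t in evicted if t in scores
        for c in comparison
    )
    used, pairs = set(), []
    for gap, t, c in candidates:
        if gap > caliper or len(pairs) == n_pairs:
            break  # remaining candidates are farther apart, or we are done
        if t in used or c in used:
            continue
        used.update((t, c))
        pairs.append((t, c, gap))
    return pairs


scores = {"a": 0.30, "b": 0.31, "c": 0.70, "d": 0.72, "e": 0.10}
pairs = match_pairs(scores, evicted={"a", "c"})
```

In the toy data, "a" is paired with "b" and "c" with "d", while "e" is left unmatched because no evicted child has a similar predicted probability.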

 

Potential interviews in support of conditional ignorability:

Suppose we find that children were evicted because their landlords were ready to retire and wanted to get out of the housing market. Those who were not evicted had younger landlords. It might be plausible that the age of one’s landlord is an example of Ut: a variable that affects eviction but has no effect on high school graduation except through eviction. While this would not prove the conditional ignorability assumption, the assumption might seem reasonable in this case.

 

Potential interviews that discredit conditional ignorability:

Suppose instead that we find a different story. Gang activity increased in the neighborhoods of some families, escalating to the point that landlords decided to get out of the business and evict all of their tenants. Other families lived in neighborhoods with no gang activity, and they were not evicted. In addition to its effect on eviction, it is likely that gang activity would alter the chances of high school graduation in other ways, such as by making students feel unsafe at school. In this example, gang activity plays the role of Uty and would violate the assumption of conditional ignorability.

 

Summary

Because costs prohibit randomized experiments to evaluate all potential treatments of interest to social scientists, scholars frequently rely on the assumption of conditional ignorability to draw causal claims from observational data. This is a strong and untestable assumption. The Fragile Families Challenge is a setting in which the assumption may be plausible, due to the richness of the covariate set X, which includes over 12,000 pre-treatment variables chosen for their potentially important ramifications for child development.

By interviewing a targeted set of children chosen by ensemble predictions of the treatment variables, we will shed light on the credibility of the ignorability assumption.