This blog post
- discusses how missing data is coded in the Fragile Families study
- offers a brief theoretical introduction to the statistical challenges of missing data
- links to software that implements one solution: multiple imputation
Of course, you can use any strategy you want to deal with missing values: multiple imputation is just one strategy among many.
Missing data in the Fragile Families study
Missing data is a challenge in almost all social science research. It generally comes in two forms:
- Item non-response: Respondents simply refuse to answer a survey question.
- Survey non-response: Respondents cannot be located or refuse to answer any questions in an entire wave of the survey.
While the first problem is common in any dataset, the second is especially prevalent in panel studies like Fragile Families, which consists of interviews conducted at several child ages over the course of 15 years.
While the survey documentation details the codes for each variable, a few global rules summarize how missing values are coded in the data.
- -9 Not in wave – Did not participate in survey/data collection component
- -8 Out of range – Response not possible; rarely used
- -7 Not applicable (also -10/-14) – Rarely used for survey questions
- -6 Valid skip – Question intentionally not asked; it does not apply to the respondent, or the response is known from prior information
- -5 Not asked ("invalid skip") – Respondent was not asked the question in the version of the survey they received
- -3 Missing – Data are missing for some other reason; rarely used
- -2 Don’t know – Respondent was asked the question and answered “don’t know”
- -1 Refuse – Respondent was asked the question and refused to answer
When responses are coded -6, you should look at the survey questionnaire to determine the skip pattern. What did these respondents tell us in prior questions that caused the interviewer to skip this question? You can then decide the correct way to code these values given your modeling approach.
When responses are coded -9, you should be aware that many questions will be missing for this respondent because they missed an entire wave of the survey.
For most other categories, an algorithmic solution as described below may be reasonable.
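As a starting point for an algorithmic solution, one can recode the reserved negative values to a missing-data marker before analysis. Below is a minimal Python sketch; the variable names and row values are hypothetical, and in real use the -6 valid skips should first be examined against the questionnaire rather than blanket-recoded.

```python
# Reserved negative codes used for missing data in the study
# (-10 and -14 are alternate "not applicable" codes).
MISSING_CODES = {-9, -8, -7, -6, -5, -3, -2, -1, -10, -14}

def recode_missing(value):
    """Return None for any reserved missing-data code; pass other values through."""
    return None if value in MISSING_CODES else value

# Hypothetical respondent record for illustration
row = {"child_age": 9, "income": -2, "in_wave": -9}
clean = {k: recode_missing(v) for k, v in row.items()}
# clean == {"child_age": 9, "income": None, "in_wave": None}
```

The same dictionary of codes can be passed to, e.g., a CSV reader's missing-value option, but the set of codes you treat as missing should depend on how you decide to handle valid skips.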
Theoretical issues with missing data
Before analyzing data with missing values, researchers must make assumptions about how the data came to be missing. The most common assumption is that data are missing at random: the pattern of missingness is a function only of other observed variables in the dataset, not of any unobserved variables once the observed ones are taken into account.
For instance, suppose children born to unmarried parents are less likely to be interviewed at age 9 than those born to married parents. Since the parents’ marital status at birth is a variable observed in the dataset, it is possible to adjust statistically for this problem. Suppose, on the other hand, that some children miss the age 9 interview because they suddenly had to leave town to attend the funeral of their second cousin once removed. This variable is not in the dataset, so no statistical adjustment can fully account for this kind of missingness.
For a full theoretical treatment, we recommend
One solution: Imputation
Once we assume that data are missing at random, a valid approach to dealing with the missing data is imputation: a procedure whereby the researcher estimates the associations among the variables in the model, then fills in (“imputes”) reasonable guesses for the missing values.
The simplest version is single imputation: an algorithm fills in one guess for each missing observation, producing a single complete dataset that can be analyzed like any other. However, single imputation fails to account for our uncertainty about the true values of the missing cases.
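A toy example of single imputation is mean imputation, sketched below in Python (the `mean_impute` helper and the income values are illustrative, not part of any package):

```python
def mean_impute(values):
    """Single imputation: replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

incomes = [10.0, None, 30.0, 20.0]
print(mean_impute(incomes))  # [10.0, 20.0, 30.0, 20.0]
```

Every missing case gets the same value, so the filled-in dataset understates variability: this is exactly the uncertainty problem that motivates multiple imputation.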
Multiple imputation instead produces several datasets (often 5 to 30), each with slightly different imputed values for the missing observations. Differences across the datasets capture our uncertainty about the missing values. One then estimates the model on each imputed dataset and combines the estimates across datasets using a procedure known as Rubin’s rules.
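The combining step can be sketched in a few lines. Under Rubin’s rules, the pooled point estimate is the average of the per-dataset estimates, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. The `rubins_rules` helper and the numbers below are illustrative only:

```python
from math import sqrt
from statistics import mean, variance

def rubins_rules(estimates, variances):
    """Pool m point estimates and their variances from m imputed datasets."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (sample variance)
    t = w + (1 + 1 / m) * b        # total variance
    return q_bar, sqrt(t)          # pooled estimate and its standard error

# Hypothetical coefficient estimates and variances from m = 3 imputed datasets
est, se = rubins_rules([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
```

The software listed below performs this pooling for you; the sketch is only meant to show why the pooled standard error is larger than what any single imputed dataset would report.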
Ideally, one would conduct multiple imputation using all of the observed variables. In a dataset like Fragile Families, with thousands of variables, this can become computationally intractable. In practice, researchers often select the variables to be included in their model, restrict the data to those variables, and then multiply impute missing values within this subset.
Implementing multiple imputation
There are many software packages to implement multiple imputation. A few are listed below.
In R, we recommend Amelia (package home, video introduction, vignette, documentation) or MICE (package home, introductory paper, documentation). Depending on your implementation, you may also need mitools (package home, vignette, documentation) or Zelig (website) to combine estimates from several imputed datasets.
In Stata, we recommend the mi set of functions as described in this tutorial.
In SPSS, we recommend this tutorial.
In SAS, we recommend this tutorial.
This set is by no means exhaustive. One curated list of software implementations is available here.