## Causal inference

The Fragile Families Challenge presents a unique opportunity to probe the assumptions required for causal inference with observational data. This post introduces these assumptions and highlights the contribution of the Fragile Families Challenge to this scientific question.

**Causal inference: The problem**

Social scientists and policymakers often wish to use empirical data to infer the causal effect of a binary treatment *D* on an outcome *Y*. The causal effect for each respondent is the potential outcome that the respondent would realize under treatment (denoted Y(1)) minus the potential outcome that the respondent would realize under control (denoted Y(0)). However, we immediately run into the **fundamental problem of causal inference**: each respondent is observed either under the treatment condition or under the control condition, never both, so the individual-level causal effect is never directly observed.
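To make the problem concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the five units, the normal distribution for Y(0), and the constant effect of 2 are assumptions chosen for illustration, not features of the Fragile Families data. Inside a simulation we can record both potential outcomes, which is exactly what real data never allow.

```python
import random

random.seed(0)

# Hypothetical simulated units. Only in a simulation can we see both
# potential outcomes for the same unit.
units = []
for _ in range(5):
    y0 = random.gauss(0.0, 1.0)       # outcome under control, Y(0)
    y1 = y0 + 2.0                     # outcome under treatment, Y(1); effect of 2 is an assumption
    d = random.randint(0, 1)          # treatment indicator D
    y_obs = y1 if d == 1 else y0      # only one potential outcome is revealed
    units.append({"y0": y0, "y1": y1, "d": d, "y_obs": y_obs})

for u in units:
    true_effect = u["y1"] - u["y0"]   # knowable only inside the simulation
    missing = "Y(0)" if u["d"] == 1 else "Y(1)"
    print(f"D={u['d']}  observed Y={u['y_obs']:.2f}  "
          f"unobserved: {missing}  true effect={true_effect:.2f}")
```

Each printed row has one observed outcome and one missing potential outcome, so the unit-level effect could not be computed from the observed columns alone.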

**The solution: Assumptions of ignorability**

The gold standard for resolving this problem is a **randomized experiment**. By randomly assigning treatment, researchers can ensure that the potential outcomes are independent of treatment assignment, so that the average difference in outcomes between the two groups can be attributed to the treatment, apart from chance variation. This independence condition is formally called **ignorability**.

Ignorability: {Y(0),Y(1)} ⊥ D
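The following sketch illustrates why ignorability is so useful. It assumes a hypothetical population with a constant treatment effect of 2 and assigns treatment by a fair coin flip, so that D is independent of the potential outcomes by construction. Under those assumptions the simple difference in group means should land close to the true effect.

```python
import random
from statistics import mean

random.seed(1)
n = 20000
true_effect = 2.0   # assumed constant effect, for illustration only

treated, control = [], []
for _ in range(n):
    y0 = random.gauss(0.0, 1.0)      # potential outcome under control
    y1 = y0 + true_effect            # potential outcome under treatment
    d = random.random() < 0.5        # coin flip: independent of (y0, y1)
    if d:
        treated.append(y1)           # treated units reveal Y(1)
    else:
        control.append(y0)           # control units reveal Y(0)

estimate = mean(treated) - mean(control)
print(f"difference in means: {estimate:.3f} (true effect: {true_effect})")
```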

Because large-scale experiments are costly, social scientists frequently draw causal inferences from observational data based on a simplifying assumption of **conditional ignorability**.

Conditional ignorability: {Y(0),Y(1)} ⊥ D | X

Given a set of covariates *X*, conditional ignorability states that treatment assignment *D* is independent of the potential outcomes that would be realized under treatment *Y(1)* and control *Y(0)*. In other words, two observations with the same values of the covariates *X* but with different treatment statuses can be compared to estimate the causal effect of the treatment for these observations.
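As a rough illustration of comparing observations with the same covariates, the sketch below simulates a single binary covariate X that raises both the chance of treatment and the outcome. All numbers (the treatment probabilities of 0.8 and 0.2, the outcome shift of 3, the effect of 2) are assumptions made up for the example. The naive difference in means mixes the treatment effect with the effect of X, while comparing within levels of X and averaging recovers something close to the assumed effect.

```python
import random
from statistics import mean

random.seed(2)
n = 40000
true_effect = 2.0   # assumed constant effect

rows = []
for _ in range(n):
    x = random.random() < 0.5                   # binary confounder X
    p_treat = 0.8 if x else 0.2                 # X raises the chance of treatment
    d = random.random() < p_treat
    y0 = random.gauss(3.0 if x else 0.0, 1.0)   # X also raises the outcome
    y = y0 + true_effect if d else y0           # observed outcome
    rows.append((x, d, y))

def diff_in_means(data):
    """Mean outcome of treated units minus mean outcome of control units."""
    t = [y for _, d, y in data if d]
    c = [y for _, d, y in data if not d]
    return mean(t) - mean(c)

naive = diff_in_means(rows)

# Compare treated and control units within each level of X, then average
# the within-stratum differences weighted by stratum size.
adjusted = 0.0
for level in (False, True):
    stratum = [r for r in rows if r[0] == level]
    adjusted += diff_in_means(stratum) * len(stratum) / n

print(f"naive: {naive:.2f}  adjusted for X: {adjusted:.2f}  truth: {true_effect}")
```

With 12,000+ covariates, exact stratification like this is not feasible, which is one reason researchers turn to model-based summaries of *X*.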

**Assessing the credibility of the ignorability assumption**

Conditional ignorability is an enormous assumption, yet it is what the vast majority of social science findings rely on. By drawing the problem as a directed acyclic graph (DAG; Pearl 2000), we can make the assumption more transparent.

*X* represents pre-treatment **confounders** that affect both the treatment and the outcome. In the graph, the treatment is written *T* (the same variable as *D* above). Though it is not the only way to do so, researchers often condition on *X* by estimating the probability of treatment given *X*, denoted *P*(*T* | *X*) and commonly called the propensity score. Once we account for the differential probability of treatment by the background covariates (through regression, matching, or some other method), we say we have blocked the noncausal backdoor paths connecting *T* and *Y* through *X*.
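One way to use an estimate of *P*(*T* | *X*) is inverse probability weighting, which falls under "some other method" above and is shown here only as an example, not as the approach this post commits to. The sketch assumes a hypothetical three-level covariate and a constant effect of 2, and estimates the probability of treatment as the share treated at each level of X.

```python
import random
from collections import defaultdict

random.seed(3)
n = 40000
true_effect = 2.0   # assumed constant effect

rows = []
for _ in range(n):
    x = random.randint(0, 2)                     # covariate X with three levels
    t = random.random() < (0.2, 0.5, 0.8)[x]     # treatment more likely at higher X
    y = random.gauss(float(x), 1.0) + (true_effect if t else 0.0)
    rows.append((x, t, y))

# Estimate P(T = 1 | X = x) as the share of treated units at each level of X.
counts = defaultdict(lambda: [0, 0])             # x -> [number treated, number of units]
for x, t, _ in rows:
    counts[x][0] += int(t)
    counts[x][1] += 1
propensity = {x: treated / total for x, (treated, total) in counts.items()}

# Weight each unit by the inverse of the estimated probability of the
# treatment status it actually received.
treated_sum = sum(y / propensity[x] for x, t, y in rows if t)
control_sum = sum(y / (1.0 - propensity[x]) for x, t, y in rows if not t)
ipw_estimate = (treated_sum - control_sum) / n

print({x: round(p, 2) for x, p in sorted(propensity.items())})
print(f"IPW estimate: {ipw_estimate:.2f}  truth: {true_effect}")
```

In a real application the probability of treatment would come from a fitted model rather than from cell shares, and the estimate is only as credible as the assumption that *X* captures every confounder.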

The key assumption in the left panel has to do with *Ut*. We assume that all unobserved variables that affect the treatment (*Ut*) have no effect on the outcome *Y*, except through *T*. This is depicted graphically by the dashed line from *Ut* to *Y*, which we must assume does not exist for causal inferences to be valid.

Researchers often argue that conditional ignorability is a reasonable assumption if the set of predictors included in *X* is extensive and detailed. The Fragile Families Challenge is an ideal setting in which to assess the credibility of this assumption: we have a very detailed set of predictor variables *X* collected from birth through age 9, which are measured before the treatments reported at age 15.

Nevertheless, the assumption of conditional ignorability cannot be tested with the observed data alone. Interviews may provide some insight into the credibility of this assumption.

**Goal of the Fragile Families Challenge: Targeted interviews**

Through targeted interviews with particularly informative children, we might be able to learn something about the plausibility of the conditional ignorability assumption.

One of the binary variables in the Fragile Families Challenge is whether a child was evicted from his or her home. We will treat this variable as T. We want to know the causal effect of eviction on a child’s chance of graduating from high school (Y). In the Fragile Families Challenge, the set of observed covariates X is all 12,000+ predictor variables included in the Fragile Families Challenge data file.

Based on the ensemble model from the Fragile Families Challenge, we will identify 20 children who were evicted and 20 children who had similar predicted probabilities of eviction but were not evicted. We will interview these children to find out why they were, or were not, evicted.
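The selection step could look something like the sketch below. This is not the procedure the Challenge organizers will use; it is one plausible way to pair children on predicted probability. The data are simulated, the tuple layout `(child_id, predicted_probability, evicted)` is a hypothetical format, the 20 evicted children are drawn at random, and each is paired with the not-evicted child whose predicted probability is closest (greedy nearest-neighbor matching without replacement).

```python
import random

random.seed(4)

# Hypothetical input: (child_id, predicted probability of eviction, evicted?).
# In practice the predictions would come from the ensemble model.
children = []
for i in range(1000):
    p = random.betavariate(2, 8)
    children.append((i, p, random.random() < p))

evicted = [c for c in children if c[2]]
not_evicted = [c for c in children if not c[2]]

# Assumption: pick the 20 evicted interviewees at random.
sample = random.sample(evicted, 20)

pairs = []
available = list(not_evicted)
for child in sample:
    # Closest not-evicted child in predicted probability of eviction.
    match = min(available, key=lambda c: abs(c[1] - child[1]))
    available.remove(match)          # each comparison child is used once
    pairs.append((child, match))

for child, match in pairs[:5]:
    print(f"evicted id={child[0]} p={child[1]:.2f}  |  "
          f"not evicted id={match[0]} p={match[1]:.2f}")
```

Pairs with nearly equal predicted probabilities are the informative ones: the observed covariates cannot explain why one child was evicted and the other was not, so the interviews can probe what did.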

**Potential interviews in support of conditional ignorability:**

Suppose we find that children were evicted because their landlords were ready to retire and wanted to get out of the housing market. Those who were not evicted had younger landlords. It might be plausible that the age of one’s landlord is an example of Ut: a variable that affects eviction but has no effect on high school graduation except through eviction. While this would not prove the conditional ignorability assumption, the assumption might seem reasonable in this case.

**Potential interviews that discredit conditional ignorability:**

Suppose instead that we find a different story. Gang activity increased in the neighborhoods of some families, escalating to the point that landlords decided to get out of the business and evict all of their tenants. Other families lived in neighborhoods with no gang activity, and they were not evicted. In addition to its effect on eviction, it is likely that gang activity would alter the chances of high school graduation in other ways, such as by making students feel unsafe at school. In this example, gang activity plays the role of *Uty* and would violate the assumption of conditional ignorability.
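The contrast between the two interview scenarios can be sketched with a stylized simulation. The setup is entirely hypothetical: a continuous outcome stands in for graduation, the treatment has an assumed effect of -1, and a single unobserved binary variable raises the chance of treatment. In the first run that variable affects only the treatment (like landlord age); in the second it also lowers the outcome directly (like gang activity).

```python
import random
from statistics import mean

def simulate(u_affects_outcome, n=40000, true_effect=-1.0, seed=5):
    """Difference in mean outcomes between treated and control units."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        u = rng.random() < 0.5                    # unobserved binary variable
        t = rng.random() < (0.7 if u else 0.3)    # u raises the chance of treatment
        y = rng.gauss(0.0, 1.0)
        if u and u_affects_outcome:
            y -= 1.5                              # direct path from u to the outcome
        if t:
            y += true_effect                      # assumed effect of treatment
        (treated if t else control).append(y)
    return mean(treated) - mean(control)

estimate_ut = simulate(u_affects_outcome=False)   # u plays the role of Ut
estimate_uty = simulate(u_affects_outcome=True)   # u plays the role of Uty

print(f"u affects treatment only:       {estimate_ut:.2f}")
print(f"u affects treatment and outcome: {estimate_uty:.2f}")
```

Under these assumptions the first estimate stays near the assumed effect of -1, while the second overstates the harm because part of the gap is due to the unobserved variable rather than the treatment.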

**Summary**

Because costs prohibit randomized experiments to evaluate all potential treatments of interest to social scientists, scholars frequently rely on the assumption of conditional ignorability to draw causal claims from observational data. This is a strong and untestable assumption. The Fragile Families Challenge is a setting in which the assumption may be plausible, due to the richness of the covariate set *X*, which includes over 12,000 pre-treatment variables chosen for their potentially important ramifications for child development.

By interviewing a targeted set of children chosen by ensemble predictions of the treatment variables, we will shed light on the credibility of the ignorability assumption.
