We will be presenting three papers about the Fragile Families Challenge at the 2018 American Sociological Association Annual Meeting, which will be held in Philadelphia August 11–14. Please come to these sessions to learn more about the Challenge.
Data-driven Data Provision: A Case Study from the Fragile Families Challenge
Sun, August 12, 8:30 to 10:10am, Philadelphia Marriott Downtown, Level 4, 404
Metadata provides critical support for researchers working with public datasets, but new methods at times outgrow what existing data infrastructure is able to support. This paper describes what happened when a large, heterogeneous group of researchers used a complex social dataset in a way that was not originally envisioned by its creators. Using the Fragile Families Challenge as a case study, we identify five strategic areas where improving metadata (variable names, response codes, cross-questionnaire matching, concept tags, and release format) can make data use easier for everyone. More generally, we illustrate some of the unintentional and invisible barriers that are preventing the use of machine learning methods in the social sciences, and suggest that data system design is a fundamental research problem for the field of computational social science.
The Fragile Families Challenge: Predictability of Family and Child Well-being in Adolescence
Sun, August 12, 10:30am to 12:10pm, Philadelphia Marriott Downtown, Level 4, Franklin Hall 8
Scholars have long hypothesized that childhood experiences play an important role in the process by which socioeconomic status is reproduced across generations. The predictive power of attainment models, however, has been so weak that pioneers of the field have commented that random chance must play an important role. We hypothesize another possible source of poor predictive performance: untapped modeling potential. Modern machine learning approaches often yield better predictions than parametric regression models, yet social scientists have not fully exploited this opportunity. In this paper, we report on how 159 research teams from 68 institutions in 7 countries used rich survey data covering 2,121 training observations on 12,942 variables to produce predictive models that together set a benchmark of predictive performance for outcomes identified by social scientists as important factors in the status attainment process. We narrow our focus to a critical point of the life course: predicting adolescent outcomes as a function of childhood experiences. Each team developed a predictive model that was evaluated on a set of outcome observations available only to the organizers. Results suggest that (a) predictive performance outpaced approaches more common in social science, but (b) overall predictive performance was poor. We close with a discussion of the potential reasons for poor predictive performance in social science research. Given the theoretical importance of childhood experiences in the process of stratification, our results should be of interest to scholars of stratification, socio-economic mobility, child development, and statistical methods.
Privacy, Ethics, and Computational Social Science: A Case Study of the Fragile Families Challenge
Mon, August 13, 4:30 to 6:10pm, Philadelphia Marriott Downtown, Level 4, Franklin Hall 12
New sources of “big data” created by companies and governments hold great promise for advancing social science. Unfortunately, a fundamental barrier preventing researchers from achieving this promise is data access. Quite simply, most big data sources are not accessible to researchers. Therefore, developing procedures that enable safe and ethical data access represents an important methodological problem in computational social science. In this paper, we present our process for enabling data access during the Fragile Families Challenge, a scientific mass collaboration designed to improve the lives of disadvantaged children in the United States. We describe our process of threat modeling, threat mitigation, and third-party oversight. We also describe the ethical principles that formed the basis of our process. Ultimately, we hope that the approach that we developed will be helpful to researchers who seek data access and data custodians who wish to provide data access.