Anyone who uses survey data for research purposes knows how important metadata is for developing an understanding of a dataset’s structure and meaning. One of the main lessons we learned from organizing the Challenge is that machine learning methods place extraordinary demands on metadata. Using 10k variables in a single model requires new ways of reading and using metadata to accomplish necessary data preparation tasks, and many of these tasks are not easily accomplished with the metadata infrastructure most commonly available in the social sciences (e.g., PDF codebooks).
To summarize how we’ve tried to improve these resources and what we learned as we undertook our redesign, we wrote a paper that will appear in a forthcoming special issue of Socius about the Fragile Families Challenge. We provide a link to the paper (on SocArXiv) as well as its abstract below; any comments or questions are most welcome!
Improving metadata infrastructure for complex surveys:
Insights from the Fragile Families Challenge
Abstract: Researchers rely on metadata systems to prepare data for analysis. As the complexity of datasets increases and the breadth of data analysis practices grows, existing metadata systems can limit the efficiency and quality of data preparation. This article describes the redesign of a metadata system supporting the Fragile Families and Child Wellbeing Study based on the experiences of participants in the Fragile Families Challenge. We demonstrate how treating metadata as data (that is, releasing comprehensive information about variables in a format amenable to both automated and manual processing) can make the task of data preparation less arduous and less error-prone for all types of data analysis. We hope that our work will facilitate new applications of machine learning methods to longitudinal surveys and inspire research on data preparation in the social sciences. We have open-sourced the tools we created so that others can use and improve them.
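To make the idea of "metadata as data" concrete, here is a minimal sketch of what machine-readable metadata can enable. It is an illustration only: the column names (`name`, `label`, `data_type`, `wave`), the variable names, and the helper functions are all hypothetical and do not reflect the actual schema or tools described in the paper.

```python
import csv
import io

# Hypothetical metadata table: one row per survey variable.
# Columns and values are illustrative assumptions, not the study's real schema.
METADATA_CSV = """name,label,data_type,wave
var_a,Household income,continuous,1
var_b,Mother's education,ordered_categorical,1
var_c,Interview language,unordered_categorical,2
var_d,Child weight,continuous,2
"""


def load_metadata(text):
    """Parse the metadata CSV into a list of dicts, one per variable."""
    return list(csv.DictReader(io.StringIO(text)))


def select_variables(metadata, **criteria):
    """Return the names of variables whose metadata matches every criterion."""
    return [
        row["name"]
        for row in metadata
        if all(row.get(field) == value for field, value in criteria.items())
    ]


metadata = load_metadata(METADATA_CSV)

# Pick out all continuous variables, e.g. to standardize them before modeling.
continuous_vars = select_variables(metadata, data_type="continuous")

# Pick out unordered categorical variables from wave 2, e.g. for one-hot encoding.
# CSV values are read as strings, so the wave is matched as "2".
wave2_categorical = select_variables(
    metadata, data_type="unordered_categorical", wave="2"
)
```

With thousands of variables, a query like this replaces paging through a PDF codebook by hand, which is the kind of data preparation task the redesign aims to make easier.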