Computational reproducibility and the Fragile Families Challenge special issue


We are currently editing a special issue of Socius about the Challenge. For this special issue, we are striving for a standard of computational reproducibility, which means that other researchers should be able to recreate the results in all of the papers. Therefore, while the manuscripts have been undergoing peer review, we have also been attempting to replicate the results in each paper. This has turned out to be trickier than we expected. In this post, I’d like to briefly summarize what we’ve done so far, and then share a set of guidelines that we’ve developed and shared with our authors. If you have ideas for how these guidelines can be improved, please let us know. Ultimately, we hope that the guidelines will be a helpful resource for authors and editors who wish to promote computational reproducibility, either in their own work or the work of others.

Our replication efforts have been spearheaded by David Liu, and this work will be part of his senior thesis in Princeton’s Department of Computer Science. In attempting to replicate the results of each paper, David has noticed helpful things that some authors have done, and he’s found some problems that come up over and over. Therefore, when we sent back decisions on the manuscripts, we also sent the feedback below on code. Just as authors have to revise and resubmit their manuscripts for the special issue, they will also have to revise and resubmit their code. These guidelines are intended to help with that process.

Background on the reproducibility guidelines

First, we’d like to step back from the details to describe the high-level goal. We want your articles to be computationally reproducible, which means that another researcher could regenerate the results in your paper using the Challenge data, your code, and any additional data that you have created. Computational reproducibility will increase the impact of your work individually, and it will increase the contribution of the Challenge collectively.

As we’ve learned during this first round of reviews, the goal of computational reproducibility is widely shared by scientists, easy to state, and tricky to achieve. Based on what we’ve learned from your code, our thinking on how to achieve this goal has evolved. In particular, we’ve been very influenced by the idea of a “research pipeline” described by Peng and Eckel (2009), which is nicely captured by this figure: http://bit.ly/2qrTWXK.

The goal of this document is to provide you with guidelines that support computational reproducibility of your entire research pipeline, which goes from raw data to final output. You don’t have to follow these guidelines exactly; if you devise a system that you think is better, you are welcome to use it. But if you have no system in place, we strongly encourage you to adopt these guidelines.

The Guidelines

The most important thing to keep in mind is that we are asking you to create one single script named “run_all” that executes all the necessary files to go from the raw data to the final results. One way to do this is to write a bash script that calls the submission files in sequence. An example of a simple bash script is shown below:
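
For illustration, here is a minimal run_all written in bash; the file names under code/ are hypothetical placeholders, so substitute the scripts from your own submission.

#!/bin/bash
# run_all: reproduce the full pipeline from the raw Challenge data to the final outputs
set -e  # stop immediately if any step fails

Rscript code/01_clean_data.R      # R: build the analytic dataset from the raw csv files
python code/02_impute.py          # Python: impute missing values
Rscript code/03_fit_models.R      # R: fit the models and write output/prediction.csv
Rscript code/04_make_figures.R    # R: produce the tables and figures used in the paper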

Running the above script will execute each line, one after another. Note that the example includes calls to scripts written in more than one common language (here, R and Python). More background information on writing bash scripts is available at: https://ryanstutorials.net/bash-scripting-tutorial/bash-script.php. Of course, you may write the run_all script in the language of your choice so long as it can be executed from the command line.

While you are creating this script, we think it will be helpful to organize your input files, intermediate files, and output files into a standard directory (i.e., folder) structure. We think that this structure will help you create a modular research pipeline; see Peng and Eckel (2009) for more on the modularity of a research pipeline. This modular pipeline will make it easier for us to ensure computational reproducibility, and it will make it easier for other researchers to understand, re-use, and improve your code.

Here’s a basic structure that we think might work well for this project and others:

data/
code/
output/
README
LICENSE

In the data/ directory you can include:

  • background.csv (this should not actually be included because of privacy constraints, but we will put it here)
  • train.csv (this should not actually be included because of privacy constraints, but we will put it here)
  • Supplemental materials such as metadata files, the constructed-data dictionary, and the machine-readable codebook.
  • Data that you have collected or created, such as a csv file that you manually created that has your MSE scores on the holdout data and/or an analytic dataset created by your code.

In your code/ directory you can include:

  • Executable “run_all” script that, when run, goes from raw inputs all the way to final outputs (for this script we encourage you to think about the research pipeline idea from Peng and Eckel 2009: http://bit.ly/2qrTWXK)
  • Source code files each with a useful header (see FAQ).
  • Package requirements

In your output/ directory you can include:

  • prediction.csv
  • A subdirectory for tables
  • A subdirectory for figures (we also recommend including all data files that can be used to recreate the figures; see rule 7 of Sandve et al. 2013)

In addition to these three main directories, you should also include a README file and a LICENSE file. We have more information about these files in the FAQ below. We hope that these guidelines are helpful, and please let us know if you have any questions.

Code Resubmission Process

Once you think you are ready to resubmit, here’s a checklist that you can follow to help ensure that your work will be computationally reproducible:

  • I have written the kind of README file that I would like to read (see FAQ below)
  • Each code file that I’ve written has a header that will be helpful (see FAQ below)
  • I’ve run the submission and I can get from raw files to final output using only materials in my directories. Then, I’ve done this again and I get the same result. This second step helps check for problems with seeding.
  • I’ve considered refactoring my code (see FAQ below)

Finally, when you resubmit, we ask that you include a revision memo about the code, just as you will about the manuscript. This revision memo should summarize the changes that you have made. In this revision memo, please also include a rough estimate of the cumulative amount of time it took you to comply with these guidelines. We are asking for this time estimate because one objection to computational reproducibility is that it is too burdensome for authors, and we would like to assess this empirically. Please also include any suggestions for how this process could have been easier or more efficient.

F.A.Q.

What should go in the README file?

The README file should provide an overview of your code. For example, it could include a diagram showing the different pieces of your code, their inputs, and their outputs. If relevant, please include the warnings to expect when executing the code. Also mention any provided “intermediate results” that readers can use to decompose the submission into smaller pieces.

The README should also include something about your computing environment and expected run time; general terms are appropriate here. For example: “I ran this on a modern laptop (circa 2016) and it ran in a few minutes.” or “This code ran on a high-performance cluster and took one week.” Finally, please clearly cite any open-source content used in the submission, such as resources shared on the FFC blog or more general packages distributed in the computational community.
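
As a rough illustration (the file names and details here are hypothetical), a short README might look something like this:

Overview: run_all executes the scripts in code/ in numerical order, going from the raw Challenge files in data/ to the tables, figures, and prediction.csv in output/.
Intermediate results: the data-cleaning step writes data/analytic.csv, so readers can start from the modeling step if they prefer.
Expected warnings: the modeling step prints convergence warnings for two of the models; these are expected.
Environment and runtime: run on a modern laptop (circa 2016); the full pipeline takes about an hour.
Citations: the R and Python packages used are cited in the paper’s acknowledgements.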

What headers should be included at the top of each piece of code?

Based on the ideas in Nagler (1995), we think the following elements should be included at the top of each piece of code (a sample header is sketched below this list):

  • Purpose (in 140 characters or less)
  • Inputs
  • Outputs
  • Machine used (e.g., laptop, desktop, cluster)
  • Expected runtime (e.g., seconds, minutes, hours, days, etc)
  • Set the seed at the beginning of each file (see rule 6 of Sandve et al. 2013)
  • All the package include statements (e.g., “library(ggplot2)” in R)
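
For example, a header along the following lines would satisfy these guidelines (the #-style comments work in R, Python, and bash alike; the file names, machine, and runtime here are hypothetical):

# Purpose: fit regularized regression models to predict GPA at age 15
# Inputs:  data/background.csv, data/train.csv
# Outputs: output/tables/model_coefficients.csv
# Machine: laptop
# Runtime: approximately 10 minutes
# The first executable lines below this header should set the seed and load all required packages.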

If you would like to deviate from this standard, please contact us.

How can I make my code easier to read?

It is hard to offer general advice, but one thing we can recommend is taking some time at the end of the process to refactor your code (https://en.wikipedia.org/wiki/Code_refactoring). In our experience, code evolves over the course of a project, and at the end it can be helpful to refactor in order to clean up the structure, improve variable names, and promote modularity.

Even if you don’t refactor your code, please add comments to any helper functions and code segments that may be obscure to new readers.

What is our standard for computational reproducibility for the special issue?

Our standard for computational reproducibility for this special issue is that we should be able to take whatever code and data you submit, add the Fragile Families Challenge data file, and then reproduce all of the figures in your paper, all of the tables in your paper, and your prediction.csv file.

What is not included in our standard for computational reproducibility for the special issue?

We will not attempt to completely recreate your analysis from the written materials. Also, we will not verify that your description in the paper matches the code. For example, if the paper says that you use logistic regression to generate your predictions, we will not verify that the code also uses logistic regression. Further, we will not verify the information that you have provided from external sources. For example, if you write in the paper that your submission was 10th on the leaderboard, we will not verify this fact. Finally, we will not verify any of the numbers that are included in the text of the manuscript. For example, we would not verify a claim in the text such as: dropping variables with no variation removes 10% of variables. As we hope this list illustrates, our standard of computational reproducibility is in fact quite limited.

What license should I use?

We strongly recommend the MIT license. You can find it here: https://opensource.org/licenses/MIT. Simply replace <YEAR> with 2018 and <COPYRIGHT HOLDER> with the names of all co-authors of the paper, in the order they are listed in the paper. If you would like to use some other license, please contact us.
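
For example, the copyright line of the license would then read something like “Copyright 2018 Jane Doe, John Smith” (the author names here are hypothetical).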

What should I read to learn more about computational reproducibility?

Here’s a partial list. If we’ve left off a good resource, please let us know (fragilefamilieschallenge@gmail.com).

Nagler (1995) “Coding Style and Good Computing Practices” PS: Political Science & Politics. (open access version)

Peng and Eckel (2009) “Distributed Reproducible Research Using Cached Computations” Computing in Science & Engineering.

Sandve et al. (2013) “Ten Simple Rules for Reproducible Computational Research” PLOS Computational Biology.

Stodden et al. (2016) “Enhancing reproducibility for computational methods” Science.

Fragile Families Challenge special issue feedback


We’ve recently completed the first round of reviews for papers in the special issue of Socius about the Fragile Families Challenge. There were many really interesting manuscripts submitted, but there were a variety of issues that came up repeatedly in the reviews. Therefore, in addition to providing feedback on each manuscript individually, we also developed some overall feedback that we provided to all authors. We are posting that feedback here in the hopes that it might help others who are planning to run a mass collaboration and publish a special issue.

Feedback to all authors

Based on our reading of all submissions and all reviews, we are encouraging all authors submitting revisions to the special issue to give extra attention to reviewer comments in the following three areas:

1) Accuracy. We are encouraging all authors, in their revisions, to focus on more clearly describing what they did, why they did it, and what might be learned from it. You must accurately report what you did. When reviewers ask why you did something, this is an important question to address. For the purpose of the special issue, you do not always need a formal justification for making a decision; if you just thought it seemed reasonable, you should say that.

In addition, we are encouraging all authors to clearly report all of their results, not just those that make their approach look more promising. When deciding whether to publish the paper, a major factor for us will be whether the paper communicates clearly the strengths and weaknesses of the approach. This factor will be much more important than whether the results are “interesting” or “promising.” Any reviewer comments about selective reporting are especially important to address.

If you used an approach that required tuning parameters (e.g., the lambda parameter in LASSO), please say how you set the parameters. The most common approaches seem to be cross-validation or using the defaults in the software. This should be clear in the papers.

2) Interest. A reader of your paper should quickly see why it would be of interest to some social scientists or some data scientists. We encourage you to add a few sentences in the introduction that clarify what you think are the most interesting or important ideas or results in your paper. Again, we think this will be helpful given the interdisciplinary nature of the readership. Also, if you think the main contribution is to establish the baseline against which future efforts can be compared, we think that is an important contribution.

3) Presentation. It is very important that the special issue be readable for both data scientists and social scientists. These communities sometimes use different language, and we have sought reviewers from both cultures. When reviewers are confused about something common in your field, realize that an extra sentence or reference might make the paper more readable to a diverse audience, thereby increasing the impact of your paper.

Also, inconsistent terminology often stands in the way of effective presentation. Be careful that your manuscript uses internally consistent terminology. One recommendation to promote consistency is to choose a book or an authoritative article and use its terminology. This way, terminology will be internally consistent, and confused readers are immediately pointed toward a source that can help them understand.

Stepping back from these three areas of focus, we would like to remind authors that the use of online supporting material can greatly improve accuracy, interest, and presentation. Yet very few of the manuscripts used this opportunity. Online supporting materials can be arbitrarily long and provide an opportunity to be clear about even the most mundane decisions (accuracy), reduce clutter in the paper so that non-specialists can follow the main ideas (interest), and provide an outlet to share details with researchers who wish to understand and build on your work (presentation). If there is part of your paper that will be of interest to only a small subset of readers, we strongly encourage you to put this information in the online supporting materials.

Based on our reading of all submissions and all reviews, we are encouraging all authors submitting revisions to the special issue to make certain formatting changes:

1) In the acknowledgements, you should list and cite the software that you use. This will promote reproducibility and give academic credit to the people who create software. We recommend two sentences like these: “The results in this paper were created with software written in R 3.3.3 (R Core Team, 2017) using the following packages: ggplot2 2.2.1 (Wickham, 2009), broom 0.4.2 (Robinson, 2017), and caret 6.0-78 (Kuhn, 2017). Replication code for this article is available at [ url coming soon, we are still exploring permanent homes for your code ].” If you would like to learn more about citing software in R, we recommend: http://www.blopig.com/blog/2013/07/citing-r-packages-in-your-thesispaperassignments/. If you would like to learn more about citing software in Python, we recommend: https://www.scipy.org/citing.html. We realize that citation standards for software are still evolving, so please ask if you have any questions.

2) Each of your papers should acknowledge the funders of the Fragile Families and Child Wellbeing Study and the funders of the Fragile Families Challenge. Therefore, we ask you add these sentences to the acknowledgements section of your paper: “Funding for the Fragile Families and Child Wellbeing Study was provided by the Eunice Kennedy Shriver National Institute of Child Health and Human Development through grants R01HD36916, R01HD39135, and R01HD40421 and by a consortium of private foundations, including the Robert Wood Johnson Foundation. Funding for the Fragile Families Challenge was provided by the Russell Sage Foundation.”

3) Several reviewers who were not part of the Challenge found the papers slightly confusing. Although we previously told you not to describe the Challenge, we think that was a mistake. You are writing a paper for a special issue of a journal, not a book chapter. Therefore, we would ask that you add one paragraph in the introduction of your paper providing a brief overview of the Challenge. Obviously the entire Challenge cannot be described in one paragraph, so you can cite our introduction to the special issue to provide more information. For now you can cite the introduction as Salganik, Lundberg, Kindel, and McLanahan “Introduction to the special issue on the Fragile Families Challenge.” We think this change will help make the articles more self-contained and will therefore increase their impact.

4) We encourage you to add a single paragraph in the introduction section of your paper that provides a roadmap to your paper. For example, “In Section 2 we describe our approach to data preparation. Then, in Section 3 we describe our procedure for variable selection. In Section 4, we describe the different models we used for prediction and compare their performance. In Section 5, we attempt to interpret the predictive models. The paper concludes with recommendations for future research.” Although many short papers do not require this kind of roadmap, we think that it will be helpful given the interdisciplinary nature of the readership.

We are offering the two forms of support below to help you write the best paper possible.

1) Additional analyses. If you would like to undertake additional analyses that would require access to the holdout data, we would be happy to help facilitate that, so long as all results are reported in the paper as post-Challenge results.

2) Talk with editors. We believe an open exchange often produces the best papers. If you have any questions, please email us (fragilefamilieschallenge@gmail.com). If you would like to talk with us after you have read through the reviews and charted a plan for your revisions, feel free to email us and we would be happy to arrange that.

Regarding your code, some of you have already heard from us about our efforts to reproduce your results and others will hear from us soon. We hope that while you are revising and improving your paper, you will also revise and improve your code. You will receive more specific instructions from us soon.

Fragile Families Challenge data are now available


Researchers interested in using the data from the Fragile Families Challenge can now apply for access through the Office of Population Research data archive. We hope that this data will be used to replicate and extend research conducted during the Challenge. We also hope that this data will be used in teaching. Many participants in the Challenge began working on it in a class, and we’ve heard from professors that the Challenge provides a great learning opportunity.

FAQ