We’ve just released the Fragile Families Challenge data in .dta format, which means the files now include metadata that was not available in the .csv files that we initially released. The .dta format is native to Stata, and you might prefer to use R. So, in this post, I’ll give some pointers to getting up and running with the .dta file in R. If you have questions—and suggestions—please feel free to post them at the bottom of this post.
Here’s how you can read in the .dta files (and I’ll read in the .csv file too so that we can compare them):
library(haven)  # provides read_dta
library(readr)  # provides read_csv
ffc.stata <- read_dta(file = "background.dta")
ffc.csv <- read_csv(file = "background.csv")
Once you start working with ffc.stata, one thing you will notice is that many columns are of type labelled, which is not common in R. To convert labelled columns to factors, use as_factor (not as.factor). For more on labelled and as_factor, see the documentation of haven.
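Here is a minimal sketch of what that conversion looks like. To keep it self-contained, it builds a small toy labelled vector rather than using a real column of ffc.stata; the variable name, values, and labels are all invented for illustration.

```r
library(haven)

# Toy labelled vector standing in for one column of ffc.stata.
# The values and labels are invented for illustration.
answer <- labelled(
  c(1, 2, 1, 2),
  labels = c(Yes = 1, No = 2),
  label = "A made-up yes/no question"
)

is.labelled(answer)      # TRUE: this is the labelled type
attr(answer, "labels")   # value labels, the kind of metadata a .dta file carries
attr(answer, "label")    # variable label

# as_factor() from haven uses the value labels as the factor levels.
answer_factor <- as_factor(answer)
levels(answer_factor)
```

The same call should work on a whole column, for example as_factor(ffc.stata$m1b9b11), though I'd check the resulting levels against the codebook before relying on them.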
Another thing you will notice is that some of the missing data codes from the Stata file don’t get converted to NA. For example, consider the variable "m1b9b11" for the person with challengeID 1104. This is a missing value that should be NA. It gets parsed correctly in the .csv file but not in the .dta file.
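One possible workaround is to recode those values yourself. The sketch below is only an illustration: codes_to_na is a hypothetical helper (it is not part of haven), the toy column is made up, and the choice of -9 as a missing-data code is an assumption. Which codes actually mean "missing" should come from the codebook.

```r
library(haven)

# Hypothetical helper (not part of haven): drop the labels and turn the
# listed codes into NA. Which codes mean "missing" is an assumption that
# should be checked against the codebook.
codes_to_na <- function(x, missing_codes) {
  values <- as.vector(zap_labels(x))
  values[values %in% missing_codes] <- NA
  values
}

# Toy column: -9 is an invented missing-data code.
toy <- labelled(c(1, 2, -9), labels = c(Yes = 1, No = 2, Missing = -9))

sum(is.na(toy))   # the code is still stored as a number, so nothing is NA yet

cleaned <- codes_to_na(toy, missing_codes = -9)
is.na(cleaned)
```

Note that this version returns a plain numeric vector, so the value labels are lost. To look at the specific case mentioned above, ffc.stata$m1b9b11[ffc.stata$challengeID == 1104] shows the raw value that was read from the .dta file.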
If you have questions or suggestions about working with .dta files in R, please feel free to post them below.
- The read_dta function in haven is a wrapper around the ReadStat C library.
- The read.dta function in the foreign package was popular in the past, but that function is now frozen and will not support anything after Stata 12.
- Another way to read .dta files into R is the readstata13 package, which, despite what the name suggests, can read Stata 13 and Stata 14 files.
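If you want to try readstata13, here is a rough sketch of the basic call. So that it runs without the challenge files, it first writes a tiny made-up data frame to a temporary .dta file and then reads it back; the column names and values are invented.

```r
library(readstata13)

# Toy data frame written to a temporary .dta file so the example is
# self-contained. The contents are invented for illustration.
toy <- data.frame(challengeID = c(1, 2, 3), answer = c(1, 2, 1))
path <- tempfile(fileext = ".dta")
save.dta13(toy, file = path)

# read.dta13() plays the role that read_dta() plays in haven.
toy_back <- read.dta13(path)
str(toy_back)
```

For the challenge data, the equivalent call would be read.dta13("background.dta"). It returns an ordinary data frame rather than haven's labelled columns, so the metadata is handled differently.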