Our Blog

Our Blog

Using .dta files in R

Uncategorized 2 comments
featured image

We’ve just released the Fragile Families Challenge data in .dta format, which means the files now include metadata that was not available in the .csv files that we initially released. The .dta format is native to Stata, and you might prefer to use R. So, in this post, I’ll give some pointers to getting up and running with the .dta file in R. If you have questions—and suggestions—please feel free to post them at the bottom of this post.

There are many ways to read .dta files into R. In this post I’ll use haven because it is part of the tidyverse.

Here’s how you can read in the .dta files (and I’ll read in the .csv file too so that we can compare them):

library(tidyverse)
library(haven)
ffc.stata <- read_dta(file = "background.dta")
ffc.csv <- read_csv(file = "background.csv")

One you start working with ffc.stata, one thing you will notice is that many columns are of type labelled, which is not common in R. To convert labelled to factors, use as_factor (not as.factor). For more on labelled and as_factors, see the documentation of haven.

Another thing you will notice is that some of the missing data codes from the Stata file don’t get converted to NA. For example, consider the variable "m1b9b11" for the person with challengeID 1104. This is a missing value that should be NA. This gets parsed correctly in the csv files but not the Stata file.

is.na(ffc.stata[(ffc.stata$challengeid==1104), "m1b9b11"])
is.na(ffc.csv[(ffc.csv$challengeID==1104), "m1b9b11"])

If you have questions---and suggestions---about working with .dta files in R, please feel free to post them below.

Notes:

  • The read_dta function in haven is a wrapper around the ReadStat C library.
  • The read.dta function in the foreign library was popular in the past, but that function is now frozen and will not support anything after Stata 12.
  • Another way to read .dta files into R is the readstata13 package, which, despite what the name suggests, can read Stata 13 and Stata 14 files.

About Matt Salganik

Matthew Salganik is a Professor of Sociology at Princeton University. He is also the author of Bit by Bit: Social Research in the Digital Age (http://www.bitbybitbook.com). You can learn more about his research at http://www.princeton.edu/~mjs3.

2 comments

Ben Smith - August 4, 2020 Reply

Matt,
I am a PhD candidate in economics. I am using the Fragile Families dataset for my dissertation. I am trying to learn R, though I am progressing slowly.
Your post on using as_factor() to transform these data was incredibly helpful. However, I am stuck again. I have transformed most of the variables I would like to use. I am trying to examine correlations and would like to do some pairwise scatterplots. I am running into some errors. It seems like R cannot plot these factor doubles. Do you have a recommendation on how to transform these data types further so that they can be graphed more easily?
Thank you,
Benjamin

Matt Salganik - December 3, 2021 Reply

I don’t have any recommendations, unfortunately. Good luck with your project. Matt

Leave a Reply Cancel