Our Blog

Our Blog

Using .dta files in Python

Uncategorized 1 comment
featured image

To make data cleaning easier, we’ve released a version of the background variables file in .dta format, generated by Stata. In addition to the table of background data, this file contains metadata on the types of each column, as well as a short label describing the survey questions that correspond to each column. Our hope is that this version of the data file will make it easier for participants to select and interpret variables in their predictive models. If you have any questions or suggestions, please let us know in the comments below!

Working with .dta files in Python

The primary way to work with a .dta file in Python is to use the read_stata() function in pandas, as follows:

import pandas as pd

df_path = "/Users/user/FFC/data/background.dta"

df = None
with open(df_path, "r") as f:
    df = pd.read_stata(f)
    print df.head()

This creates a pandas.DataFrame object that contains the background variables. By default, pandas will automatically retain the data types as defined in the .dta file.

Notes

  • Documentation for pandas.read_stata() is available here.
  • The read_stata() function accepts either a file path or a read buffer (as above).
  • The .dta file is generated by Stata 14. There are some compatibility issues with pandas and Stata 14 .dta files due to changes to field size limits from earlier versions of Stata. In particular, any UTF-8 decoding errors you run into are likely due to this issue. Please let us know if you run into any trouble working with the file in pandas!

About Alex Kindel

Alex is a PhD student in Sociology at Princeton University. You can read more about his work at princeton.edu/~akindel.

1 comment

Michael - February 20, 2019 Reply

Hey Alex,

I can’t open this one stata file using pandas due to a UTF 8 decoding error.
Do you have any ideas as how to fix this ?

Add your comment