To make data cleaning easier, we’ve released a version of the background variables file in .dta format, generated by Stata. In addition to the table of background data, this file contains metadata on the types of each column, as well as a short label describing the survey questions that correspond to each column. Our hope is that this version of the data file will make it easier for participants to select and interpret variables in their predictive models. If you have any questions or suggestions, please let us know in the comments below!
Working with .dta files in Python
The primary way to work with a .dta file in Python is to use the read_stata() function in pandas, as follows:
    import pandas as pd

    df_path = "/Users/user/FFC/data/background.dta"

    # Open in binary mode: .dta is a binary format, so text mode ("r") fails.
    with open(df_path, "rb") as f:
        df = pd.read_stata(f)

    print(df.head())
This creates a pandas.DataFrame object that contains the background variables. By default, pandas will automatically retain the data types as defined in the .dta file.
- Documentation for pandas.read_stata() is available in the pandas documentation.
- The read_stata() function accepts either a file path or a read buffer (as above).
- The .dta file is generated by Stata 14. There are some compatibility issues with pandas and Stata 14 .dta files due to changes to field size limits from earlier versions of Stata. In particular, any UTF-8 decoding errors you run into are likely due to this issue. Please let us know if you run into any trouble working with the file in pandas!
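The short labels mentioned at the top of this post are not part of the DataFrame itself, so here is a minimal sketch of one way to get at them. It assumes a reasonably recent pandas, where read_stata(..., iterator=True) returns a StataReader that has a variable_labels() method. So that it runs without the real data, the sketch first writes a tiny stand-in .dta file; the file name, the column names, and the labels in it are made up for illustration and do not come from the background file.

```python
import os
import tempfile

import pandas as pd

# Build a tiny stand-in .dta file so the example runs without the real data.
# The column names and labels below are hypothetical.
toy = pd.DataFrame({"challengeid": [1, 2, 3], "m1a4": [1.0, 2.0, 1.0]})
toy_labels = {
    "challengeid": "Challenge ID",
    "m1a4": "Hypothetical survey question",
}

toy_path = os.path.join(tempfile.mkdtemp(), "toy_background.dta")
toy.to_stata(toy_path, write_index=False, variable_labels=toy_labels)

# Passing a path (rather than a buffer) lets pandas open the file itself.
df = pd.read_stata(toy_path)

# iterator=True returns a StataReader, which exposes the per-column labels
# as a dict mapping column name -> label.
with pd.read_stata(toy_path, iterator=True) as reader:
    labels = reader.variable_labels()

print(df.dtypes)
for column, label in labels.items():
    print(column, "->", label)
```

To try this on the real file, point the two read_stata() calls at your copy of background.dta instead of toy_path. The resulting dict can then be searched for keywords when you are deciding which variables to include in a model.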