Using .dta files in Python


To make data cleaning easier, we’ve released a version of the background variables file in .dta format, generated by Stata. In addition to the table of background data, this file contains metadata on the types of each column, as well as a short label describing the survey questions that correspond to each column. Our hope is that this version of the data file will make it easier for participants to select and interpret variables in their predictive models. If you have any questions or suggestions, please let us know in the comments below!

Working with .dta files in Python

The primary way to work with a .dta file in Python is to use the read_stata() function in pandas, as follows:

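import pandas as pd

df_path = "/Users/user/FFC/data/background.dta"

# Open the file in binary mode ("rb"). Text mode ("r") makes Python 3
# try to decode the bytes as UTF-8, which fails on .dta files.
with open(df_path, "rb") as f:
    df = pd.read_stata(f)

print(df.head())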

This creates a pandas.DataFrame object that contains the background variables. By default, pandas will automatically retain the data types as defined in the .dta file.
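
As a minimal sketch of the two input styles read_stata() accepts, the example below writes a tiny, made-up table to a temporary .dta file and reads it back both ways. The column names (challengeid, score) and the file name are placeholders for illustration, not fields taken from the real background file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the background file: a tiny numeric table.
original = pd.DataFrame(
    {
        "challengeid": [1, 2, 3],
        "score": [0.5, 1.5, 2.5],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    dta_path = os.path.join(tmp_dir, "example.dta")
    original.to_stata(dta_path, write_index=False)

    # Option 1: pass the file path and let pandas open the file itself.
    from_path = pd.read_stata(dta_path)

    # Option 2: pass a buffer, opened in binary mode ("rb"), not text mode.
    with open(dta_path, "rb") as f:
        from_buffer = pd.read_stata(f)

# Inspect the column types that pandas recovered from the file.
print(from_path.dtypes)
print(from_path.equals(from_buffer))
```

Both calls should produce the same DataFrame. Passing the path is usually the simpler choice, since it avoids the question of file mode altogether.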

Notes

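Documentation for pandas.read_stata() is available at https://pandas.pydata.org/docs/reference/api/pandas.read_stata.html.
The read_stata() function accepts either a file path or a read buffer. A buffer must be opened in binary mode ("rb"); passing the path directly is usually simpler.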
The .dta file is generated by Stata 14. There are some compatibility issues with pandas and Stata 14 .dta files due to changes to field size limits from earlier versions of Stata. In particular, any UTF-8 decoding errors you run into are likely due to this issue. Please let us know if you run into any trouble working with the file in pandas!
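
The short column labels stored in the .dta file can also be read from pandas. The sketch below assumes a small, made-up file with hypothetical column names and labels (age, income), written with pandas' default .dta version rather than the Stata 14 format of the real file. It uses iterator=True to get a StataReader, whose variable_labels() method returns a dictionary mapping column names to labels.

```python
import os
import tempfile

import pandas as pd

# Hypothetical columns and labels; the real file has its own names.
table = pd.DataFrame({"age": [25, 31], "income": [1200.0, 1850.5]})
labels = {"age": "Age in years", "income": "Monthly income"}

with tempfile.TemporaryDirectory() as tmp_dir:
    dta_path = os.path.join(tmp_dir, "labeled.dta")
    table.to_stata(dta_path, write_index=False, variable_labels=labels)

    # iterator=True returns a StataReader instead of a DataFrame.
    with pd.read_stata(dta_path, iterator=True) as reader:
        column_labels = reader.variable_labels()  # {column name: label}
        df = reader.read()  # the data itself, as a DataFrame

for name, label in column_labels.items():
    print(f"{name}: {label}")
```

With the real background file, the same pattern should give a dictionary that can be searched for keywords when choosing variables for a model.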

4 comments

micky - August 4, 2017

I got the following error while running your code, specifically in the read_stata function.

File "d:\Python-learn\linear_regression.py", line 15, in <module>
data = pd.read_stata(f)
File "C:\Python27\Lib\site-packages\pandas\io\stata.py", line 171, in read_stata
chunksize=chunksize, encoding=encoding)
File "C:\Python27\Lib\site-packages\pandas\io\stata.py", line 995, in __init__
self._read_header()
File "C:\Python27\Lib\site-packages\pandas\io\stata.py", line 1017, in _read_header
self._read_old_header(first_char)
File "C:\Python27\Lib\site-packages\pandas\io\stata.py", line 1220, in _read_old_header
raise ValueError(_version_error)
ValueError: Version of given Stata file is not 104, 105, 108, 111 (Stata 7SE), 113 (Stata 8/9), 114 (Stata 10/11), 115 (Stata 12), 117 (Stata 13), or 118 (Stata 14)

Michael - February 20, 2019

Hey Alex,

I can’t open one Stata file using pandas due to a UTF-8 decoding error.
Do you have any ideas on how to fix this?

Karla - October 18, 2019

Hello Alex,

I’m having the same problem as Michael.

The error is:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
in
3 ipumsi_mexico = None
4 with open(ipumsi_mexico_path, "r") as f:
----> 5 ipumsi_mexico = pd.read_stata(f)
6 print(ipumsi_mexico.head())

/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
186 else:
187 kwargs[new_arg_name] = new_arg_value
--> 188 return func(*args, **kwargs)
189 return wrapper
190 return _deprecate_kwarg

/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index_col, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)
184 columns=columns,
185 order_categoricals=order_categoricals,
--> 186 chunksize=chunksize)
187
188 if iterator or chunksize:

/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in __init__(self, path_or_buf, convert_dates, convert_categoricals, index_col, convert_missing, preserve_dtypes, columns, order_categoricals, encoding, chunksize)
995 else:
996 # Copy to BytesIO, and ensure no encoding
--> 997 contents = path_or_buf.read()
998 self.path_or_buf = BytesIO(contents)
999

/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9e in position 171: invalid start byte

Peter Francis - April 19, 2021

You need to pass the file path to pd.read_stata(), not the actual file object. Check out the documentation here: https://pandas.pydata.org/docs/reference/api/pandas.read_stata.html

Alex Kindel

May 25, 2017