Using .dta files in Python

To make data cleaning easier, we’ve released a version of the background variables file in .dta format, generated by Stata. In addition to the table of background data, this file contains metadata on the types of each column, as well as a short label describing the survey questions that correspond to each column. Our hope is that this version of the data file will make it easier for participants to select and interpret variables in their predictive models. If you have any questions or suggestions, please let us know in the comments below!
Working with .dta files in Python
The primary way to work with a .dta file in Python is to use the read_stata() function in pandas, as follows:
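import pandas as pd

df_path = "/Users/user/FFC/data/background.dta"

# Open in binary mode: .dta is a binary format, so a text-mode
# handle ("r") can raise UnicodeDecodeError under Python 3.
with open(df_path, "rb") as f:
    df = pd.read_stata(f)

print(df.head())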
This creates a pandas.DataFrame object that contains the background variables. By default, pandas will automatically retain the data types as defined in the .dta file.
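As a rough sketch of how the retained types and the per-column question labels can be inspected, the example below writes a tiny stand-in file and reads it back. The file name demo.dta and the columns idnum and income are made up for illustration and are not taken from the real background file. The labels are read through the reader object that pandas returns when iterator=True is passed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for background.dta: a tiny file in a temp directory.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.dta")
demo = pd.DataFrame({"idnum": [1, 2, 3], "income": [10.5, 20.0, 7.25]})
demo.to_stata(
    demo_path,
    write_index=False,
    variable_labels={"idnum": "Family ID", "income": "Household income"},
)

# Passing the path lets pandas open the file in binary mode itself.
df = pd.read_stata(demo_path)
print(df.dtypes)  # column types as stored in the .dta file

# The reader object exposes the short labels stored alongside each column.
with pd.read_stata(demo_path, iterator=True) as reader:
    labels = reader.variable_labels()

for column, label in labels.items():
    print(column, "->", label)
```

With the real file, the same two calls on df_path should list each background variable next to the survey question it corresponds to, assuming the labels were saved in the file as described above.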
Notes
- Documentation for pandas.read_stata() is available at https://pandas.pydata.org/docs/reference/api/pandas.read_stata.html.
- The read_stata() function accepts either a file path or a read buffer (as above). A buffer should be opened in binary mode ("rb").
- The .dta file is generated by Stata 14. There are some compatibility issues with pandas and Stata 14 .dta files due to changes to field size limits from earlier versions of Stata. In particular, any UTF-8 decoding errors you run into are likely due to this issue. Please let us know if you run into any trouble working with the file in pandas!
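When a file refuses to load, two quick checks can narrow things down: which .dta format the file declares, and whether reading by path and reading from a binary buffer agree. The sketch below is illustrative only. dta_format_version is a hypothetical helper (not part of pandas), demo.dta is a throwaway file, and writing with version=118 (the format used by Stata 14) assumes a reasonably recent pandas.

```python
import os
import tempfile

import pandas as pd


def dta_format_version(path):
    """Return the format number declared in a .dta header (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(64)
    # Formats 117 and later start with an XML-like header
    # that contains <release>NNN</release>.
    if head.startswith(b"<stata_dta>"):
        start = head.index(b"<release>") + len(b"<release>")
        end = head.index(b"</release>")
        return int(head[start:end])
    # Older formats store the format number in the first byte.
    return head[0]


# Throwaway file written in format 118 to mimic a Stata 14 export.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.dta")
pd.DataFrame({"x": [1, 2, 3]}).to_stata(demo_path, write_index=False, version=118)

print(dta_format_version(demo_path))

# Reading by path and reading from a binary buffer should give the same frame.
df_from_path = pd.read_stata(demo_path)
with open(demo_path, "rb") as f:
    df_from_buffer = pd.read_stata(f)

print(df_from_path.equals(df_from_buffer))
```

If the helper returns a number that your installed pandas does not list as supported, upgrading pandas or re-saving the file in an older format from Stata are the usual options.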
micky - August 4, 2017
I got the following error while running your code, specifically in the read_stata function.
File "d:\Python-learn\linear_regression.py", line 15, in
data = pd.read_stata(f)
File "C:\Python27\Lib\site-packages\pandas\io\stata.py", line 171, in read_stata
chunksize=chunksize, encoding=encoding)
File "C:\Python27\Lib\site-packages\pandas\io\stata.py", line 995, in __init__
self._read_header()
File "C:\Python27\Lib\site-packages\pandas\io\stata.py", line 1017, in _read_header
self._read_old_header(first_char)
File "C:\Python27\Lib\site-packages\pandas\io\stata.py", line 1220, in _read_old_header
raise ValueError(_version_error)
ValueError: Version of given Stata file is not 104, 105, 108, 111 (Stata 7SE), 113 (Stata 8/9), 114 (Stata 10/11), 115 (Stata 12), 117 (Stata 13), or 118 (Stata 14)
Michael - February 20, 2019
Hey Alex,
I can’t open this one Stata file using pandas due to a UTF-8 decoding error.
Do you have any ideas as to how to fix this?
Karla - October 18, 2019
Hello Alex,
I’m having the same problem as Michael.
The error is: —————————————————————————
UnicodeDecodeError Traceback (most recent call last)
in
3 ipumsi_mexico = None
4 with open(ipumsi_mexico_path, "r") as f:
----> 5 ipumsi_mexico = pd.read_stata(f)
6 print(ipumsi_mexico.head())
/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
186 else:
187 kwargs[new_arg_name] = new_arg_value
--> 188 return func(*args, **kwargs)
189 return wrapper
190 return _deprecate_kwarg
/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
186 else:
187 kwargs[new_arg_name] = new_arg_value
--> 188 return func(*args, **kwargs)
189 return wrapper
190 return _deprecate_kwarg
/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index_col, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)
184 columns=columns,
185 order_categoricals=order_categoricals,
--> 186 chunksize=chunksize)
187
188 if iterator or chunksize:
/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
186 else:
187 kwargs[new_arg_name] = new_arg_value
--> 188 return func(*args, **kwargs)
189 return wrapper
190 return _deprecate_kwarg
/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
186 else:
187 kwargs[new_arg_name] = new_arg_value
--> 188 return func(*args, **kwargs)
189 return wrapper
190 return _deprecate_kwarg
/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in __init__(self, path_or_buf, convert_dates, convert_categoricals, index_col, convert_missing, preserve_dtypes, columns, order_categoricals, encoding, chunksize)
995 else:
996 # Copy to BytesIO, and ensure no encoding
--> 997 contents = path_or_buf.read()
998 self.path_or_buf = BytesIO(contents)
999
/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9e in position 171: invalid start byte
Peter Francis - April 19, 2021
You need to pass the file path to pd.read_stata(), not the actual file object. Check out the documentation here: https://pandas.pydata.org/docs/reference/api/pandas.read_stata.html