Error reading non UTF-8 headers even after specifying encoding
pyreadstat throws a UTF-8 decoding error when reading this file or this file. Haven is able to read them both successfully, which makes me think this is a problem with pyreadstat rather than readstat itself.
It appears that there are Windows-1252 encoded "smart quotes" in the header that pyreadstat is trying to decode as UTF-8. Passing encoding="..." to read_dta has no effect.
To reproduce:
# get file
curl -O http://www.principlesofeconometrics.com/stata/cocaine.dta
# fails to read
python -c '
import pyreadstat
pyreadstat.read_dta("cocaine.dta")
pyreadstat.read_dta("cocaine.dta", encoding = "WINDOWS-1252")
pyreadstat.read_dta("cocaine.dta", encoding = "CP1252")
'
# successfully reads
python -c 'import pandas as pd; pd.read_stata("cocaine.dta")'
R -e 'haven::read_dta("cocaine.dta")'
Full stack trace:
Traceback (most recent call last):
File "<string>", line 1, in <module>
import pyreadstat; pyreadstat.read_dta("cocaine.dta")
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
File "pyreadstat/pyreadstat.pyx", line 301, in pyreadstat.pyreadstat.read_dta
File "pyreadstat/_readstat_parser.pyx", line 1176, in pyreadstat._readstat_parser.run_conversion
File "pyreadstat/_readstat_parser.pyx", line 796, in pyreadstat._readstat_parser.handle_note
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 88: invalid start byte
The relevant part of cocaine.dta: \x93 is a non-UTF-8 quote character that pyreadstat is trying to decode as UTF-8.
00000370 20 43 61 75 6c 6b 69 6e 73 2c 20 4a 2e 50 2e 20 | Caulkins, J.P. |
00000380 61 6e 64 20 52 2e 20 50 61 64 6d 61 6e 20 28 31 |and R. Padman (1|
00000390 39 39 33 29 2c 20 93 51 75 61 6e 74 69 74 79 20 |993), .Quantity | <- BAD LINE
000003a0 44 69 73 63 6f 75 6e 74 73 20 61 6e 64 20 51 75 |Discounts and Qu|
000003b0 61 6c 69 74 79 20 50 72 65 6d 69 61 20 66 6f 72 |ality Premia for|
installed from pip in a virtualenv, 64-bit Mac, python 3.13, pyreadstat 1.30
Brute forcing all iconv encodings does nothing
import requests
import pyreadstat
url = "https://gist.githubusercontent.com/hakre/4188459/raw/13b4171a4415a1a5b360a6b39fe8661913622aa0/iconv-l.txt"
encodings = requests.get(url).text.split()
for e in encodings:
    try:
        pyreadstat.read_dta("cocaine.dta", encoding=e)
        print(f"succeeded on {e}")
        break
    except Exception:
        print(f"failed {e}")
Thanks for the report, will check it when I get a bit of time. One difference between Haven and pyreadstat is that, I think, Haven updates its sources from Readstat much less frequently. Maybe the newer version in pyreadstat is causing the issue? If so, you could try an older version of pyreadstat.
Another thing is that the error is coming from handle_note, which indicates, or at least suggests, that the offending character is not in the data itself but in a note attached to the file. Not sure how Haven handles those (maybe they don't read them, and that's why it is not failing?).
Actually, Haven does read the notes ... What do those look like?
Haven leaves the characters as their hex codes:
r$> df <- haven::read_dta("data/dta/cocaine.dta")
r$> attributes(df)
$class
[1] "tbl_df" "tbl" "data.frame"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56
$notes
[1] "These data are a subset of those used in the study Caulkins, J.P. and R. Pa
dman (1993), \x93Quantity Discounts and Quality Premia for Illicit Drugs\x94, Jo
urnal of the American Statistical Association, 88, 748-757"
[2] "1"
$names
[1] "price" "quant" "qual" "trend"
Right, this is what pyreadstat gets from Readstat just before trying to cast it to a Python str, which is where the error is raised:
b'These data are a subset of those used in the study Caulkins, J.P. and R. Padman (1993), \x93Quantity Discounts and Quality Premia for Illicit Drugs\x94, Journal of the American Statistical Association, 88, 748-757'
So, I guess the difference is that R is tolerant of bad characters, while Python is not ...
and this works:
>>> b = b'These data are a subset of those used in the study Caulkins, J.P. and R. Padman (1993), \x93Quantity Discounts and Quality Premia for Illicit Drugs\x94, Journal of the American Statistical Association, 88, 748-757'
>>> b.decode('WINDOWS-1252')
'These data are a subset of those used in the study Caulkins, J.P. and R. Padman (1993), “Quantity Discounts and Quality Premia for Illicit Drugs”, Journal of the American Statistical Association, 88, 748-757'
>>> b.decode('CP1252')
'These data are a subset of those used in the study Caulkins, J.P. and R. Padman (1993), “Quantity Discounts and Quality Premia for Illicit Drugs”, Journal of the American Statistical Association, 88, 748-757'
But this does not:
>>> b.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 88: invalid start byte
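For comparison, Python can be made tolerant in a similar way by passing an error handler to bytes.decode. The following is a minimal sketch on a shortened copy of the note bytes above; the variable names are illustrative, and this is not what pyreadstat does internally.

```python
# Shortened copy of the note bytes from this thread; \x93 and \x94 are
# the Windows-1252 curly quotes.
note = b"Padman (1993), \x93Quantity Discounts\x94, Journal"

# Strict decoding (the default) raises on the first invalid byte.
try:
    note.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "backslashreplace" keeps each undecodable byte as an escape sequence,
# which resembles the \x93 ... \x94 shown in the Haven output.
escaped = note.decode("utf-8", errors="backslashreplace")

# "replace" substitutes U+FFFD for each undecodable byte.
replaced = note.decode("utf-8", errors="replace")

print(strict_failed)
print(escaped)
print(replaced)
```

Neither handler recovers the original quotes; they only avoid the exception, so the text is still lossy or escaped.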
Now, the thing is that Readstat should be using iconv to translate from CP1252 or whatever other encoding to UTF-8, so that when Python receives the string it is already valid UTF-8, and it seems to me it is not doing that. It looks to me like iconv conversion is not implemented on the Readstat side for notes.
Would you report this to Readstat? I think they should do the conversion to UTF-8 using iconv for notes as well, for consistency, since they already do it for every other string: those in values, column names, column labels, etc. It also happened in the past that they were not doing it for labels, and they implemented it.
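To illustrate the kind of conversion being asked for, here is a rough Python equivalent of transcoding a CP1252 string to UTF-8 before handing it to the caller. It is only a sketch: transcode_to_utf8 is a hypothetical helper, Readstat itself is written in C and uses iconv, and the source encoding is assumed to be known.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw bytes from source_encoding into UTF-8.

    Hypothetical helper for illustration; it mimics what an iconv
    conversion from source_encoding to UTF-8 would produce.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Shortened copy of the note bytes from this thread.
note = b"Padman (1993), \x93Quantity Discounts\x94, Journal"

utf8_note = transcode_to_utf8(note, "cp1252")

# After transcoding, strict UTF-8 decoding succeeds and the curly
# quotes are preserved.
text = utf8_note.decode("utf-8")
print(text)
```

With the bytes already converted like this, the strict str conversion on the Python side would no longer raise.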
I can open an issue in the readstat repo, but FYI this problem is not limited to just notes.
Depending on the file, this can happen in the notes (as before, src), value labels (src) and column labels (src).
# notes (as above) -- errors on line 796
wget http://www.principlesofeconometrics.com/stata/cocaine.dta
python -c 'import pyreadstat; pyreadstat.read_dta("cocaine.dta")'
# value labels -- errors on line 728
wget https://gss.norc.org/documents/stata/GSS_stata.zip
unzip GSS_stata.zip GSS_stata/gss7224_r1.dta
python -c 'import pyreadstat; pyreadstat.read_dta("GSS_stata/gss7224_r1.dta", row_limit = 10)'
# column labels -- errors on line 514
wget https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/FERTIN_L.xpt
python -c 'import pyreadstat; pyreadstat.read_xport("FERTIN_L.xpt")'
Thanks for all your help!
I am really surprised! Please open the issue in Readstat, and let's see what their opinion is.
Maybe you could have it just try/except with a message on a failure to read the notes, or let someone set a flag to false to skip them? Readstat doesn't fail on its own, because these files load for me in https://github.com/jrothbaum/polars_readstat (though I could just be skipping that step in my implementation: I started from SAS and haven't necessarily updated things to pick up any additional metadata that Stata or SPSS files might have).
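A try/except fallback along those lines might look roughly like the sketch below. The function name decode_with_fallback and its fallback_encoding parameter are hypothetical and are not part of pyreadstat's API; the idea is only to show strict UTF-8 first, then a user-supplied encoding, then a lossy decode with a warning instead of an exception.

```python
import warnings
from typing import Optional


def decode_with_fallback(raw: bytes, fallback_encoding: Optional[str] = None) -> str:
    """Decode metadata bytes without raising (hypothetical helper)."""
    # 1. Try strict UTF-8, which is what the caller expects to receive.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # 2. Try the encoding supplied by the user, if any.
    if fallback_encoding is not None:
        try:
            return raw.decode(fallback_encoding)
        except (UnicodeDecodeError, LookupError):
            pass
    # 3. Give up on exactness: warn and replace the invalid bytes.
    warnings.warn("could not decode metadata string; replacing invalid bytes")
    return raw.decode("utf-8", errors="replace")


# Shortened copy of the note bytes from this thread.
note = b"(1993), \x93Quantity Discounts\x94"
print(decode_with_fallback(note, "cp1252"))
```

The order is a design choice: a lossy result with a warning still lets the data load, while a skip flag would drop the notes entirely.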