Notes and metadata not converted to utf-8

Open alipatti opened this issue 5 months ago • 0 comments

It appears that ReadStat is not converting the encoding of some metadata for Stata dta and SAS xpt files.

This came up in Roche/pyreadstat#298 because pyreadstat expects all text to be returned to it as utf-8 and errors when this is not the case. Tagging @ofajardo (pyreadstat maintainer).

Examples

Errors occur when reading notes from Stata .dta files (#73) ("These data are a subset of those used in the study Caulkins, J.P. and R. Padman (1993), \x93Quantity Discounts and Quality Premia for Illicit Drugs\x94, Journal of the American Statistical Association, 88, 748-757"):

wget http://www.principlesofeconometrics.com/stata/cocaine.dta

# errors because readstat returns notes as WINDOWS-1252 encoded text
python -c 'import pyreadstat; pyreadstat.read_dta("cocaine.dta")'
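
To illustrate why this fails, here is a minimal sketch that does not touch ReadStat or pyreadstat at all. It takes only the quoted title from the note above as raw bytes (an excerpt, not the full note as stored in the file) and shows that the bytes are not valid UTF-8 but decode cleanly as WINDOWS-1252. The variable names are made up for the example.

```python
# Excerpt of the note quoted above, as raw bytes.
# \x93 and \x94 are the WINDOWS-1252 curly double quotes.
raw_note = b"\x93Quantity Discounts and Quality Premia for Illicit Drugs\x94"

# Strict UTF-8 decoding rejects \x93, which cannot start a UTF-8 sequence.
try:
    raw_note.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding named in the comment above succeeds.
decoded_note = raw_note.decode("windows-1252")

print(utf8_ok)
print(decoded_note)
```

This only reproduces the decoding step in isolation; where the conversion should actually happen inside ReadStat is not shown here.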

For value labels ("don\xe2\x80�t know"):

wget https://gss.norc.org/documents/stata/GSS_stata.zip
unzip GSS_stata.zip GSS_stata/gss7224_r1.dta
python -c 'import pyreadstat; pyreadstat.read_dta("GSS_stata/gss7224_r1.dta", row_limit = 10)'

For column labels ("Ferritin(\xb5g/L)"):

wget https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/FERTIN_L.xpt
python -c 'import pyreadstat; pyreadstat.read_xport("FERTIN_L.xpt")'
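
As a rough sketch of what a caller-side fallback could look like, the snippet below tries UTF-8 first and then falls back to a legacy encoding. `decode_label` is a hypothetical helper written for this example, not part of ReadStat or pyreadstat, and the WINDOWS-1252 fallback is an assumption: the issue does not say which encoding the xpt file actually uses.

```python
def decode_label(raw: bytes, fallback: str = "windows-1252") -> str:
    """Hypothetical helper: try UTF-8, then an assumed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the text is WINDOWS-1252. This can still raise for
        # the few byte values that encoding leaves undefined.
        return raw.decode(fallback)


# Column label bytes as quoted above.
label = decode_label(b"Ferritin(\xb5g/L)")
label_utf8 = label.encode("utf-8")

print(label)
print(label_utf8)
```

A fallback like this guesses the source encoding, so it is only a stopgap; converting the metadata with the file's declared encoding in ReadStat itself would avoid the guess.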

Similar issue in flavor to #152 and #172.

alipatti, Jul 25 '25 14:07