pyreadstat
Unable to round trip some files with some specific column and cell values
Describe the issue A file written with write_sav cannot be read back with read_sav when it contains certain column names and cell values; read_sav raises a ReadstatError.
To Reproduce
In one cell you need 755 ASCII letters followed by a non-ASCII character. You also need one column name with 5 letters ending in a 2, and another column name starting with an ASCII letter and ending in a 0.
Here's an example to generate them:
from __future__ import annotations

import pathlib
import os
import io
import tempfile

import pandas as pd
import pyreadstat

"""
numpy==1.20.2
pandas==1.2.4
pyreadstat==1.1.0
python-dateutil==2.8.1
pytz==2021.1
six==1.15.0
"""


def main():
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = pathlib.Path(tmp)
        dst_path = os.fsdecode(tmp_path / "eg.sav")
        df = pd.read_csv(io.StringIO('aaaaa2,y,a0\n\n"' + ("a" * 755) + 'ü"'))
        pyreadstat.write_sav(
            dst_path=tmp_path / "eg.sav",
            df=df,
            column_labels=["x", "y", "z"],
        )
        pyreadstat.read_sav(dst_path)


if __name__ == "__main__":
    main()
this results in:
Traceback (most recent call last):
  File "foo.py", line 37, in <module>
    main()
  File "foo.py", line 33, in main
    pyreadstat.read_sav(dst_path)
  File "pyreadstat/pyreadstat.pyx", line 342, in pyreadstat.pyreadstat.read_sav
  File "pyreadstat/_readstat_parser.pyx", line 1034, in pyreadstat._readstat_parser.run_conversion
  File "pyreadstat/_readstat_parser.pyx", line 845, in pyreadstat._readstat_parser.run_readstat_parser
  File "pyreadstat/_readstat_parser.pyx", line 775, in pyreadstat._readstat_parser.check_exit_status
pyreadstat._readstat_parser.ReadstatError: Unable to convert string to the requested encoding (invalid byte sequence)
Expected behavior I'd expect to be able to round trip it
Setup Information:
How did you install pyreadstat? pip, see pip freeze output above
Platform: Ubuntu 20.04.2 LTS
Python Version Python 3.8.5 (default, Jan 27 2021, 15:41:15)
Using Virtualenv or condaenv? python3.8 -m venv
thanks for the reproducible report. It seems to be coming from the C library, so I filed an issue over there.
it's also odd because changes like
- df = pd.read_csv(io.StringIO('aaaaa2,y,a0\n\n"' + ("a" * 755) + 'ü"'))
+ df = pd.read_csv(io.StringIO('aaaaa3,y,a0\n\n"' + ("a" * 755) + 'ü"'))
don't cause the failure
super strange ... I will report that in the issue in Readstat
It is possible to reproduce this error without any international character (using only 'a's in this example) if the string is at least 757 characters long, in contrast to 756 when the international character is present. Another requirement for reproducing it is that the numerical values must be NaNs; if they are, say, 1.0, everything is fine. The issue can be reproduced in pure C code using Readstat, meaning it is not a failure caused by Python or pyreadstat, see this
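One way to read the two thresholds (756 characters with the international character, 757 without) is that they coincide once lengths are counted in encoded bytes rather than characters. The sketch below is only a sanity check of that arithmetic. It assumes the text is encoded as UTF-8, which the thread does not state, and the helper name `lengths` is made up for illustration; it does not call pyreadstat or Readstat.

```python
# Compare character counts and UTF-8 byte counts for the two failing strings.
# Assumption: UTF-8 encoding. The thread does not say which encoding is used
# when the .sav file is written.


def lengths(s):
    """Return (number of characters, number of UTF-8 bytes) for s."""
    return len(s), len(s.encode("utf-8"))


# String from the original report: 755 ASCII letters plus one non-ASCII letter.
with_umlaut = "a" * 755 + "\u00fc"  # "\u00fc" is the precomposed "ü"

# Shortest pure-ASCII string reported to trigger the same error.
ascii_only = "a" * 757

print(lengths(with_umlaut))  # (756, 757)
print(lengths(ascii_only))   # (757, 757)
```

Both strings come out at 757 bytes under this assumption, which would be consistent with a limit tied to byte length, though this snippet alone does not confirm where the limit lives.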