pyreadstat Unable to round trip some files with some specific column and cell values

Describe the issue: writing a DataFrame with write_sav and reading it back with read_sav fails for certain combinations of column names and cell values.

To Reproduce

To trigger it you need three things: one cell containing 755 ASCII letters followed by a non-ASCII character, a column whose name is 5 letters followed by a 2, and another column whose name starts with an ASCII letter and ends in a 0.

Here's an example to generate them:

from __future__ import annotations

import pathlib
import os
import io
import tempfile

import pandas as pd
import pyreadstat


"""
numpy==1.20.2
pandas==1.2.4
pyreadstat==1.1.0
python-dateutil==2.8.1
pytz==2021.1
six==1.15.0
"""


def main():
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = pathlib.Path(tmp)
        dst_path = os.fsdecode(tmp_path / "eg.sav")

        df = pd.read_csv(io.StringIO('aaaaa2,y,a0\n\n"' + ("a" * 755) + 'ü"'))
        pyreadstat.write_sav(
            dst_path=tmp_path / "eg.sav",
            df=df,
            column_labels=["x", "y", "z"],
        )
        pyreadstat.read_sav(dst_path)


if __name__ == "__main__":
    main()

this results in:

Traceback (most recent call last):
  File "foo.py", line 37, in <module>
    main()
  File "foo.py", line 33, in main
    pyreadstat.read_sav(dst_path)
  File "pyreadstat/pyreadstat.pyx", line 342, in pyreadstat.pyreadstat.read_sav
  File "pyreadstat/_readstat_parser.pyx", line 1034, in pyreadstat._readstat_parser.run_conversion
  File "pyreadstat/_readstat_parser.pyx", line 845, in pyreadstat._readstat_parser.run_readstat_parser
  File "pyreadstat/_readstat_parser.pyx", line 775, in pyreadstat._readstat_parser.check_exit_status
pyreadstat._readstat_parser.ReadstatError: Unable to convert string to the requested encoding (invalid byte sequence)

Expected behavior: I'd expect to be able to round trip it

Setup Information:
How did you install pyreadstat? pip, see pip freeze output above
Platform: Ubuntu 20.04.2 LTS
Python Version: Python 3.8.5 (default, Jan 27 2021, 15:41:15)
Using Virtualenv or condaenv? python3.8 -m venv

Apr 23 '21 09:04 graingert

thanks for the reproducible report. It seems to be coming from the C library, so I filed an issue over there.

Apr 23 '21 10:04 ofajardo

it's also odd because changes like

-        df = pd.read_csv(io.StringIO('aaaaa2,y,a0\n\n"' + ("a" * 755) + 'ü"'))
+        df = pd.read_csv(io.StringIO('aaaaa3,y,a0\n\n"' + ("a" * 755) + 'ü"'))

don't cause the failure

Apr 23 '21 10:04 graingert

super strange ... I will report that in the issue in Readstat

Apr 23 '21 12:04 ofajardo

It is possible to reproduce this error without any international character (using only 'a's in this example) if the length of the string is at least 757, in contrast to 756 if the international character is present. Another condition for reproducing it is that the numerical values must be NaN; if they are, say, 1.0, then everything is fine. The issue can be reproduced in pure C code using Readstat, meaning it is not a failure caused by Python or pyreadstat, see this
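One way to read the two thresholds above (756 characters with the international character, 757 without) is to compare the strings by encoded size rather than by character count. The sketch below does only that arithmetic. It assumes the ü in the original report is the precomposed code point U+00FC and that the string ends up stored as UTF-8. Both are assumptions, and the helper name `utf8_len` is made up for illustration. It does not call pyreadstat or Readstat, and it does not establish the root cause.

```python
# Compare the two trigger strings from this thread by character count and
# by UTF-8 byte count. Assumption: the "ü" is the precomposed U+00FC.
with_umlaut = "a" * 755 + "\u00fc"  # original report: 755 ASCII letters + ü
ascii_only = "a" * 757              # later finding: 757 plain ASCII letters


def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


for label, value in [("with ü", with_umlaut), ("ASCII only", ascii_only)]:
    print(label, "chars:", len(value), "utf-8 bytes:", utf8_len(value))
```

Under those assumptions both strings come out at the same byte length, which would be consistent with the trigger depending on the stored byte length rather than on the non-ASCII character itself.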

Dec 15 '21 16:12 ofajardo
