pandas
BUG: read_sas segfault
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
import pandas as pd
data = pd.read_sas('testdata4_nocompress.sas7bdat', format='sas7bdat', encoding='big5', chunksize=5000)
for d in data:
    print(d)
Issue Description
Hi team, when I use read_sas to read my SAS dataset, a segmentation fault always occurs in the for loop.
Expected Behavior
It should print each chunk of 5,000 rows in the for loop.
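To make the expectation concrete, here is a minimal sketch of the chunk sizes the loop would be expected to yield. The helper `expected_chunk_sizes` and the row count of 12,000 are hypothetical and used only for illustration; the sketch assumes the final chunk holds whatever rows remain.

```python
def expected_chunk_sizes(total_rows, chunksize=5000):
    """Return the row counts expected from chunked iteration.

    Hypothetical helper for illustration only; it is not part of pandas.
    Assumes the last chunk contains the leftover rows, if any.
    """
    full_chunks, remainder = divmod(total_rows, chunksize)
    sizes = [chunksize] * full_chunks
    if remainder:
        sizes.append(remainder)
    return sizes


# Hypothetical row count; the size of the reporter's file is not stated.
sizes = expected_chunk_sizes(12_000)
print(sizes)
```

Comparing these sizes against `len(d)` for each chunk in the loop would be one way to check the reader once it no longer crashes.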
Installed Versions
INSTALLED VERSIONS
commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : zh_TW.UTF-8
LOCALE : Chinese (Traditional)_Taiwan.950
pandas : 1.3.4
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Are you able to share a SAS file to reproduce this issue?
When I read the entire file (without chunksize), I can reproduce the segfault on master.
With chunksize, I get a UnicodeDecodeError:
File "pandas/pandas/core/strings/accessor.py", line 1795, in
    f = lambda x: decoder(x, errors)[0]
UnicodeDecodeError: 'big5' codec can't decode byte 0xf0 in position 0: illegal multibyte sequence
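For reference, the same class of error can be seen with the standard library alone. This is a minimal sketch and not the pandas code path: the bytes in `raw` are hypothetical, not taken from the reporter's file, and the exact error message may differ from the traceback above.

```python
# Hypothetical input for illustration: a lone 0xf0 byte is not a complete
# Big5 character, so strict decoding cannot succeed.
raw = b"\xf0"

try:
    raw.decode("big5")  # strict error handling is the default
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print(f"strict decode failed: {exc.reason}")

# With errors="replace", undecodable bytes become U+FFFD instead of raising.
lenient = raw.decode("big5", errors="replace")
print(repr(lenient))
```

This suggests the chunked read is handing the decoder bytes that are not valid Big5, which would be consistent with the reader having picked up the wrong bytes rather than with a problem in the codec itself.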
I changed the Chinese test data to English, so you can try again. Thank you! testdata.zip
This is fixed in one of https://github.com/pandas-dev/pandas/pull/47113 https://github.com/pandas-dev/pandas/pull/47115 (I don't remember which one)
We could close this issue if relevant tests have already been added; otherwise, a test needs to be added.