BUG: read_sas segfault

Open wudihero2 opened this issue 3 years ago • 6 comments

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

data = pd.read_sas('testdata4_nocompress.sas7bdat', format='sas7bdat', encoding='big5', chunksize=5000)

for d in data:
    print(d)

Issue Description

Hi team, when I use read_sas to read my SAS dataset, a segmentation fault always occurs in the for loop. (Screenshot attached: error picture.)

Expected Behavior

The for loop should print each chunk of 5,000 rows.

Installed Versions

INSTALLED VERSIONS
commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python : 3.9.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : zh_TW.UTF-8
LOCALE : Chinese (Traditional)_Taiwan.950

pandas : 1.3.4
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Oct 28 '21 07:10 wudihero2

Are you able to share a SAS file to reproduce this issue?

Oct 30 '21 13:10 twoertwein

Please see attachment, thank you.

data.zip

Nov 01 '21 14:11 wudihero2

When I read the entire file (without chunksize), I can reproduce the segfault on master.

With chunksize, I get a UnicodeDecodeError:

  File "pandas/pandas/core/strings/accessor.py", line 1795, in <lambda>
    f = lambda x: decoder(x, errors)[0]
UnicodeDecodeError: 'big5' codec can't decode byte 0xf0 in position 0: illegal multibyte sequence
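For anyone trying to narrow this down, the decode error can be seen in isolation with Python's built-in big5 codec, without pandas or the attached file. This is only a minimal sketch: the byte string below is made up for illustration (it is not taken from the dataset), and it says nothing about whether read_sas offers a way to relax decoding.

```python
# Hypothetical input, NOT bytes from the attached file: 0xf0 is accepted as a
# Big5 lead byte, but 0x20 (space) cannot follow it as a trail byte.
raw = b"\xf0\x20abc"

# Strict decoding (the default) raises on the invalid pair.
try:
    raw.decode("big5")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print(f"strict decoding failed: {exc}")

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
lenient = raw.decode("big5", errors="replace")
print(repr(lenient))
```

If the same kind of byte sequence shows up in a column of the SAS file, it would suggest the column holds bytes that are not valid Big5 at that position (or that the bytes are being split or read at the wrong offset), rather than a problem in the codec itself.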

Nov 05 '21 21:11 twoertwein

I changed the Chinese test data to English. Please try again, thank you!

testdata.zip

Nov 07 '21 12:11 wudihero2

This is fixed in one of https://github.com/pandas-dev/pandas/pull/47113 https://github.com/pandas-dev/pandas/pull/47115 (I don't remember which one)

May 30 '22 12:05 jonashaag

This is fixed in one of #47113 #47115 (I don't remember which one)

We could close this issue if relevant tests have already been added; otherwise, a test needs to be added.

Aug 05 '22 18:08 simonjayhawkins