Error reading SAS data set row with many numeric variables and character compression

Open mnizol opened this issue 3 years ago • 3 comments

When reading a certain SAS data set using pyreadstat.read_sas7bdat(), which uses ReadStat v1.1.7, I get the following error:

"A row in the file was not the expected length."

The data set in question uses character compression. This may be related to the closed issue https://github.com/WizardMac/ReadStat/issues/35.

I was able to narrow down the problem to a specific row, which I've attached below after obfuscating variable names and contents (the obfuscated version of the row copied below still exhibits the error).

compression_bug.sas7bdat.zip

mnizol, Oct 08 '21 22:10

@mnizol It would be useful to know whether this file opens successfully in other packages, e.g. the Python sas7bdat package.

Technical notes for myself:

The decompression is tripping up on the control character 0x68, which decodes as a blank insertion of length 256*8 + 17 + (value of next byte), which exceeds the length of the output buffer. This control code is not necessarily the problem, since others also have decompression lengths longer than 256. If I recall correctly, there is/was some disagreement about whether the length multiplier is 256 or something else (64?).
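To make the arithmetic concrete, here is a minimal Python sketch of the length calculation described above. It is not ReadStat's code. The function names are hypothetical, and it assumes the factor 8 in the formula comes from the low nibble of 0x68. The `multiplier` parameter stands in for the disputed constant (256 vs. possibly 64).

```python
def blank_run_length(control: int, next_byte: int, multiplier: int = 256) -> int:
    """Length of the blank run for a control byte such as 0x68.

    Assumption: the low nibble of the control byte is the factor that gets
    multiplied (8 for 0x68); `multiplier` is the disputed constant.
    """
    low_nibble = control & 0x0F
    return multiplier * low_nibble + 17 + next_byte


def overflows(written: int, run_length: int, buffer_len: int) -> bool:
    """True if appending `run_length` blanks would overrun the output buffer."""
    return written + run_length > buffer_len


# Compare the range of run lengths for 0x68 under both candidate multipliers.
for multiplier in (256, 64):
    shortest = blank_run_length(0x68, 0x00, multiplier)
    longest = blank_run_length(0x68, 0xFF, multiplier)
    print(f"multiplier={multiplier}: {shortest}..{longest} blanks")
```

Under these assumptions, a multiplier of 256 gives a run of at least 2065 blanks for 0x68 whatever the next byte is, while a multiplier of 64 gives at least 529. Comparing those figures against the row length of the attached file might help indicate which multiplier is plausible.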

evanmiller, Nov 08 '21 14:11

Likely duplicate of #245

evanmiller, Nov 08 '21 14:11

Further notes: both files have unrecognized control codes 1, 2, and 3. These will need investigation.

evanmiller, Nov 08 '21 15:11