Sav files created by IBM's proprietary SPSS Modeler software on IBM Cloud are not read correctly by Pyreadstat (Python) / Haven (R)

Open ananjay-gurjar-ibm opened this issue 1 year ago • 1 comments

Consider the sav file in sav-read-issue.zip, created by IBM's proprietary SPSS Modeler on IBM Cloud. The original data and the corresponding metadata written to the file, as read by SPSS Statistics:

The same file, when read with Python's Pyreadstat and R's Haven libraries, shows up as below:

Clearly, from the metadata in the above screenshot, Python was able to figure out that the string column is A16 (i.e. an alphanumeric string of 16 bytes), but it ended up reading only 8 bytes of data. The internal type code in the metadata, which specifies the length of a string column, is correctly set in the given file (ref. screenshot).

Since a sav file is a continuous stream of bytes, reading only 8 of the string's 16 bytes shifts every value that follows it, which explains the garbage value in the double column (i.e. num2).
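The shift can be reproduced without any SPSS tooling. The following is only a sketch, assuming an uncompressed row made of one 16-byte space-padded string followed by one little-endian double; the values and the helper `read_row` are hypothetical and not taken from the attached file:

```python
import struct

# Hypothetical row: a 16-byte space-padded string followed by one double.
# Little-endian byte order is an assumption made for this illustration.
row = b"hello world".ljust(16, b" ") + struct.pack("<d", 3.14)

def read_row(buf, string_width):
    """Read `string_width` bytes as text, then the next 8 bytes as a double."""
    text = buf[:string_width].decode("utf-8").rstrip()
    (number,) = struct.unpack_from("<d", buf, string_width)
    return text, number

correct = read_row(row, 16)  # reader consumes the full declared A16 width
shifted = read_row(row, 8)   # reader consumes only 8 bytes of the string

print(correct)
print(shifted)
```

With the full width, the double is recovered intact. With only 8 bytes consumed, the string is truncated and the "double" is built from the remaining string bytes, so it comes out as a meaningless number.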

This problem also leads Python and R to report "Unable to convert string to the requested encoding (invalid byte sequence)" for files containing multiple rows. I suspect this comes from the underlying library (ReadStat) trying, from the second row onward, to decode bytes that were written for double data as a string (strings are UTF-8 encoded).
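A rough illustration of that suspicion, under the same assumptions as before (a 16-byte string plus a little-endian double per row, with made-up values). It does not call ReadStat; it only shows that a reader stepping through rows with the wrong row size lands on double bytes where it expects text:

```python
import struct

def make_row(text, number):
    """Hypothetical row layout: 16-byte padded string + 8-byte double."""
    return text.encode("utf-8").ljust(16, b" ") + struct.pack("<d", number)

data = make_row("hello world", 3.14) + make_row("second row", 2.5)

# Row size a reader would use if it treated the string as 8 bytes wide.
ASSUMED_ROW = 8 + 8

# Where such a reader expects the second row's string to start.
# In the real layout these 8 bytes are the first row's double.
second_string = data[ASSUMED_ROW:ASSUMED_ROW + 8]

try:
    second_string.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(decode_failed)
```

Whether decoding fails depends on the numeric value involved; some doubles happen to form valid UTF-8, which would produce garbled text instead of an error.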

cc: @sainathmekala22

Oct 21 '24 10:10 ananjay-gurjar-ibm

This is interesting. Most SPSS files emit "blank variables" (continuation variable records) for each additional 8 bytes of a string variable. ReadStat relies on this structure to determine the data locations. See: https://www.gnu.org/software/pspp/pspp-dev/html_node/Variable-Record.html

The file you provided does not follow this convention. The internal structure is more logical, but it also deviates from the established norm, so ReadStat doesn't read it correctly.
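To make the convention concrete, here is a toy model (not ReadStat's actual code) of how the row size falls out of counting variable records. It assumes the layout described in the linked PSPP documentation, where a string variable's record carries its width as the type code and each further 8-byte segment gets a record with type code -1. The function names are hypothetical:

```python
import math

SEGMENT = 8  # bytes per data slot in a case

def conventional_type_codes(widths):
    """Type codes in the conventional layout.

    `widths` uses 0 for a numeric variable and a positive width for a string.
    A string head record carries its width; each extra 8-byte segment is
    represented by a continuation record with type code -1.
    """
    codes = []
    for width in widths:
        codes.append(width)
        if width > 0:
            codes.extend([-1] * (math.ceil(width / SEGMENT) - 1))
    return codes

def row_bytes_by_record_count(type_codes):
    """Toy reader: assumes exactly one 8-byte slot per variable record."""
    return SEGMENT * len(type_codes)

conventional = conventional_type_codes([16, 0])  # A16 string, then a double
without_blanks = [16, 0]                         # same variables, no -1 records

print(conventional, row_bytes_by_record_count(conventional))
print(without_blanks, row_bytes_by_record_count(without_blanks))
```

With the continuation record present, the counted row size matches the 24 bytes actually stored per row. Without it, a record-counting reader arrives at 16 bytes, which is consistent with the 8-byte string read described in the report.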

May 24 '25 23:05 evanmiller