
working with big SPSS files

AtanasAtanasovIpsos opened this issue 4 years ago • 10 comments

First I want to say this library is great! We have some raw SPSS files that are extremely large (about 6 GB with 1.3 million variables). SPSS itself can work with those. pyreadstat, however, cannot handle them, even with the option to read the metadata only. It fails while there is still plenty of RAM available in the system (Python's usage is about 1.5 GB, and the machine has 64 GB). The traceback is as follows:

File "C:\Users\thomas\Downloads\Ipsos\Carlsberg\build_column_overview.py", line 32, in <module>
    df, meta = pyreadstat.read_sav(os.path.join(path, file), metadataonly=True)
  File "pyreadstat\pyreadstat.pyx", line 325, in pyreadstat.pyreadstat.read_sav
  File "pyreadstat\_readstat_parser.pyx", line 945, in pyreadstat._readstat_parser.run_conversion
  File "pyreadstat\_readstat_parser.pyx", line 784, in pyreadstat._readstat_parser.run_readstat_parser
  File "pyreadstat\_readstat_parser.pyx", line 714, in pyreadstat._readstat_parser.check_exit_status
ReadstatError: Unable to allocate memory

This happens on both Windows 10 (64-bit) and Linux (64-bit) with Python 3.8 (64-bit) and pyreadstat 1.0.2.

Now, I understand that SPSS is probably not the best file format for this data, but unfortunately, that is what we have.

AtanasAtanasovIpsos avatar Sep 21 '20 09:09 AtanasAtanasovIpsos

Thanks for the report. Would you be able to produce some Python code that, using pyreadstat.write_sav, generates a large sample file that raises the error on your end? That would let me reproduce the issue without you needing to transfer the file (just the code to produce it).

ofajardo avatar Sep 21 '20 10:09 ofajardo

Thanks for the reply. I will try playing with write_sav and see if I can produce such a file.

AtanasAtanasovIpsos avatar Sep 21 '20 10:09 AtanasAtanasovIpsos

Here is example code that generates a file of about 84.6 MB that cannot be read back due to the same error.

import pandas as pd
import numpy as np
import pyreadstat

# One row, 1.3 million columns: enough to trigger the allocation error.
N = 1300000
dataset = pd.DataFrame(np.random.randn(1, N), columns=['A' + str(x) for x in range(1, N + 1)])
pyreadstat.write_sav(dataset, 'DataFile.sav')

# Reading the file back fails with "Unable to allocate memory",
# even when requesting the metadata only.
df, meta = pyreadstat.read_sav('DataFile.sav', metadataonly=True)

AtanasAtanasovIpsos avatar Sep 21 '20 13:09 AtanasAtanasovIpsos

Hi, ReadStat restricts individual memory allocations to 16 MB; this is to prevent denial-of-service scenarios with malformed data. With 1.3 million variables in your file, you are likely hitting that limit with the column metadata.
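
As a quick back-of-envelope sketch (assuming, purely for illustration, on the order of 16 bytes of metadata per variable; the real per-variable cost depends on the ReadStat build), 1.3 million variables would push a single metadata allocation past the 16 MB cap:

n_vars = 1_300_000
assumed_bytes_per_var = 16              # hypothetical per-variable metadata cost
cap = 16 * 1024 * 1024                  # ReadStat's 16 MB single-allocation limit
total = n_vars * assumed_bytes_per_var  # ~20.8 MB
print(total > cap)                      # True: one allocation would exceed the cap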

Some options are: 1) increasing the limit, 2) adding an option to specify the limit, and 3) removing the limit altogether.

evanmiller avatar Sep 21 '20 16:09 evanmiller

The second option would be the best one for me. Or something like:

df, meta = pyreadstat.read_sav('DataFile.sav', metadataonly=True, safetylimits=False)

AtanasAtanasovIpsos avatar Sep 21 '20 17:09 AtanasAtanasovIpsos

That's a good suggestion.

Given the experience with pyreadr, I think 1 is not good, because there will always be somebody with a larger file who will hit the new limit. I personally think removing it would be better, as was done in pyreadr. That will be less confusing for users, as they won't need to be aware of an extra flag to deactivate the limit.

ofajardo avatar Sep 21 '20 18:09 ofajardo

You are right about the bigger files. Now that I think more about it, removing the limit also seems like a good solution :)

AtanasAtanasovIpsos avatar Sep 21 '20 19:09 AtanasAtanasovIpsos

Hi @evanmiller, is this something coming in ReadStat version 1.1.5, or not yet? (Just for clarity.)

ofajardo avatar Dec 03 '20 17:12 ofajardo

@ofajardo No solution yet

evanmiller avatar Dec 03 '20 18:12 evanmiller

@evanmiller Ok thanks!

ofajardo avatar Dec 03 '20 18:12 ofajardo