
Improvements to reading in chunks and multiprocessing to speed up reading large files

Open gdementen opened this issue 9 months ago • 7 comments

This might be related to #193, but I tried to investigate the situation and found some interesting bits.

I have a 77 GB sas7bdat file (see below for some details about it). Since it does not fit in my server's memory, I tried pyreadstat.read_file_in_chunks(), which worked, but took 25 hours to complete (using the default chunksize of 100_000).

First, reading only the metadata of that file (using empty_df, meta = pyreadstat.read_sas7bdat(src_path, metadataonly=True)) took more than 7 minutes! I don't think it is supposed to be that slow, but I guess this is a readstat problem rather than a pyreadstat problem.
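
For reference, the two calls look roughly like this (src_path is a placeholder for the actual file path):

```python
import pyreadstat

src_path = "big_file.sas7bdat"  # placeholder for the actual 77 GB file

# Metadata-only read: on this file it already took more than 7 minutes.
empty_df, meta = pyreadstat.read_sas7bdat(src_path, metadataonly=True)
print(meta.number_rows, meta.number_columns)

# Chunked read with the default chunksize of 100_000 rows (~25 hours in total).
reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sas7bdat, src_path,
                                        chunksize=100_000)
for df, meta in reader:
    ...  # per-chunk processing (in my case, appending to a parquet file)
```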

However, after some investigation, I think the way chunking is implemented in pyreadstat (by repeatedly calling the "normal" read function) also makes things a lot slower than they could be, especially with "small" chunk sizes.

Even when reading the file repeatedly (so that, I assume, Windows caches it), there is a fixed overhead of around 10 seconds per chunk, which I suppose is related to reading the metadata over and over again for each chunk. Here are some timings for reading just the first chunk with varying chunk sizes. As you can see, once you subtract approximately 10 seconds from each, the time is almost perfectly linear in the chunk size, as it should be.

| Chunk size | Time for first chunk | Rows per second |
|-----------:|---------------------:|----------------:|
| 100_000    | 10.96 seconds        | 9124            |
| 200_000    | 12.97 seconds        | 15417           |
| 400_000    | 14.67 seconds        | 27274           |
| 800_000    | 19.30 seconds        | 41455           |
| 1_000_000  | 20.99 seconds        | 47639           |
| 2_000_000  | 30.20 seconds        | 66232           |
| 4_000_000  | 51.29 seconds        | 77991           |
| 8_000_000  | 87.97 seconds        | 90943           |
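
These timings were measured with something along these lines (a minimal sketch, assuming read_file_in_chunks is a lazy generator that only does the work when advanced):

```python
import time
import pyreadstat

src_path = "big_file.sas7bdat"  # placeholder path

for chunksize in (100_000, 200_000, 400_000, 800_000, 1_000_000,
                  2_000_000, 4_000_000, 8_000_000):
    reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sas7bdat, src_path,
                                            chunksize=chunksize)
    start = time.perf_counter()
    df, meta = next(reader)                 # read only the first chunk
    elapsed = time.perf_counter() - start
    print(f"{chunksize}: {elapsed:.2f} seconds "
          f"({len(df) / elapsed:.0f} rows per second)")
```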

I also noticed that reading gets slower and slower as the row offset increases. I suppose there is some overhead in seeking to a given row offset in the file. This probably cannot be helped on the pyreadstat side of things, but it might be useful to investigate nonetheless.

| Chunk size | First chunk                                     | Last chunk                                      |
|-----------:|-------------------------------------------------|-------------------------------------------------|
| 5_000_000  | 52.19 seconds (95799 rows per second)           | 1 minute 31.37 seconds (54722 rows per second)  |
| 10_000_000 | 1 minute 34.19 seconds (106168 rows per second) | 2 minutes 30.06 seconds (66642 rows per second) |

In any case, I think it would help a lot (and work around both problems) if the chunking approach were changed to avoid paying the "setup overhead" (reading the metadata and whatever else is done per file) over and over: read chunks one after another instead of closing the file and seeking back to the "current" chunk each time.
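
To make the idea concrete, here is a purely hypothetical sketch of what such an API could look like on the user side (none of these functions exist in pyreadstat today):

```python
import pyreadstat

src_path = "big_file.sas7bdat"  # placeholder path

# Hypothetical API sketch -- open_sas7bdat() and read_next_chunk() do NOT
# exist in pyreadstat; this only illustrates the proposed access pattern.
with pyreadstat.open_sas7bdat(src_path) as parser:   # file opened, metadata read once
    meta = parser.meta
    while True:
        df = parser.read_next_chunk(100_000)         # continue from the current position
        if df is None or df.empty:
            break
        ...  # per-chunk processing
```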

I don't think this is relevant, but I am on 64-bit Windows, using Python 3.12 from Anaconda, and pyreadstat is installed via pip.

Relevant METADATA about my file

file_encoding = WINDOWS-1252
file_format = sas7bdat
number_columns = 32
number_rows = 385169714

readstat_variable_types = {'year': 'double', 'ID_DEMO_C': 'string', 'CD_SEX': 'string', 'MS_AGE': 'double', 'CD_REFNIS': 'double', 'arr': 'double', 'reg': 'string', 'CD_NATLTY': 'string', 'natgroup': 'string', 'CD_CNTRY_BTH': 'string', 'CD_FST_NATLTY': 'string', 'HH_TYPE_LIPRO': 'double', 'HH_POS_LIPRO': 'double', 'DT_REFDATE': 'double', 'DT_BTH': 'double', 'CD_CIV': 'string', 'CD_REG': 'string', 'ID_DEMO_HH_HD_C': 'string', 'ID_HH_MEMBR': 'double', 'CD_REL_HH_HD': 'string', 'ID_HH_C': 'string', 'ID_DEMO_PAR_1_C': 'string', 'ID_DEMO_PAR_2_C': 'string', 'CD_SEX_PAR_1': 'string', 'CD_SEX_PAR_2': 'string', 'CD_FST_NATLTY_PAR_1': 'string', 'CD_FST_NATLTY_PAR_2': 'string', 'CD_NATLTY_PAR_1': 'string', 'CD_NATLTY_PAR_2': 'string', 'CD_DSCNT_NATLTY': 'string', 'CD_NAT_DSCNT': 'double', 'CD_DSCNT': 'string'}

variable_storage_width = {'year': 4, 'ID_DEMO_C': 16, 'CD_SEX': 1, 'MS_AGE': 3, 'CD_REFNIS': 8, 'arr': 3, 'reg': 6, 'CD_NATLTY': 3, 'natgroup': 8, 'CD_CNTRY_BTH': 3, 'CD_FST_NATLTY': 3, 'HH_TYPE_LIPRO': 3, 'HH_POS_LIPRO': 3, 'DT_REFDATE': 8, 'DT_BTH': 8, 'CD_CIV': 1, 'CD_REG': 1, 'ID_DEMO_HH_HD_C': 16, 'ID_HH_MEMBR': 8, 'CD_REL_HH_HD': 2, 'ID_HH_C': 16, 'ID_DEMO_PAR_1_C': 16, 'ID_DEMO_PAR_2_C': 16, 'CD_SEX_PAR_1': 1, 'CD_SEX_PAR_2': 1, 'CD_FST_NATLTY_PAR_1': 3, 'CD_FST_NATLTY_PAR_2': 3, 'CD_NATLTY_PAR_1': 3, 'CD_NATLTY_PAR_2': 3, 'CD_DSCNT_NATLTY': 3, 'CD_NAT_DSCNT': 3, 'CD_DSCNT': 25}

gdementen avatar Apr 04 '25 13:04 gdementen

Thanks for the suggestion. I guess it is an idea worth exploring. First it has to be checked whether Readstat supports reading chunks one after another (there is a seek function at least). It would require major changes in Pyreadstat, though.

On the other hand, as far as I remember from some tests I did, there was a massive speed gain from using multiprocessing. With multiprocessing, every worker is independent and they do not communicate with each other, so each worker needs to read the metadata anyway; maybe this new approach would not bring that much benefit after all.

In your case I would try setting multiprocessing to True, setting the chunksize to a larger number (depending on what your RAM allows) and setting the number of CPUs to the maximum your machine has; maybe that helps with the issue for now.
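
Something like this (parameter names as in the README; adjust chunksize and num_processes to your RAM and number of cores):

```python
import pyreadstat

reader = pyreadstat.read_file_in_chunks(
    pyreadstat.read_sas7bdat, "/path/to/file.sas7bdat",
    chunksize=5_000_000,   # as large as your RAM allows
    multiprocess=True,
    num_processes=8,       # number of CPUs in your machine
)
for df, meta in reader:
    ...  # process each chunk
```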

ofajardo avatar May 12 '25 15:05 ofajardo

Haha! I did not realize multiprocessing was possible in combination with chunking. I thought it was an either/or thing. I was a bit misled by that sentence in the README:

A challenge when reading large files is the time consumed in the operation. In order to alleviate this pyreadstat provides a function "read_file_multiprocessing" to read a file in parallel processes using the python multiprocessing library. As it reads the whole file in one go you need to have enough RAM for the operation. If that is not the case look at Reading rows in chunks (next section).

But the chunking section, read in full, is very clear, so I am not sure you can do anything about my confusion 😉. Thank you for the pointer.

After my experiments, I indeed worked around the problem by increasing the chunksize.

With multiprocessing, every worker is independent and they do not communicate with each other, so each worker needs to read the metadata anyway; maybe this new approach would not bring that much benefit after all.

Well, assuming you do the largish refactoring of pyreadstat needed to keep a readstat parser instance around and reuse it for several chunks, leveraging that in the multiprocessing code does not seem too hard. I imagine each process could keep its own parser instance, and the main process could send "jobs" (just a row offset and a number of rows, in fact) to each worker process via queues. That would still mean far less metadata reading / parser initializing (once per process) than currently (once per chunk). Maybe I am too optimistic... 😉
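
A rough sketch of what I have in mind (with today's API each job still calls read_sas7bdat with row_offset/row_limit and therefore still pays the metadata cost per job; the comment marks where the reusable parser would live):

```python
import multiprocessing as mp
import pyreadstat

SRC_PATH = "big_file.sas7bdat"   # placeholder path

def worker(jobs):
    # In the proposed design, each worker would create ONE readstat parser here
    # and reuse it for every job it receives.
    while True:
        job = jobs.get()
        if job is None:                      # sentinel: no more work
            break
        offset, nrows = job
        df, _ = pyreadstat.read_sas7bdat(SRC_PATH, row_offset=offset, row_limit=nrows)
        ...  # per-chunk processing, e.g. writing a parquet part file

if __name__ == "__main__":
    jobs = mp.Queue()
    workers = [mp.Process(target=worker, args=(jobs,)) for _ in range(4)]
    for w in workers:
        w.start()
    chunksize, total_rows = 1_000_000, 385_169_714
    for offset in range(0, total_rows, chunksize):
        jobs.put((offset, chunksize))        # a "job" is just (row offset, number of rows)
    for _ in workers:
        jobs.put(None)
    for w in workers:
        w.join()
```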

gdementen avatar May 14 '25 10:05 gdementen

I am glad to hear that your problem is better now with the existing solution.

Maybe you are right and there is still something to gain there, but it sounds complicated to me. Since right now I do not have much time to dedicate to this project, and there are other, more important challenging issues to solve, I will park this for now.

ofajardo avatar May 14 '25 11:05 ofajardo

Also, just out of curiosity: you said initially the file took 25 hours to complete. How long is it taking now? What chunksize and how many cores are you using?

ofajardo avatar May 14 '25 14:05 ofajardo

Maybe you are right and there is still something to gain there, but it sounds complicated to me. Since right now I do not have much time to dedicate to this project, and there are other, more important challenging issues to solve, I will park this for now.

If readstat supports that access pattern, I am pretty sure massive gains are possible with the default chunk size. However, given your limited time, the relatively easy workaround, and the fact that the needed changes would be large, I totally understand your position. Maybe adding a mention of this problem somewhere in a performance section (or elsewhere) would help?

gdementen avatar May 14 '25 17:05 gdementen

Also, just out of curiosity: you said initially the file took 25 hours to complete. How long is it taking now? What chunksize and how many cores are you using?

I haven't tried multiprocessing yet. The goal was to convert the file to Parquet, and now that that is done, I won't really need to do it again until next year, when we will receive new data. Just for my (and your) curiosity, I will try to run it later this week (if I don't forget).

gdementen avatar May 14 '25 17:05 gdementen

My thinking is that continuously feeding chunks to multiprocessing will not help too much, because the whole point of chunking is that you cannot fit the entire file in RAM and therefore can only process part of it at a time. So if you set your chunk to be as large as your RAM allows, it makes sense to divide that chunk among workers for parallelization, but it does not make sense to split it further into several sub-chunks (the sum of all sub-chunks cannot be larger than the parent chunk anyway, because of RAM); better to do it once per chunk, as we are doing now.

I have added a sentence to the multiprocessing section of the README pointing out that it can be combined with chunking, for clarity.

ofajardo avatar May 17 '25 10:05 ofajardo