pyreadstat
Improve SAS7BDAT reader performance
I have been working on improving the performance of Pandas' SAS7BDAT reader over the past couple of days. pyreadstat was a great reference for fixing some bugs, thanks a lot for making this project!
I realized that the Pandas reader is much faster, at least for the type of files I'm using.
As an example, for a large file with several million rows, pyreadstat takes around 20 s to read (in both chunked and non-chunked mode), while the upstream Pandas parser takes around 1 s and my optimized parser (at https://github.com/jonashaag/pandas/tree/fast-sas7bdat-2) takes around 50 ms.
It looks like the bulk of the time is spent in RLE decompression and scanning pages.
Unfortunately I can't share the file, but maybe someone else has an example file that can be published.
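For anyone who wants to reproduce the comparison on their own file, a rough timing sketch could look like the one below. It is not the script used for the numbers above. The helper names `time_call` and `compare_readers`, the chunk size of 100,000 rows, and the number of repeats are all assumptions made for illustration. It only times the public readers (`pyreadstat.read_sas7bdat`, `pyreadstat.read_file_in_chunks`, and `pandas.read_sas`).

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare_readers(path, repeats=3, chunksize=100_000):
    """Time pyreadstat (whole file and chunked) against pandas on one file.

    `path` is a SAS7BDAT file you supply; `chunksize` is an arbitrary choice.
    """
    # Imported here so that time_call stays usable without either library.
    import pandas as pd
    import pyreadstat

    def read_chunked():
        # Iterate over all chunks and discard them; only the read time matters.
        reader = pyreadstat.read_file_in_chunks(
            pyreadstat.read_sas7bdat, path, chunksize=chunksize
        )
        for _df, _meta in reader:
            pass

    return {
        "pyreadstat": time_call(lambda: pyreadstat.read_sas7bdat(path), repeats),
        "pyreadstat (chunked)": time_call(read_chunked, repeats),
        "pandas": time_call(lambda: pd.read_sas(path, format="sas7bdat"), repeats),
    }
```

Calling `compare_readers("your_file.sas7bdat")` returns a dict of timings in seconds. Taking the best of several runs reduces noise from disk caching, but the first run may still be slower than the rest.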
wow! that's a major improvement, congrats!
I wonder what kind of files these are, where the pandas parser is faster than pyreadstat (and your new parser even faster). With my own files it was always the opposite, so I guess these files have some special characteristic that causes this.
In any case, it is not that simple for me to improve the situation: pyreadstat wraps ReadStat, so the fix would have to happen there first. With an example file and an understanding of what is going on, one could file an issue with ReadStat, but since you are fixing it in pandas, I guess the motivation is not that high.
You are right, I'm going to report this upstream.
As I cannot act on anything here until there are changes in ReadStat, I am going to close this for now.