
Improve SAS7BDAT reader performance

Open jonashaag opened this issue 3 years ago • 2 comments

I have worked on improving Pandas' SAS7BDAT reader performance in the past couple of days. pyreadstat was a great reference for fixing some bugs, thanks a lot for making this project!

I realized that the Pandas reader is much faster than pyreadstat, at least for the type of files I'm using.

As an example, for a large file with several million rows, pyreadstat takes around 20 s to read (both in chunked and non-chunked mode), while the upstream Pandas parser takes around 1 s and my optimized parser (at https://github.com/jonashaag/pandas/tree/fast-sas7bdat-2) takes around 50 ms.
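For anyone who wants to try the same comparison on their own file, a minimal timing sketch could look like the following. It is not the script used for the numbers above; the helper names `best_time` and `benchmark_sas_readers` and the file name in the usage note are made up for illustration, and it assumes both pandas and pyreadstat are installed.

```python
import time
from typing import Callable, Dict


def best_time(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def benchmark_sas_readers(path: str, repeats: int = 3) -> Dict[str, float]:
    """Time a full read of one SAS7BDAT file with pandas and with pyreadstat."""
    # Imported lazily so best_time() stays usable without either library.
    import pandas as pd
    import pyreadstat

    return {
        "pandas.read_sas": best_time(
            lambda: pd.read_sas(path, format="sas7bdat"), repeats
        ),
        "pyreadstat.read_sas7bdat": best_time(
            lambda: pyreadstat.read_sas7bdat(path), repeats
        ),
    }
```

Calling `benchmark_sas_readers("example.sas7bdat")` (a placeholder path) returns a dict of seconds per reader. Taking the minimum over a few repeats reduces noise from disk caching, but results will still depend heavily on the file, for example on whether it is compressed.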

It looks like the bulk of the time is spent in RLE decompression and scanning pages.

Unfortunately I can't share the file but maybe someone else has a publishable example file.

jonashaag avatar May 30 '22 11:05 jonashaag

wow! that's a major improvement, congrats!

I wonder what kind of files these are, where the pandas parser is faster than pyreadstat (and your new parser even faster). With my own files it was always the opposite, so I guess these files have some special characteristic that causes this.

In any case, it is not that simple for me to improve the situation: I am wrapping ReadStat, so it would have to be fixed there first. I guess that with an example file and an understanding of what is going on, one could submit a request to ReadStat, but since you are fixing it in pandas, I guess the motivation is not that high.

ofajardo avatar May 30 '22 12:05 ofajardo

You are right, I'm going to report this upstream.

jonashaag avatar May 30 '22 12:05 jonashaag

As I cannot act on anything here until there are changes in ReadStat, I am going to close this for now.

ofajardo avatar Nov 07 '22 10:11 ofajardo