pandas icon indicating copy to clipboard operation
pandas copied to clipboard

SAS7BDAT parser: Fast byteswap

Open jonashaag opened this issue 3 years ago • 7 comments

Speed up SAS7BDAT int/float reading.

This is order of magnitude faster than using struct.unpack(fmt, data) or precompiled_unpacker = struct.Struct(fmt).unpack; ...; precompiled_unpacker(data).

Unfortunately Python does not expose a low-level interface to struct or a byteswapping interface. The byteswap implementation in this change is from pyreadstat.

Today this brings a modest 10-20% performance improvement. But together with the other changes I will be proposing it will be a major bottleneck.

  • [ ] closes #xxxx (Replace xxxx with the Github issue number)
  • [ ] Tests added and passed if fixing a bug or adding a new feature
  • [ ] All code checks passed.
  • [ ] Added type annotations to new arguments/methods/functions.
  • [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jonashaag avatar Jun 17 '22 11:06 jonashaag

One thing I realized is that we could also use NumPy's byteswapping, at the cost of around 10% performance relative to this impl.

jonashaag avatar Jul 04 '22 14:07 jonashaag

One thing I realized is that we could also use NumPy's byteswapping, at the cost of around 10% performance relative to this impl.

How much of this could be replaced with that?

jbrockmendel avatar Jul 06 '22 21:07 jbrockmendel

How much of this could be replaced with that?

Everything copied from readstat so ~ 50 lines of Cython code

jonashaag avatar Jul 07 '22 06:07 jonashaag

ASV from #47405:

       before           after         ratio
     [e915b0a4]       [435a003c]
     <main>           <sas/byteswap~1>
-      81.9±0.7ms       73.7±0.3ms     0.90  io.sas.SAS.time_read_sas7bdat_2_chunked
-      78.7±0.7ms       69.5±0.6ms     0.88  io.sas.SAS.time_read_sas7bdat_2

jonashaag avatar Jul 09 '22 07:07 jonashaag

Test failure seems unrelated

jonashaag avatar Jul 11 '22 17:07 jonashaag

I'm trying an intrinsics-based version: Fewer SLOC and faster.

jonashaag avatar Jul 21 '22 07:07 jonashaag

Test failure is unrelated. @jreback merge?

jonashaag avatar Jul 22 '22 09:07 jonashaag

@jonashaag seems like you didn't add the release note as requested, do you mind adding it?

Does it make sense to add a unit test for the new functions?

Also if you can please merge master.

datapythonista avatar Sep 10 '22 12:09 datapythonista

@datapythonista release notes are already merged with #47404. Will merge main.

jonashaag avatar Sep 10 '22 15:09 jonashaag

Do you want me to add unit tests?

jonashaag avatar Sep 10 '22 15:09 jonashaag

Do you want me to add unit tests?

Personally I think it should be useful. Probably just one parametrized to test the function for every type is enough.

datapythonista avatar Sep 10 '22 15:09 datapythonista

The release note in the other PR is not accurate now. The changes in this PR won't be available until pandas 1.6/2.0, while for what you mention the release note in the other PR mentions this issue/PR for 1.5. Probably no big deal, but if you want to open a small PR to update that note in the releases to only include the issues/PRs that have already been merged, that would leave things more accurate (ping me in that PR, so I backport it to 1.5). And then you can add another release note for 1.6 here.

datapythonista avatar Sep 10 '22 15:09 datapythonista

Or we merge the PRs noted in those release notes to 1.5? The PRs have been ready for review with no code changes for a long time.

jonashaag avatar Sep 10 '22 15:09 jonashaag

We already released pandas 1.5 release candidate, and we are only backporting regressions to it, not new features, performance improvements. It's also unclear to me this will be merged before the release.

I know this has been forgotten for a long time, sorry about that. But that's unfortunately part of how open source development works in a project like pandas.

datapythonista avatar Sep 10 '22 15:09 datapythonista

I added tests using Hypothesis and moved the byteswapping code to a module. The byteswapping code also be moved somewhere else outside the SAS stuff.

jonashaag avatar Sep 10 '22 22:09 jonashaag

Updated release notes. I took the liberty of including #47656 already.

jonashaag avatar Sep 15 '22 09:09 jonashaag

Thanks @jonashaag

mroeschke avatar Oct 05 '22 16:10 mroeschke