pandas icon indicating copy to clipboard operation
pandas copied to clipboard

SAS7BDAT parser: Speed up RLE/RDC decompression

Open jonashaag opened this issue 3 years ago • 8 comments

Speed up RLE/RDC decompression. Brings a 30-50% performance improvement on SAS7BDAT files using compression.

Works by avoiding calls into NumPy array creation and using a custom-built buffer instead.

Also adds a bunch of assert statements to avoid illegal reads/writes. These slow the code down considerably; I will try to improve on that in a future PR.

Alternatives considered:

  • Fast NumPy array creation: Didn't find a way to do it.

  • Using Python's bytearray: Much slower.

  • Using array.array: Much slower. Cython has a fast path but it is incompatible with PyPy.

  • [ ] closes #xxxx (Replace xxxx with the Github issue number)

  • [ ] Tests added and passed if fixing a bug or adding a new feature

  • [ ] All code checks passed.

  • [ ] Added type annotations to new arguments/methods/functions.

  • [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jonashaag avatar Jun 17 '22 11:06 jonashaag

@jbrockmendel mind reviewing this as well?

jonashaag avatar Jun 27 '22 18:06 jonashaag

@jonashaag thanks for your patience; im just coming off of a semi-vacation, starting to dig into the ping backlog now.

jbrockmendel avatar Jul 06 '22 20:07 jbrockmendel

Fast NumPy array creation: Didn't find a way to do it.

Which usage would you need to replace?

jbrockmendel avatar Jul 06 '22 20:07 jbrockmendel

cc @WillAyd

jbrockmendel avatar Jul 06 '22 20:07 jbrockmendel

Fast NumPy array creation: Didn't find a way to do it.

Which usage would you need to replace?

Essentially the call to calloc. Cython will always call into NumPy and that will be done thousands/millions of times for a SAS file.

jonashaag avatar Jul 07 '22 06:07 jonashaag

can you also add a whatsnew note

jreback avatar Jul 08 '22 23:07 jreback

@jonashaag can you rebase.

@jbrockmendel ok here?

jreback avatar Jul 10 '22 15:07 jreback

@jbrockmendel mind to review this? thanks! :)

jonashaag avatar Aug 08 '22 07:08 jonashaag

Works by avoiding calls into NumPy array creation and using a custom-built buffer instead.

where is the ndarray creation that is so expensive? i dont have any real objection here, but am not wild about introducing a new class/struct whose methods are glorified getitem/setitem.

jbrockmendel avatar Aug 25 '22 23:08 jbrockmendel

fine by me

jbrockmendel avatar Aug 29 '22 18:08 jbrockmendel

@mroeschke FYI the What's New for 1.5 already include this PR and #47403, but we haven't merged so far.

jonashaag avatar Sep 09 '22 18:09 jonashaag

Sorry this and the other PR flew under the radar during the 1.5.0.rc release. I agree with @datapythonista as mentioned in https://github.com/pandas-dev/pandas/pull/47403#issuecomment-1242755217 and I think these would be more suitable for 1.6/2.0

mroeschke avatar Sep 12 '22 19:09 mroeschke

@mroeschke can we please merge this together with #47403 and #47656

jonashaag avatar Sep 30 '22 11:09 jonashaag

It’s in the other Pr

jonashaag avatar Oct 03 '22 18:10 jonashaag

It’s in the other Pr

Any particular order these PRs should be reviewed/merged? I haven't been in the loop with these PR much and it seems like they contain items relevant to other PRs (like that whatsnew). If they are completely independent (including the whatnew), I think it might be easier to review

mroeschke avatar Oct 03 '22 18:10 mroeschke

Feel free to merge in any order. I can fix any conflicts. Making separate what’s new will require a conflict resolution on each PR after each merge

jonashaag avatar Oct 03 '22 20:10 jonashaag

Code changes are independent, just the what’s new is in one PR to avoid conflicts

jonashaag avatar Oct 03 '22 20:10 jonashaag

Thanks @jonashaag

mroeschke avatar Oct 03 '22 21:10 mroeschke