pandas SAS7BDAT parser: Speed up RLE/RDC decompression

Speed up RLE/RDC decompression. Brings a 30-50% performance improvement on SAS7BDAT files using compression.

Works by avoiding calls into NumPy array creation and using a custom-built buffer instead.

Also adds a bunch of assert statements to avoid illegal reads/writes. These slow the code down considerably; I will try to improve on that in a future PR.

Alternatives considered:

Fast NumPy array creation: Didn't find a way to do it.
Using Python's bytearray: Much slower.
Using array.array: Much slower. Cython has a fast path but it is incompatible with PyPy.
[ ] closes #xxxx (Replace xxxx with the Github issue number)
[ ] Tests added and passed if fixing a bug or adding a new feature
[ ] All code checks passed.
[ ] Added type annotations to new arguments/methods/functions.
[ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Jun 17 '22 11:06 jonashaag

@jbrockmendel mind reviewing this as well?

Jun 27 '22 18:06 jonashaag

@jonashaag thanks for your patience; im just coming off of a semi-vacation, starting to dig into the ping backlog now.

Jul 06 '22 20:07 jbrockmendel

Fast NumPy array creation: Didn't find a way to do it.

Which usage would you need to replace?

Jul 06 '22 20:07 jbrockmendel

cc @WillAyd

Jul 06 '22 20:07 jbrockmendel

Fast NumPy array creation: Didn't find a way to do it.

Which usage would you need to replace?

Essentially the call to calloc. Cython will always call into NumPy and that will be done thousands/millions of times for a SAS file.

Jul 07 '22 06:07 jonashaag

can you also add a whatsnew note

Jul 08 '22 23:07 jreback

@jonashaag can you rebase.

@jbrockmendel ok here?

Jul 10 '22 15:07 jreback

@jbrockmendel mind to review this? thanks! :)

Aug 08 '22 07:08 jonashaag

Works by avoiding calls into NumPy array creation and using a custom-built buffer instead.

where is the ndarray creation that is so expensive? i dont have any real objection here, but am not wild about introducing a new class/struct whose methods are glorified getitem/setitem.

Aug 25 '22 23:08 jbrockmendel

fine by me

Aug 29 '22 18:08 jbrockmendel

@mroeschke FYI the What's New for 1.5 already include this PR and #47403, but we haven't merged so far.

Sep 09 '22 18:09 jonashaag

Sorry this and the other PR flew under the radar during the 1.5.0.rc release. I agree with @datapythonista as mentioned in https://github.com/pandas-dev/pandas/pull/47403#issuecomment-1242755217 and I think these would be more suitable for 1.6/2.0

Sep 12 '22 19:09 mroeschke

@mroeschke can we please merge this together with #47403 and #47656

Sep 30 '22 11:09 jonashaag

It’s in the other Pr

Oct 03 '22 18:10 jonashaag

It’s in the other Pr

Any particular order these PRs should be reviewed/merged? I haven't been in the loop with these PR much and it seems like they contain items relevant to other PRs (like that whatsnew). If they are completely independent (including the whatnew), I think it might be easier to review

Oct 03 '22 18:10 mroeschke

Feel free to merge in any order. I can fix any conflicts. Making separate what’s new will require a conflict resolution on each PR after each merge

Oct 03 '22 20:10 jonashaag

Code changes are independent, just the what’s new is in one PR to avoid conflicts

Oct 03 '22 20:10 jonashaag

Thanks @jonashaag

Oct 03 '22 21:10 mroeschke