SAS7BDAT parser: Speed up RLE/RDC decompression
Speed up RLE/RDC decompression. Brings a 30-50% performance improvement on SAS7BDAT files using compression.
Works by avoiding calls into NumPy array creation and using a custom-built buffer instead.
Also adds a bunch of assert statements to avoid illegal reads/writes. These slow the code down considerably; I will try to improve on that in a future PR.
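For illustration, the bounds-checking idea can be sketched in plain Python. This is not the code in this PR: the PR uses a custom-built buffer on the Cython side, and the class and method names below (`CheckedBuffer`, `get`, `set`, `to_bytes`) are hypothetical. A `bytearray` is used here only to keep the sketch runnable; it says nothing about performance.

```python
class CheckedBuffer:
    """Fixed-size byte buffer with bounds-checked access (hypothetical sketch)."""

    def __init__(self, length: int) -> None:
        # Zero-initialised storage, similar in spirit to a calloc'ed block.
        self._data = bytearray(length)
        self._length = length

    def get(self, offset: int) -> int:
        # Guard against illegal reads.
        assert 0 <= offset < self._length, "Out of bounds read"
        return self._data[offset]

    def set(self, offset: int, value: int) -> None:
        # Guard against illegal writes.
        assert 0 <= offset < self._length, "Out of bounds write"
        self._data[offset] = value

    def to_bytes(self) -> bytes:
        # Copy the contents out as an immutable bytes object.
        return bytes(self._data)


buf = CheckedBuffer(4)
buf.set(0, 7)
first = buf.get(0)
contents = buf.to_bytes()
```

Each access pays for one comparison, which is the kind of overhead the assert statements add in exchange for catching illegal reads/writes.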
Alternatives considered:
- Fast NumPy array creation: Didn't find a way to do it.
- Using Python's bytearray: Much slower.
- Using array.array: Much slower. Cython has a fast path but it is incompatible with PyPy.

- [ ] closes #xxxx (Replace xxxx with the Github issue number)
- [ ] Tests added and passed if fixing a bug or adding a new feature
- [ ] All code checks passed.
- [ ] Added type annotations to new arguments/methods/functions.
- [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
@jbrockmendel mind reviewing this as well?
@jonashaag thanks for your patience; I'm just coming off of a semi-vacation, starting to dig into the ping backlog now.
Fast NumPy array creation: Didn't find a way to do it.
Which usage would you need to replace?
cc @WillAyd
Fast NumPy array creation: Didn't find a way to do it.
Which usage would you need to replace?
Essentially the call to calloc. Cython will always call into NumPy and that will be done thousands/millions of times for a SAS file.
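To make the per-call allocation pattern concrete, here is a rough sketch in plain Python. It uses a toy run-length format of (count, value) byte pairs, which is an assumption for illustration and not the SAS7BDAT RLE command set; `toy_rle_decompress` is a hypothetical name. The point is only that each call allocates one output buffer, and the parser makes one such call per compressed row.

```python
def toy_rle_decompress(inbuff: bytes, result_length: int) -> bytes:
    """Decode (count, value) pairs into a buffer of result_length bytes."""
    out = bytearray(result_length)  # one allocation per call
    ipos = 0
    opos = 0
    while ipos < len(inbuff):
        # Each run needs two input bytes: a count and a value.
        assert ipos + 1 < len(inbuff), "Truncated input"
        count = inbuff[ipos]
        value = inbuff[ipos + 1]
        ipos += 2
        # Refuse to write past the end of the output buffer.
        assert opos + count <= result_length, "Out of bounds write"
        out[opos:opos + count] = bytes([value]) * count
        opos += count
    assert opos == result_length, "Unexpected output length"
    return bytes(out)


# Two toy "rows", each decompressing to 5 bytes.
rows = [bytes([3, 65, 2, 66]), bytes([5, 32])]
decoded = [toy_rle_decompress(row, 5) for row in rows]
```

With thousands or millions of rows, whatever the allocation costs per call is multiplied accordingly, which is why the cost of creating the output buffer matters here.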
Can you also add a whatsnew note?
@jonashaag can you rebase?
@jbrockmendel ok here?
@jbrockmendel mind reviewing this? thanks! :)
Works by avoiding calls into NumPy array creation and using a custom-built buffer instead.
Where is the ndarray creation that is so expensive? I don't have any real objection here, but am not wild about introducing a new class/struct whose methods are glorified getitem/setitem.
fine by me
@mroeschke FYI the What's New for 1.5 already includes this PR and #47403, but we haven't merged them so far.
Sorry this and the other PR flew under the radar during the 1.5.0.rc release. I agree with @datapythonista as mentioned in https://github.com/pandas-dev/pandas/pull/47403#issuecomment-1242755217 and I think these would be more suitable for 1.6/2.0
@mroeschke can we please merge this together with #47403 and #47656?
It’s in the other PR.
Any particular order these PRs should be reviewed/merged in? I haven't been in the loop with these PRs much and it seems like they contain items relevant to other PRs (like that whatsnew). If they are completely independent (including the whatsnew), I think it might be easier to review.
Feel free to merge in any order. I can fix any conflicts. Making separate whatsnew entries would require a conflict resolution on each PR after each merge.
Code changes are independent; just the whatsnew is in one PR to avoid conflicts.
Thanks @jonashaag