pandas icon indicating copy to clipboard operation
pandas copied to clipboard

SAS7BDAT parser: Improve subheader lookup performance

Open jonashaag opened this issue 3 years ago • 3 comments

Avoid constructing _SubheaderPointer objects and make dictionary lookups in C rather than in Python.

Speedup relative to current main:

     <main>           <sas/shlookup~1>
-     8.32±0.07ms      7.51±0.06ms     0.90  io.sas.SAS.time_test_meta2_pagesas7bdat
-      82.8±0.5ms       73.6±0.5ms     0.89  io.sas.SAS.time_read_sas7bdat_2_chunked
       before           after         ratio

Will extend what's new from https://github.com/pandas-dev/pandas/pull/47404 once that's merged.

  • [ ] closes #xxxx (Replace xxxx with the Github issue number)
  • [ ] Tests added and passed if fixing a bug or adding a new feature
  • [ ] All code checks passed.
  • [ ] Added type annotations to new arguments/methods/functions.
  • [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jonashaag avatar Jul 09 '22 21:07 jonashaag

Test failure seems unrelated to this PR but still worth fixing. I'll need some guidance here. The problem is that we don't have an explicit .close() when using pd.read_sas(iterator=True) and so a file descriptor is leaked. But file handling in read_sas is implemented the same way it is implemented in pd.read_csv(iterator=True), which doesn't cause leaked file descriptors. So I wonder what's wrong here.

jonashaag avatar Jul 12 '22 15:07 jonashaag

@mroeschke any pointers for the failing test (see comment above)?

jonashaag avatar Aug 08 '22 21:08 jonashaag

Sorry, I'm not too familiar with this part of the codebase. It looks like it's using the shared file handling code, but not sure if the one of the sas is still keeping a reference to the file handle somewhere?

mroeschke avatar Aug 09 '22 02:08 mroeschke

Bug in file_leak_context, see #30096.

jonashaag avatar Aug 22 '22 09:08 jonashaag

@mroeschke tests fixed, should be good to review now.

jonashaag avatar Sep 02 '22 18:09 jonashaag

Thanks @jonashaag

mroeschke avatar Oct 04 '22 17:10 mroeschke