SAS7BDAT parser: Improve subheader lookup performance
Avoid constructing _SubheaderPointer objects and make dictionary lookups in C rather than in Python.
Speedup relative to current main:
<main> <sas/shlookup~1>
- 8.32±0.07ms 7.51±0.06ms 0.90 io.sas.SAS.time_test_meta2_pagesas7bdat
- 82.8±0.5ms 73.6±0.5ms 0.89 io.sas.SAS.time_read_sas7bdat_2_chunked
before after ratio
Will extend what's new from https://github.com/pandas-dev/pandas/pull/47404 once that's merged.
- [ ] closes #xxxx (Replace xxxx with the Github issue number)
- [ ] Tests added and passed if fixing a bug or adding a new feature
- [ ] All code checks passed.
- [ ] Added type annotations to new arguments/methods/functions.
- [ ] Added an entry in the latest
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.
Test failure seems unrelated to this PR but still worth fixing. I'll need some guidance here. The problem is that we don't have an explicit .close() when using pd.read_sas(iterator=True) and so a file descriptor is leaked. But file handling in read_sas is implemented the same way it is implemented in pd.read_csv(iterator=True), which doesn't cause leaked file descriptors. So I wonder what's wrong here.
@mroeschke any pointers for the failing test (see comment above)?
Sorry, I'm not too familiar with this part of the codebase. It looks like it's using the shared file handling code, but not sure if the one of the sas is still keeping a reference to the file handle somewhere?
Bug in file_leak_context, see #30096.
@mroeschke tests fixed, should be good to review now.
Thanks @jonashaag