Pandera is very slow to import when optional dependencies are installed
Consider the following two environments:
---
name: test1
channels:
- conda-forge
- nodefaults
dependencies:
- pandera
---
name: test2
channels:
- conda-forge
- nodefaults
dependencies:
- pandera
- pandas-stubs
- pyspark >= 3.2.0
- polars >= 0.20.0
- modin
- protobuf
- geopandas
- shapely
- fastapi
And the following commands:
Pandera with no optional dependencies takes around half a second to import and uses 94 MB of memory
$ /usr/bin/time -v test1/bin/python -c 'import pandera' 2>&1 | grep -E "Elapsed|Maximum resident"
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.58
Maximum resident set size (kbytes): 94524
Pandera with all optional dependencies takes over two seconds to import and uses 243 MB of memory
$ /usr/bin/time -v test2/bin/python -c 'import pandera' 2>&1 | grep -E "Elapsed|Maximum resident"
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.17
Maximum resident set size (kbytes): 243776
In a large project, I may have many of pandera's optional dependencies installed for other reasons without wanting to use them with pandera.
Simply importing pandera shouldn't force every installed optional dependency to be imported.
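To see which imports account for the extra time, CPython's `-X importtime` flag prints a per-module breakdown to stderr. The following is a minimal sketch of a helper that runs it in a fresh interpreter and returns the most expensive imports. The function name `slowest_imports` is hypothetical and not part of pandera.

```python
import subprocess
import sys


def slowest_imports(module: str, top: int = 5) -> list[tuple[str, int]]:
    """Return up to `top` (name, cumulative microseconds) pairs, slowest first."""
    # -X importtime makes CPython write one timing line per imported module to stderr.
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    rows = []
    for line in proc.stderr.splitlines():
        if not line.startswith("import time:"):
            continue
        # Each line has the shape "import time: <self> | <cumulative> | <name>".
        parts = line[len("import time:"):].split("|")
        if len(parts) != 3:
            continue
        try:
            cumulative = int(parts[1])
        except ValueError:
            continue  # skip the header line
        rows.append((parts[2].strip(), cumulative))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]
```

Calling `slowest_imports("pandera")` inside each of the two environments should make it visible which optional packages dominate the difference.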
hi @apmorton thanks for the benchmarks. Yes! It's time to optimize this.
One approach is to use lazy module loading, similar to how flytekit does it for optional dependencies (I'm one of the maintainers of that project, and we had a similar problem), e.g. see here: https://github.com/flyteorg/flytekit/blob/master/flytekit/lazy_import/lazy_module.py. We can then use `lazy_module("<optional_package>")` like so:
https://github.com/flyteorg/flytekit/blob/76fb7c344162b7fae141c40c2fc8d18a71091fc2/flytekit/deck/renderer.py#L8-L14
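from typing import TYPE_CHECKING

from flytekit.lazy_import.lazy_module import lazy_module

if TYPE_CHECKING:
    # Always import these modules in type-checking mode or when running pytest
    import pandas
    import pyarrow
else:
    # At runtime, defer the real import until the module is first used
    pandas = lazy_module("pandas")
    pyarrow = lazy_module("pyarrow")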
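For readers without flytekit at hand, the same idea can be sketched with the standard library's `importlib.util.LazyLoader`. This is not flytekit's `lazy_module` implementation and not what pandera ships. The helper name `lazy_import` is hypothetical, and it follows the lazy-import recipe from the `importlib` documentation.

```python
import importlib.util
import sys


def lazy_import(name: str):
    """Return a module object whose body runs only on first attribute access."""
    if name in sys.modules:
        # Already imported elsewhere, so there is nothing left to defer.
        return sys.modules[name]
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ModuleNotFoundError(f"No module named {name!r}")
    # LazyLoader wraps the real loader and postpones executing the module.
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # sets up the lazy module; does not run its code yet
    return module


# Example with a small stdlib module: nothing is executed on this line...
colorsys = lazy_import("colorsys")
# ...the real import happens here, on first attribute access.
hsv = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
```

One design difference to keep in mind: this sketch raises immediately if the package is not installed, so an optional dependency would still need a guard around the call.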
@apmorton if you have the capacity, would you mind pointing out parts of the codebase where these optional packages are being imported? It should be pretty straightforward then to use the lazy module loader.
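One way to narrow down where to look is to check which optional packages end up in `sys.modules` after a bare import. The sketch below does this in a fresh interpreter so that the calling process does not skew the result. The helper name `loaded_after_import` is hypothetical, and candidates are assumed to be top-level import names.

```python
import subprocess
import sys


def loaded_after_import(module: str, candidates: list[str]) -> list[str]:
    """Return the candidates whose top-level package is loaded after `import module`."""
    code = (
        "import sys\n"
        f"import {module}\n"
        # Reduce names such as 'pkg.sub.mod' to their top-level package 'pkg'.
        "roots = {name.partition('.')[0] for name in sys.modules}\n"
        "print('\\n'.join(sorted(roots)))\n"
    )
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    roots = set(proc.stdout.split())
    return [name for name in candidates if name in roots]
```

For example, `loaded_after_import("pandera", ["pyspark", "polars", "modin", "geopandas", "shapely", "fastapi"])` run in the `test2` environment would list the optional packages that a plain `import pandera` pulls in. An empty list would mean none of them are imported eagerly.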
I can reproduce this on macOS:
❯ /usr/bin/time -l python -c 'import pandera' 2>&1
1.80 real 4.17 user 2.46 sys
215072768 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
21716 page reclaims
3210 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
1733 voluntary context switches
10253 involuntary context switches
36530415133 instructions retired
20643165531 cycles elapsed
159648640 peak memory footprint
Looking into a quick fix to unblock this.
@apmorton Okay, so changes to the way imports work have cut the import time by a little more than half (1.80 s to 0.86 s real time) and the maximum resident set size by about a third (roughly 215 MB to 140 MB):
❯ /usr/bin/time -l python -c 'import pandera' 2>&1
0.86 real 1.18 user 2.18 sys
139972608 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
38757 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
1 messages sent
0 messages received
0 signals received
0 voluntary context switches
91980 involuntary context switches
13941572873 instructions retired
10446549912 cycles elapsed
99754368 peak memory footprint
There are still various places where optional deps are imported on a top-level pandera import, like here and here.
Will keep this issue open until these numbers are shaved down to roughly the same as a bare pandera installation.
Created #1753, which refactors the import structure so that optional dependencies are not imported in the top-level pandera import execution path. There's still a small difference compared to a bare installation, but the import time of pandera (if you happen to have all the optional dependencies installed) is now under 1 second (~0.7-0.8 seconds).
Going to consider this issue addressed once the PR is merged.