Pandera is very slow to import when optional dependencies are installed

Open apmorton opened this issue 1 year ago • 2 comments

Consider the following two environments:

---
name: test1
channels:
  - conda-forge
  - nodefaults
dependencies:
  - pandera
---
name: test2
channels:
  - conda-forge
  - nodefaults
dependencies:
  - pandera
  - pandas-stubs
  - pyspark >= 3.2.0
  - polars >= 0.20.0
  - modin
  - protobuf
  - geopandas
  - shapely
  - fastapi

And the following commands:

Pandera with no optional dependencies takes around half a second to import and uses about 94 MB of memory:

$ /usr/bin/time -v test1/bin/python -c 'import pandera' 2>&1 | grep -E "Elapsed|Maximum resident"
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.58
        Maximum resident set size (kbytes): 94524

Pandera with all optional dependencies takes over two seconds to import and uses about 243 MB of memory:

$ /usr/bin/time -v test2/bin/python -c 'import pandera' 2>&1 | grep -E "Elapsed|Maximum resident"
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.17
        Maximum resident set size (kbytes): 243776

In a large project I may have many of pandera's optional dependencies installed for one reason or another but not want to use them with pandera.

Simply importing pandera shouldn't forcefully import all optional dependencies.

apmorton avatar May 14 '24 23:05 apmorton

Hi @apmorton, thanks for the benchmarks. Yes! It's time to optimize this.

One approach is to use lazy module loading, similar to how flytekit does it for optional dependencies (I'm one of the maintainers of that project, and we had a similar problem), e.g. see here: https://github.com/flyteorg/flytekit/blob/master/flytekit/lazy_import/lazy_module.py. We can then use lazy_module("<optional_package>") like so: https://github.com/flyteorg/flytekit/blob/76fb7c344162b7fae141c40c2fc8d18a71091fc2/flytekit/deck/renderer.py#L8-L14

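from typing import TYPE_CHECKING

from flytekit.lazy_import.lazy_module import lazy_module

if TYPE_CHECKING:
    # Static type checkers take this branch, so they see the real modules
    import pandas
    import pyarrow
else:
    # At runtime, defer the actual import until the module is first used
    pandas = lazy_module("pandas")
    pyarrow = lazy_module("pyarrow")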
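
For anyone unfamiliar with the pattern, here is a minimal sketch of a lazy importer built on the standard library's importlib.util.LazyLoader. It follows the recipe in the importlib documentation. The function name lazy_import is hypothetical, and this is not necessarily how flytekit's lazy_module is implemented or how pandera would end up doing it.

```python
import importlib.util
import sys
from types import ModuleType


def lazy_import(name: str) -> ModuleType:
    """Return a module object whose body only runs on first attribute access.

    Hypothetical helper for illustration; based on the importlib docs recipe.
    """
    # If something already imported the module, just reuse it.
    if name in sys.modules:
        return sys.modules[name]

    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ModuleNotFoundError(f"No module named {name!r}")

    # Wrap the real loader so that executing the module is postponed.
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # Sets up the lazy module, does not run its body yet.
    return module
```

With something like this, a line such as polars = lazy_import("polars") at module level should cost very little until an attribute of polars is actually touched. Note that find_spec still has to locate the package, so a missing optional dependency fails immediately rather than on first use.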

@apmorton if you have the capacity, would you mind pointing out parts of the codebase where these optional packages are being imported? It should be pretty straightforward then to use the lazy module loader.
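
One way to narrow down where the time goes is CPython's -X importtime flag, which writes a per-module timing line to stderr. The sketch below runs an import in a fresh interpreter and sorts modules by cumulative time. The helper names parse_importtime and profile_import are made up for this example, and the parser assumes the "import time: self | cumulative | name" line layout that CPython prints.

```python
import subprocess
import sys


def parse_importtime(stderr: str) -> list[tuple[int, str]]:
    """Parse `-X importtime` output into (cumulative_us, module) pairs, slowest first."""
    rows = []
    for line in stderr.splitlines():
        if not line.startswith("import time:"):
            continue
        parts = line[len("import time:"):].split("|")
        if len(parts) != 3:
            continue
        try:
            cumulative = int(parts[1])
        except ValueError:
            continue  # The header line has text instead of numbers.
        rows.append((cumulative, parts[2].strip()))
    return sorted(rows, reverse=True)


def profile_import(module: str) -> list[tuple[int, str]]:
    """Import `module` in a fresh interpreter and return its import timings."""
    result = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
    )
    return parse_importtime(result.stderr)
```

Running profile_import("pandera")[:20] in the test2 environment should surface the heaviest packages near the top, which gives a starting point for finding the import statements that pull them in.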

cosmicBboy avatar May 15 '24 19:05 cosmicBboy

I can reproduce this on macOS:

❯ /usr/bin/time -l python -c 'import pandera' 2>&1
        1.80 real         4.17 user         2.46 sys
           215072768  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               21716  page reclaims
                3210  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                1733  voluntary context switches
               10253  involuntary context switches
         36530415133  instructions retired
         20643165531  cycles elapsed
           159648640  peak memory footprint

Looking into a quick fix to unblock this.

cosmicBboy avatar May 23 '24 13:05 cosmicBboy

@apmorton Okay, so changes to the way imports work have cut the import time by a little more than half (1.80 s to 0.86 s real) and the maximum resident set size by about a third (roughly 215 MB to 140 MB):

❯ /usr/bin/time -l python -c 'import pandera' 2>&1
        0.86 real         1.18 user         2.18 sys
           139972608  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               38757  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   1  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
               91980  involuntary context switches
         13941572873  instructions retired
         10446549912  cycles elapsed
            99754368  peak memory footprint

There are still various places where optional deps are imported on a top-level pandera import, like here and here.

Will keep this issue open until these numbers are shaved down to roughly the same as a bare pandera installation.

cosmicBboy avatar Jul 17 '24 20:07 cosmicBboy

Created #1753, which refactors the import structure so that optional dependencies are not imported in the top-level pandera import execution path. There's still a small difference, but the import time of pandera (if you happen to have all the optional dependencies installed) is now under 1 second (~0.7-0.8 seconds).
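
A possible way to guard against regressions is to check, in a fresh interpreter, which optional packages end up in sys.modules after the top-level import. This is only a sketch: loaded_after_import is a hypothetical helper, not something in pandera or in the PR, and the OPTIONAL list is an assumption based on the test2 environment above.

```python
import subprocess
import sys

# Import names of optional packages from the test2 environment (assumed list).
OPTIONAL = ["pyspark", "polars", "modin", "geopandas", "shapely", "fastapi"]


def loaded_after_import(target: str, candidates: list[str]) -> list[str]:
    """Return the candidates found in sys.modules after importing `target`.

    The import runs in a subprocess so modules already loaded in the
    current interpreter cannot affect the result.
    """
    code = (
        "import sys, importlib\n"
        f"importlib.import_module({target!r})\n"
        f"print(','.join(m for m in {candidates!r} if m in sys.modules))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    # Only the last stdout line is ours; the import itself may print things.
    lines = result.stdout.splitlines()
    last = lines[-1] if lines else ""
    return [name for name in last.split(",") if name]
```

Once no optional dependency is imported on the top-level path, loaded_after_import("pandera", OPTIONAL) would be expected to return an empty list, and a test could assert exactly that.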

Going to consider this issue addressed once the PR is merged.

cosmicBboy avatar Jul 18 '24 17:07 cosmicBboy