polars Optimize python imports

Problem Description

Our optional dependencies influence our import time. Pandas import is ~500ms!!

Import times:

only polars installed:
0.268s
polars + pandas installed
0.755s

We should explore if we can do this lazily.

Aug 19 '22 07:08 ritchie46

I remember that numpy used to take around 1 second to import in the past (which is quite noticeble when you just want to display the help of a script). So it might be that most of the pandas time comes from this.

Aug 19 '22 08:08 ghuls

importlib.set_lazy_imports() looks interesting.

Aug 19 '22 08:08 ghuls

I remember that numpy used to take around 1 second to import in the past (which is quite noticeble when you just want to display the help of a script). So it might be that most of the pandas time comes from this.

I think we should lazy import all of these. I expect pyarrow also does non trivial work. Our polars.testing also seems unnecessary.

importlib.set_lazy_imports() looks interesting.

Indeed!

Aug 19 '22 08:08 ritchie46

There is also pyforest which does not require importing the modules yourself at all, till you call the function: https://medium.com/@virajmane02/pyforest-lazy-import-of-python-data-science-libraries-72428be0299b But I assume it will not play nicely with typechecking and linting if enabled unconditionally.

Aug 19 '22 08:08 ghuls

There is also pyforest which does not require importing the modules yourself at all, till you call the function: https://medium.com/@virajmane02/pyforest-lazy-import-of-python-data-science-libraries-72428be0299b But I assume it will not play nicely with typechecking and linting if enabled unconditionally.

I don't think I like that idea. I think we could do something like this:

if TYPE_CHECKING:
    import numpy as np
    
def numpy_available() -> bool:
    try:
        import numpy
        return True
    except ImportError:
        return False

def get_np() -> np:
    import numpy

Which is our current behavior behind function calls. If I am correct we only pay the import cost once.

Aug 19 '22 08:08 ritchie46

importlib.set_lazy_imports() looks interesting.

Cool, I didn't know about this PEP! Thanks for linking it.

Taking advantage of this would take some thought and rewriting, though. Enabling this would indeed improve the initial import time, but you'd 'pay back' that time upon the first invocation of a class/function in a module where pandas is imported. And since we import pandas in most of the major modules (io, frame, series), you won't get around it easily.

Aug 19 '22 09:08 stinodego

Taking advantage of this would take some thought and rewriting, though. Enabling this would indeed improve the initial import time, but you'd 'pay back' that time upon the first invocation of a class/function in a module where pandas is imported. And since we import pandas in most of the major modules (io, frame, series), you won't get around it easily.

But we rarely use it. Only on certain branches if all other branches fail. You only pay back this import time when you:

pass a pandas dataframe (then the user already imported pandas, so no payback at all).
call to_pandas() (which might pay the import cost once if the user did not import it already).

Both cases are far better than always paying it if it is installed in the environment.

Aug 19 '22 10:08 ritchie46

Lazy imports are great - the only issue is where you want to know up-front if a certain module is available (eg: setting _PANDAS_AVAILABLE, etc), as that is currently determined by actually trying the import. (There are a few different techniques for doing lazy import; some can actually add overhead to each call into the lazy-loaded module, so good to be aware of that and check/benchmark beforehand).

Aug 19 '22 10:08 alexander-beedie

Lazy imports are great - the only issue is where you want to know up-front if a certain module is available (eg: setting _PANDAS_AVAILABLE, etc), as that is currently determined by actually trying the import. (There are a few different techniques for doing lazy import; some can actually add overhead to each call into the lazy-loaded module, so good to be aware of that and check/benchmark beforehand).

Could we get that information without importing those modules?

Aug 19 '22 11:08 ritchie46

importlib.util.find_spec("pandas") might be a solution to set _PANDAS_AVAILABLE:

In [1]: import importlib

In [2]: %time importlib.util.find_spec("numpy")
CPU times: user 204 µs, sys: 12 µs, total: 216 µs
Wall time: 146 µs
Out[2]: ModuleSpec(name='numpy', loader=<_frozen_importlib_external.SourceFileLoader object at 0x7f50f10df280>, origin='/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/polars_test/lib/python3.9/site-packages/numpy/__init__.py', submodule_search_locations=['/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/polars_test/lib/python3.9/site-packages/numpy'])

In [3]: %time importlib.util.find_spec("numpy")
CPU times: user 356 µs, sys: 0 ns, total: 356 µs
Wall time: 231 µs
Out[3]: ModuleSpec(name='numpy', loader=<_frozen_importlib_external.SourceFileLoader object at 0x7f50eb04a4c0>, origin='/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/polars_test/lib/python3.9/site-packages/numpy/__init__.py', submodule_search_locations=['/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/polars_test/lib/python3.9/site-packages/numpy'])

In [4]: %time importlib.util.find_spec("pandas")
CPU times: user 212 µs, sys: 12 µs, total: 224 µs
Wall time: 154 µs
Out[4]: ModuleSpec(name='pandas', loader=<_frozen_importlib_external.SourceFileLoader object at 0x7f50f10df0a0>, origin='/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/polars_test/lib/python3.9/site-packages/pandas/__init__.py', submodule_search_locations=['/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/polars_test/lib/python3.9/site-packages/pandas'])

In [5]: %time importlib.util.find_spec("pandas2")
CPU times: user 175 µs, sys: 0 ns, total: 175 µs
Wall time: 153 µs

In [6]: %time import numpy
CPU times: user 173 ms, sys: 566 ms, total: 739 ms
Wall time: 91 ms

In [7]: %time import pandas
CPU times: user 134 ms, sys: 5.35 ms, total: 140 ms
Wall time: 157 ms

Aug 19 '22 11:08 ghuls

Working on it.

Aug 19 '22 12:08 ghuls

hi. I just want to remark that pruning these imports would make polars significantly more useful for building cli tools. in the context of cli, startup times + import times matter a lot

on my somewhat fast machine polars 0.14.21 takes >900ms to import. most of these imports seem to be related to either polars.testing or pandas. if these imports could be eliminated or made lazy, this would lead to a significantly better cli ux Screenshot from 2022-10-20 23-10-19

I maintain a few different cli tools that require dataframe operations. I would prefer to switch them over from pandas to polars, but pandas is currently net faster for many small operations because of polars' longer import times

I don't know whether this is a priority right now but just wanted to share this perspective

Oct 21 '22 06:10 sslivkoff

I will take a look.

Oct 21 '22 06:10 ritchie46

mes + import times matter a lot

How do you get that visualization? I try with tuna, but I don't get that depth?

Oct 21 '22 07:10 ritchie46

hmm I'm just using tuna 0.5.11, nothing special (and I was testing polars 0.14.21). clicking on entries will zoom in on them

btw Ive found it helpful to put:

function importtime() {
    python3 -X importtime -c "import $1" 2>/tmp/tuna.log;
    tuna /tmp/tuna.log;
}

inside my bash .profile so that I can type importtime <package_name> to quickly test packages. thats the command i used to generate the plot above. happy to run more tests or benchmarks if you think it would be helpful

Oct 21 '22 07:10 sslivkoff

Thanks! That works.

Oct 21 '22 08:10 ritchie46

incredible stuff 🍾🍾

Oct 21 '22 15:10 sslivkoff

@ritchie46: FYI - I'm doing a second pass through this; got a nice way to make it more idiomatic/seamless (eg: no need for lazy_isinstance, can just use regular isinstance, etc) without giving up any of the fantastic gains. Will post a PR when I've finished working through it and confirmed the timings with a tuna plot).

Oct 22 '22 13:10 alexander-beedie

@ritchie46: FYI - I'm doing a second pass through this; got a nice way to make it more idiomatic/seamless (eg: no need for lazy_isinstance, can just use regular isinstance, etc) without giving up any of the fantastic gains. Will post a PR when I've finished working through it and confirmed the timings with a tuna plot).

Ooh.. I am curious what you came up with.

Oct 22 '22 13:10 ritchie46

Ooh.. I am curious what you came up with.

Only four unit test failures away from finding out; almost there ;)

Update: almost finished [#5302] one minor sphinx issue to go - everything else now looking good.

Oct 22 '22 13:10 alexander-beedie

@ritchie46 Done! 😅

@ghuls helpfully pointed out that my first pass wasn't properly replicating the lazy_instance behaviour; did another pass and tweaked things so that full laziness of the guard/pattern is still preserved, but using a regular isinstance. Addressed remaining lint/docs/tests and rebased against the latest code...

Oct 23 '22 10:10 alexander-beedie

I've create a package for this purpse, please checkout https://github.com/anhvth/lazy_module

Nov 11 '22 05:11 anhvth

polars polars copied to clipboard

Optimize python imports

Problem Description

polars
polars copied to clipboard