Performance degradation caused by presence of optional numpy dependency
First of all, thanks for making polars available! It turned out to be a blazing fast alternative to pandas when doing pre-processing for real-time ML inference. True game changer!
We run an internal benchmark to flag possible performance regressions as part of testing new polars releases. Starting with 0.15.9, I noticed a minor performance regression that affects DataFrame initialization from `data: dict[str, tuple[float | str | None, ...]]`. After some digging, I believe this is caused by #5918, which dramatically increases calls to `_NUMPY_TYPE()`.
I was surprised by the performance impact of `_NUMPY_TYPE()`:
- without `numpy` being installed, the impact is minimal. Great!
- with `numpy` being merely installed, the impact grows by a factor of 27x - regardless of whether numpy is actually used. Not so great!
There is a trivial hack that speeds up DataFrame initialization and our benchmark by ~10%, well offsetting the initially discovered regression:
```python
import polars.dependencies

polars.dependencies._NUMPY_AVAILABLE = False
```
Would you consider supporting a corresponding fast path for users that are also working with `polars` and `numpy`, just not in combination?
A few potential approaches come to mind:
- Refactor `polars.dependencies._NUMPY_AVAILABLE` and make it available as part of the API, for users to overwrite manually when required. Something like `polars.NO_NUMPY = True`. -> Still feels hacky
- Make `polars.dependencies._NUMPY_AVAILABLE` user configurable, e.g. by supporting optional environment variables or config files that may be used to trigger an override of `polars.dependencies._NUMPY_AVAILABLE`. -> I would be happy to use an environment variable, such as `PY_POLARS_NO_NUMPY=1`
- Understand and accept the performance penalty of `polars.dependencies._NUMPY_AVAILABLE` and `polars.dependencies._NUMPY_TYPE()` and potentially reduce reliance on those throughout the code base. -> high effort
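A minimal sketch of the environment-variable approach; note that `PY_POLARS_NO_NUMPY` is the variable name proposed above, not an existing polars setting:

```python
import importlib.util
import os


def numpy_available() -> bool:
    """Resolve numpy availability, honouring the proposed opt-out."""
    # honour the (hypothetical) opt-out before probing for numpy
    if os.environ.get("PY_POLARS_NO_NUMPY") == "1":
        return False
    return importlib.util.find_spec("numpy") is not None


_NUMPY_AVAILABLE = numpy_available()
```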
BTW: The same issue applies to similar constants, such as `polars.dependencies._PANDAS_AVAILABLE`, which might affect other use cases, too.
I can take a look at this; do you have a suggested/straightforward example showing something particularly awful? Or literally just init from a column of basic sequence types? I'll start experimenting, but if there's something especially egregious then let me know.
I've already got some ideas for speedups & better logic, so I'm quite sure we can address this without any hackery being required (I also believe this is the only place where a tight loop of such checks exists; everywhere else these lazy-typecheck functions are only called once or twice per method, so they have no impact).
do you have a suggested/straightforward example showing something particularly awful?
Here is one trivial way to reproduce the ~10% performance penalty:
Base case:
```
python3 -m timeit -s "import polars.dependencies; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "polars.DataFrame(data)"
200 loops, best of 5: 1.22 msec per loop
```
Performance "hack":
```
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "polars.DataFrame(data)"
200 loops, best of 5: 1.1 msec per loop
```
By the way, the pandas equivalent is ~8x slower: 🐌
```
python3 -m timeit -s "import pandas; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "pandas.DataFrame(data)"
20 loops, best of 5: 9.67 msec per loop
```
Mac mini (M1, 2020), macOS Ventura 13.1, CPython 3.10.9, polars 0.15.11
Maybe it is also an idea to add specialized constructors to the public interface that run almost no Python bytecode? This might be useful for users that need to create many very small DataFrames.
On a related note: I would have expected that specifying column data types via the `columns` argument would speed up DataFrame initialization, since type inference could be avoided. In reality, it adds a significant penalty, ~10-15% in trivial cases.
This should be a separate issue, but it might influence potential work on improvements such as specialized constructors.
Let me know and I would be happy to elaborate and provide an example in a separate issue.
I've got a small change almost ready to go that makes the `_NUMPY_TYPE` check ~8x faster (running in an isolated loop), but ideally I'm going to make it faster still. Then I'll think about some more structural improvements to our highly overloaded frame init, and also take a look at the apparent schema/type-inference slowdown (@jakob-keller: got a particular example of that one too?)
@ritchie46: a dedicated low-overhead init path for basic/key data types sounds great :)
Wow, you guys are as fast as polars itself ;-)
(@jakob-keller; got a particular example of that one too?)
Sure thing:
Without `columns` argument:
```
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(100)) for c in range(100)}; columns = tuple((column, polars.Int64) for column in data)" "polars.from_dict(data)"
1000 loops, best of 5: 325 usec per loop
```
With `columns` argument:
```
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(100)) for c in range(100)}; columns = tuple((column, polars.Int64) for column in data)" "polars.from_dict(data, columns)"
1000 loops, best of 5: 364 usec per loop
```
I believe the issue might be related to the fact that the columns argument always requires column names to be provided. Maybe this triggers renaming of any Series produced (i.e. rename()), even if the name is already correctly set.
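If that hypothesis holds, a guard along these lines would skip the redundant rename. This is a sketch against a simplified stand-in class, not polars internals:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimpleSeries:
    """Minimal stand-in for a named series; not polars' Series."""

    name: str
    values: tuple


def with_name(s: SimpleSeries, name: str) -> SimpleSeries:
    # only rename when the name actually differs, avoiding a no-op copy
    if s.name == name:
        return s
    return replace(s, name=name)
```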
Please note that the results reverse without `polars.dependencies._NUMPY_AVAILABLE = False`. It appears that this masks the underlying issue, if any. Therefore, I decided against opening a separate issue for now.
Fantastic job! I will confirm and close as soon as 0.15.12 lands. Feel free to close earlier at your convenience.
Confirmed; the linked PR fixes the regression (which was worse than I realised) and may actually make things ~5% faster than before:
```python
with Timer():
    for i in range(1000):
        _df = pl.DataFrame(
            {f"column_{c}": tuple(range(1000)) for c in range(100)}
        )
```
- `0.15.8`: 2.4187 secs
- `0.15.9`: 4.5614 secs << regression
- `0.15.13` (post-PR): 2.2779 secs
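(`Timer` in the snippet above isn't shown in the thread; a minimal stand-in for reproducing the numbers could look like this - an assumption, not the original helper:)

```python
import time


class Timer:
    """Tiny timing context manager that prints elapsed wall time."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.elapsed:.4f} secs")
```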
Now to take it further...
Fantastic job! I will confirm and close as soon as 0.15.12 lands. Feel free to close earlier at your convenience.
No problem; thanks very much for the clean/detailed repro.
I can confirm that polars 0.15.13 addresses the regression as expected. Big thanks to everyone involved!