Performance degradation caused by presence of optional numpy dependency
First of all, thanks for making polars available! It turned out to be a blazing fast alternative to pandas when doing pre-processing for real-time ML inference. True game changer!
We run an internal benchmark to flag possible performance regressions as part of testing new polars releases. Starting with 0.15.9, I noticed a minor performance regression that affects DataFrame initialization from `data: dict[str, tuple[float | str | None, ...]]`. After some digging, I believe this is caused by #5918, which dramatically increases calls to `_NUMPY_TYPE()`.
I was surprised by the performance impact of `_NUMPY_TYPE()`:
- without `numpy` being installed, the impact is minimal. Great!
- with `numpy` being merely installed, the impact grows by a factor of 27x - regardless of whether numpy is actually used. Not so great!
There is a trivial hack that speeds up DataFrame initialization and our benchmark by ~10%, well offsetting the initially discovered regression:
```python
import polars.dependencies

polars.dependencies._NUMPY_AVAILABLE = False
```
Would you consider supporting a corresponding fast path for users that are also working with `polars` and `numpy`, just not in combination?
A few potential approaches come to mind:
- Refactor `polars.dependencies._NUMPY_AVAILABLE` and make it available as part of the API, for users to overwrite manually when required. Something like `polars.NO_NUMPY = True`. -> Still feels hacky
- Make `polars.dependencies._NUMPY_AVAILABLE` user configurable, e.g. by supporting optional environment variables or config files that may be used to trigger an override of `polars.dependencies._NUMPY_AVAILABLE`. -> I would be happy to use an environment variable, such as `PY_POLARS_NO_NUMPY=1`
- Understand and accept the performance penalty of `polars.dependencies._NUMPY_AVAILABLE` and `polars.dependencies._NUMPY_TYPE()` and potentially reduce reliance on those throughout the code base. -> high effort
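A minimal sketch of the environment-variable approach; note that `PY_POLARS_NO_NUMPY` is the variable name proposed above, not an existing polars setting:

```python
import importlib.util
import os


def numpy_available() -> bool:
    """Resolve numpy availability, honouring the proposed opt-out."""
    # honour the (hypothetical) opt-out before probing for numpy
    if os.environ.get("PY_POLARS_NO_NUMPY") == "1":
        return False
    return importlib.util.find_spec("numpy") is not None


_NUMPY_AVAILABLE = numpy_available()
```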
BTW: The same issue applies to similar constants, such as `polars.dependencies._PANDAS_AVAILABLE`, which might affect other use cases, too.
I can take a look at this; do you have a suggested/straightforward example showing something particularly awful? Or literally just init from a column of basic sequence types? I'll start experimenting, but if there's something especially egregious then let me know.
I've already got some ideas for speedups & better logic, so I'm quite sure we can address this without any hackery being required (I also believe this is the only place where a tight loop of such checks exists; everywhere else these lazy-typecheck functions are only called once or twice per method, so they have no impact).
do you have a suggested/straightforward example showing something particularly awful?
Here is one trivial way to reproduce the ~10% performance penalty:
Base case:
```
python3 -m timeit -s "import polars.dependencies; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "polars.DataFrame(data)"
200 loops, best of 5: 1.22 msec per loop
```
Performance "hack":
```
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "polars.DataFrame(data)"
200 loops, best of 5: 1.1 msec per loop
```
By the way, the pandas equivalent is ~8x slower: 🐌
```
python3 -m timeit -s "import pandas; data = {f'column_{c}': tuple(range(1000)) for c in range(100)}" "pandas.DataFrame(data)"
20 loops, best of 5: 9.67 msec per loop
```
Mac mini (M1, 2020), macOS Ventura 13.1, CPython 3.10.9, polars 0.15.11
Maybe it is also an idea to add specialized constructors to the public interface that run almost no Python bytecode? This might be useful for users that need to create many very small DataFrames.
On a related note: I would have expected that specifying column data types via the `columns` argument would speed up DataFrame initialization, since type inference could be avoided. In reality, it adds a significant penalty, ~10-15% in trivial cases.
This should be a separate issue, but it might influence potential work on improvements such as specialized constructors.
Let me know and I would be happy to elaborate and provide an example in a separate issue.
I've got a small change almost ready to go that makes the `_NUMPY_TYPE` check ~8x faster (running in an isolated loop), but ideally I'm going to make it faster still. Then I'll think about some more structural improvements to our highly overloaded frame init, and also take a look at the apparent schema/type-inference slowdown (@jakob-keller: got a particular example of that one too?)
@ritchie46: a dedicated low-overhead init path for basic/key data types sounds great :)
Wow, you guys are as fast as polars itself ;-)
(@jakob-keller; got a particular example of that one too?)
Sure thing:
Without `columns` argument:
```
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(100)) for c in range(100)}; columns = tuple((column, polars.Int64) for column in data)" "polars.from_dict(data)"
1000 loops, best of 5: 325 usec per loop
```
With `columns` argument:
```
python3 -m timeit -s "import polars.dependencies; polars.dependencies._NUMPY_AVAILABLE = False; data = {f'column_{c}': tuple(range(100)) for c in range(100)}; columns = tuple((column, polars.Int64) for column in data)" "polars.from_dict(data, columns)"
1000 loops, best of 5: 364 usec per loop
```
I believe the issue might be related to the fact that the columns argument always requires column names to be provided. Maybe this triggers renaming of any Series produced (i.e. rename()), even if the name is already correctly set.
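If that hypothesis holds, a guard along these lines would skip the redundant rename. This is a sketch against a simplified stand-in class, not polars internals:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimpleSeries:
    """Minimal stand-in for a named series; not polars' Series."""

    name: str
    values: tuple


def with_name(s: SimpleSeries, name: str) -> SimpleSeries:
    # only rename when the name actually differs, avoiding a no-op copy
    if s.name == name:
        return s
    return replace(s, name=name)
```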
Please note that the results reverse without `polars.dependencies._NUMPY_AVAILABLE = False`. It appears that this masks the underlying issue, if any. Therefore, I decided against opening a separate issue for now.
Fantastic job! I will confirm and close as soon as 0.15.12 lands. Feel free to close earlier at your convenience.
Confirmed; the linked PR fixes the regression (which was worse than I realised) and may actually make things ~5% faster than before:
```python
with Timer():
    for i in range(1000):
        _df = pl.DataFrame(
            {f"column_{c}": tuple(range(1000)) for c in range(100)}
        )
```
- `0.15.8`: 2.4187 secs
- `0.15.9`: 4.5614 secs << regression
- `0.15.13` (post-PR): 2.2779 secs
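(`Timer` in the snippet above isn't shown in the thread; a minimal stand-in for reproducing the numbers could look like this - an assumption, not the original helper:)

```python
import time


class Timer:
    """Tiny timing context manager that prints elapsed wall time."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.elapsed:.4f} secs")
```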
Now to take it further...
Fantastic job! I will confirm and close as soon as 0.15.12 lands. Feel free to close earlier at your convenience.
No problem; thanks very much for the clean/detailed repro.
I can confirm that polars 0.15.13 addresses the regression as expected. Big thanks to everyone involved!