Polars to_numpy slower with chunked array than going via pandas
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
```python
import polars as pl
import numpy as np

num_rows, num_cols = 100_000, 2_000
data = np.random.randn(num_rows, num_cols).astype(np.float32)

num_chunks = 5
dfs = []
for i in range(num_chunks):
    dfs.append(
        pl.DataFrame(data=data.copy(), schema=[f"col_{i}" for i in range(num_cols)])
    )
df = pl.concat(dfs, rechunk=False)
```

```python
%%timeit
x = df.to_numpy()
```

```python
%%timeit
x = df.to_arrow().to_pandas().to_numpy()
```
Log output
## `to_numpy()`
549 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## `to_arrow().to_pandas().to_numpy()`
383 ms ± 213 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Issue description
Converting a Polars DataFrame with multiple chunks to a NumPy array appears to be slower than going via pandas when there is a large number of columns.
The difference is not huge, and the pandas route has much higher variance, but it would still be good to understand why the second method appears to be faster in some cases.
Expected behavior
`to_numpy()` is consistently at least as fast as the pandas route in all cases.
Installed versions
--------Version info---------
Polars: 0.20.26
Index type: UInt32
Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: 3.0.0
connectorx: <not installed>
deltalake: <not installed>
fastexcel: <not installed>
fsspec: 2024.3.1
gevent: <not installed>
hvplot: <not installed>
matplotlib: 3.8.4
nest_asyncio: 1.6.0
numpy: 1.26.4
openpyxl: <not installed>
pandas: 2.2.2
pyarrow: 16.1.0
pydantic: 2.6.3
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 2.0.27
torch: 2.2.2+cu121
xlsx2csv: <not installed>
xlsxwriter: <not installed>
I can't replicate your timings on the latest release (on Mac though, so could be some platform differences) 🤔
This PR from @stinodego looks like it would be related to this issue and it should be part of the latest release - are you sure your benchmark was run with 0.20.26? This is what I get (using identical pandas/pyarrow library versions, as per your "Details" output):
```python
%timeit df.to_numpy()
# 193 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.to_arrow().to_pandas().to_numpy()  # ~2x slower
# 381 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
That's interesting. Yeah, I'm definitely running version 0.20.26.
I originally found this about six months ago, and at the time I was unable to reproduce it on a Mac either. I don't have one to hand to test now. Do you see the same if you run on Linux?
Do you see the same results if you run df.to_numpy(use_pyarrow=False)? PyArrow is still the default engine, and it rechunks when converting to PyArrow.
This is what I get with `use_pyarrow=False`; everything else is the same as above.
The PR linked by Alex is indeed related. I discussed with Ritchie and I see now that I introduced a regression for chunked data that would otherwise be zero-copyable. I'll fix that.
Apparently there is also an inefficiency in the DataFrame-specific to_numpy code that we can address.
So, I am running with `polars==0.20.25`, and my timings are even worse:
## `to_numpy()`
844 ms ± 9.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## `to_numpy(use_pyarrow=False)`
860 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## `to_arrow().to_pandas().to_numpy()`
441 ms ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have also tested `polars==0.20.26` and the results are very similar:
## `to_numpy()`
853 ms ± 5.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## `to_numpy(use_pyarrow=False)`
866 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
## `to_arrow().to_pandas().to_numpy()`
481 ms ± 84.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)