
Polars to_numpy slower with chunked array than going via pandas

Open dannyfriar opened this issue 1 year ago • 7 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import numpy as np

num_rows, num_cols = 100_000, 2_000
data = np.random.randn(num_rows, num_cols).astype(np.float32)

num_chunks = 5
dfs = []

for _ in range(num_chunks):
    dfs.append(
        pl.DataFrame(data=data.copy(), schema=[f"col_{j}" for j in range(num_cols)])
    )

df = pl.concat(dfs, rechunk=False)

%%timeit
x = df.to_numpy()

%%timeit
x = df.to_arrow().to_pandas().to_numpy()

Log output

## `to_numpy()`
549 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## `to_arrow().to_pandas().to_numpy()`
383 ms ± 213 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Issue description

Converting a polars DataFrame with multiple chunks to a numpy array appears to be slower than going via pandas when there is a large number of columns.

The difference is not huge, and the pandas route has much more variance, but it would still be good to understand why the second method is sometimes faster.

Expected behavior

`to_numpy()` is consistently at least as fast as the pandas route.

Installed versions

--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             Linux-5.15.0-91-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.6.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.27
torch:                2.2.2+cu121
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

dannyfriar avatar May 16 '24 09:05 dannyfriar

I can't replicate your timings on the latest release (on Mac though, so could be some platform differences) 🤔

This PR from @stinodego looks like it would be related to this issue and it should be part of the latest release - are you sure your benchmark was run with 0.20.26? This is what I get (using identical pandas/pyarrow library versions, as per your "Details" output):

  • to_numpy
    %timeit df.to_numpy()
    # 193 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
  • to_arrow().to_pandas().to_numpy() ~2x slower
    %timeit df.to_arrow().to_pandas().to_numpy()
    # 381 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

alexander-beedie avatar May 16 '24 11:05 alexander-beedie

That's interesting. Yeah I'm definitely running version 0.20.26.

I originally found this about 6 months ago, and at the time I was also unable to reproduce it on a Mac. I don't have one to hand to test now. Do you see the same if you run on Linux?

dannyfriar avatar May 16 '24 11:05 dannyfriar

Do you see the same results if you run df.to_numpy(use_pyarrow=False)? PyArrow is still the default engine, and it rechunks when converting to PyArrow.

stinodego avatar May 16 '24 11:05 stinodego

[screenshot: timing results with `use_pyarrow=False`]

This is what I get with `use_pyarrow=False`. Everything else is the same as above.

dannyfriar avatar May 16 '24 11:05 dannyfriar

The PR linked by Alex is indeed related. I discussed with Ritchie and I see now that I introduced a regression for chunked data that would otherwise be zero-copyable. I'll fix that.

Apparently there is also an inefficiency in the DataFrame-specific to_numpy code that we can address.

stinodego avatar May 16 '24 11:05 stinodego

So I am running with polars==0.20.25, and my timings are even worse:

## `to_numpy()`
844 ms ± 9.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## `to_numpy(use_pyarrow=False)`
860 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## `to_arrow().to_pandas().to_numpy()`
441 ms ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

radomirgr avatar May 16 '24 12:05 radomirgr

I have also tested polars==0.20.26 and the results are very similar:

## `to_numpy()`
853 ms ± 5.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## `to_numpy(use_pyarrow=False)`
866 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

## `to_arrow().to_pandas().to_numpy()`
481 ms ± 84.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

radomirgr avatar May 16 '24 12:05 radomirgr