Polars ~20x slower than pandas in DataFrame creation for `list[...]` columns using generators
Description
Noticed this when rewriting a pandas workflow to polars: DataFrame creation was about 20x slower in polars (~4 sec) than in pandas (~0.2 sec).
I am creating a DataFrame using a python generator because I have to deal with multiple batches of deeply nested json data coming from an API.
The original code is something like this:
gen_prices = (
    price_scale["prices"]
    for batch in batches
    for article in batch["articles"]
    for price_scale in article["priceScales"]
)
pl.DataFrame(
    {
        "prices": gen_prices,
        # ...
    }
)
which yields a list of floats per row (next(gen_prices)).
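For context, here is a minimal sketch of what the nested input is assumed to look like (the field names are taken from the snippet above, the values are invented):
batches = [
    {
        "articles": [
            {"priceScales": [{"prices": [9.99, 8.49, 7.25]}]},
            {"priceScales": [{"prices": [3.10]}]},
        ]
    }
]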
It seems this has nothing to do with the generator itself but with the data type of the column.
It is performant for scalar types (int, float, string) but really slow for list types (list[float]/list[str]).
Example
import polars as pl
import pandas as pd
import numpy as np
DATA = np.random.rand(1_000_000, 2)
def get_data():
    return (row for row in DATA)
# pandas ~0.2 sec
pd.DataFrame({"x": get_data()})
# polars ~ 4 sec
pl.DataFrame({"x": get_data()})
NOTE: this happens only for list[...] types. If you change the data to
DATA = np.random.rand(1_000_000)
then polars is again faster.
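A quick way to see the contrast (a sketch; exact timings depend on the machine and versions):
import numpy as np
import pandas as pd
import polars as pl

# 1-D input: the generator now yields scalar floats instead of ndarrays
DATA = np.random.rand(1_000_000)

def get_data():
    return (x for x in DATA)

pd.DataFrame({"x": get_data()})  # pandas
pl.DataFrame({"x": get_data()})  # polars, faster again for the scalar case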
The issue here isn't that it's a generator, it's that it's an array of np.ndarrays, causing numpy_to_pyseries to be called for each of the 1 million elements. This happens without a generator:
x = [*get_data()]
pl.Series(x) # slow
pd.Series(x) # fast
The reason pandas is so much faster is that pandas simply assigns the object dtype and performs no transformations on the data, whereas Polars has to internally convert the data to polars lists. If you do the same in polars and supply dtype=pl.Object, you'll see that polars vastly outperforms pandas:
# file: check_list.ipynb
import polars as pl
import pandas as pd
import numpy as np
DATA = np.random.rand(1_000_000, 2)
def get_data():
    return (row for row in DATA)
# pandas 11.6s
pd.Series(get_data())
# polars 4.0s
pl.DataFrame({"x": get_data()})
As you say, the fact that it's a generator doesn't matter; it's just that it's a list.
For me pl.DataFrame({"x": DATA}) takes 1m17s with:
--------Version info---------
Polars: 0.19.15
Index type: UInt32
Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python: 3.10.13 (main, Nov 1 2023, 14:20:38) [GCC 10.2.1 20210110]
----Optional dependencies----
adbc_driver_sqlite: <not installed>
cloudpickle: 3.0.0
connectorx: 0.3.1
deltalake: <not installed>
fsspec: 2023.10.0
gevent: <not installed>
matplotlib: 3.8.1
numpy: 1.26.1
openpyxl: 3.1.2
pandas: 2.1.2
pyarrow: 14.0.0
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: 1.0.10
sqlalchemy: 2.0.23
xlsx2csv: 0.8.1
xlsxwriter: 3.1.9
When I update to 0.19.18 it took 7.5s the first time, but then I ran it a couple more times and got 21.6s and then 23.2s. No idea why the times vary so much.
@mcrumiller is on exactly the right track with why the initial pandas init is that much faster - it just reads out the generator as a list and directly assigns that list to the underlying BlockManager. However... you are almost certain to eat the consolidation cost of transforming this collection into a real 2D ndarray later if you do almost anything meaningful with the given DataFrame:
From Pandas' Internal Architecture^1 section:
When you would invoke an operation that benefited from a single consolidated
2-dimensional ndarray of say float64 dtype (for example: using reindex or
performing a row-oriented operation), the BlockManager would glue together
its accumulated pieces to create a single 2D ndarray of each data type.
This is called consolidation in the codebase.
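A rough way to observe that deferred cost on the pandas side (a sketch; np.vstack stands in for "anything meaningful" that needs a real 2-D float array):
import time
import numpy as np
import pandas as pd

DATA = np.random.rand(1_000_000, 2)

t0 = time.perf_counter()
df = pd.DataFrame({"x": (row for row in DATA)})  # cheap: one object column, no conversion
t1 = time.perf_counter()

# materialising a proper 2-D float64 array from the object column pays the cost later
arr = np.vstack(df["x"].to_numpy())
t2 = time.perf_counter()

print(f"init:        {t1 - t0:.3f}s")
print(f"materialise: {t2 - t1:.3f}s")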
Nevertheless, we can probably do better, so I will get profiling and see where we can speed things up.
Thanks for your ideas and insights :) You are right, just referring to the python object in the column is always much easier/faster.
Just a side note: using the numpy data directly as input in polars without the generator is in my case 200x faster. So this seems to do a lot of optimisation as well.
# 0.2 sec (numpy ndarray)
pl.DataFrame({"x": DATA})
# 40 sec (generator)
pl.DataFrame({"x": get_data()})
Just a side note: using the numpy data directly as input in polars without the generator is in my case 200x faster. So this seems to do a lot of optimisation as well.
Sure, because there we only have one ndarray. With the generator, it's collected into a list of 1 million ndarrays.
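To make the difference concrete (reusing DATA and get_data from the example above):
one_block = DATA                 # a single contiguous ndarray, shape (1_000_000, 2)
many_pieces = list(get_data())   # a Python list of one million small 1-D ndarrays

print(type(one_block), one_block.shape)
print(type(many_pieces), len(many_pieces), type(many_pieces[0]))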
Got this case (init from list of numpy arrays) down to only a ~4x difference now, which feels pretty good considering that includes 100% of the consolidation costs on the Polars side (vs none for Pandas). PR incoming 🤔
using the numpy data directly as input in polars without the generator is in my case 200x faster.
Yes, if you can pass the ndarray straight in that will take advantage of a recent optimisation - which I'm also leaning on for the imminent PR, but initialising from a list of arrays requires a consolidation step so will always be slower.
Wow amazing!
My real world use case (as explained at the top) is combining data of python lists inside different deeply nested json/dicts. Probably not much optimization potential here? 😄
~~Note: I marked the linked PR as closing this~~ (updated)
FYI: with DATA set to (10_000_000, 2) the performance is nearly 10x what it was before; I suspect the degree of speedup will actually continue to increase with larger data sizes...
@alexander-beedie Thank you so much for the great work, I really appreciate it.
However, with the current optimizations my "real world use case" probably will not change at all I fear 😞
As I said in my original post, I need to consolidate data from batches of json/dicts, so I am working with python objects/lists and not numpy.
But I guess there must also be optimization potential for this if you look at the example below which is much closer to a real scenario:
import polars as pl  # needed for pl.DataFrame below
from random import randint

NUM_ITEMS = 1_000_000

# create fake json data
DATA = [
    dict(
        values=[i for i in range(randint(1, 3))],
        number=7,
    )
    for _ in range(NUM_ITEMS)
]
DATA[:4]
# [{'values': [0, 1], 'number': 7},
#  {'values': [0, 1, 2], 'number': 7},
#  {'values': [0], 'number': 7},
#  {'values': [0], 'number': 7}]

gen_values = (d["values"] for d in DATA)
gen_number = (d["number"] for d in DATA)

pl.DataFrame(
    {
        "values": gen_values,  # ~4630 ms
        "number": gen_number,  # ~85 ms (55x faster!!)
    }
)
shape: (1_000_000, 2)
┌───────────┬────────┐
│ values ┆ number │
│ --- ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪════════╡
│ [0, 1, 2] ┆ 7 │
│ [0] ┆ 7 │
│ [0] ┆ 7 │
│ [0] ┆ 7 │
│ [0, 1] ┆ 7 │
│ … ┆ … │
│ [0, 1, 2] ┆ 7 │
│ [0, 1] ┆ 7 │
│ [0, 1] ┆ 7 │
│ [0] ┆ 7 │
│ [0] ┆ 7 │
│ [0, 1, 2] ┆ 7 │
└───────────┴────────┘
There seems to be a vast overhead in creating the list[...] column.
- number column: 0.085s (85 ms)
- list column: 4.63s (4630 ms)
The list column takes 55x longer to create. 😲
If you say there is nothing to optimize for this use case, I am fine with you closing this issue, but it would be awesome if you could look into it, because this might cover more real world use cases! 😃
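As an aside, one thing that might work around the slow list construction in the meantime (untested here, and the pyarrow route is only an assumption on my part) is letting pyarrow build the list column and handing it to polars:
import pyarrow as pa
import polars as pl

# DATA is the fake json data from the example above
values = [d["values"] for d in DATA]
numbers = [d["number"] for d in DATA]

df = pl.DataFrame(
    {
        "values": pl.from_arrow(pa.array(values)),  # list[i64] column built by pyarrow
        "number": numbers,
    }
)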
However, with the current optimizations my "real world use case" probably will not change at all I fear 😞
Ahhh... gotcha; yes, this new optimisation only applies to lists of numpy arrays at the moment. Will change the "closes" tag to just "ref" in the linked PR and revisit the plain "list of lists" case - let's see if we can squeeze more juice from this lemon 😄🍋
@alexander-beedie I listed this as P-low since you've already dipped your toe in the water and maybe have an idea how to get more juice out of the lemon