
Polars ~20x slower than pandas in DataFrame creation for `list[...]` columns using generators

Open Julian-J-S opened this issue 1 year ago • 11 comments

Description

While rewriting a pandas workflow to polars, I noticed that DataFrame creation was about 20x slower in polars (~4 sec) than in pandas (~0.2 sec).

I am creating a DataFrame using a python generator because I have to deal with multiple batches of deeply nested json data coming from an API.

The original code is something like this

gen_prices = (
    price_scale["prices"]
    for batch in batches
    for article in batch["articles"]
    for price_scale in article["priceScales"]
)

pl.DataFrame(
    {
        "prices": gen_prices,
        # ...
    }
)

which yields a list of floats per row (each next(gen_prices) is a list). It seems this has nothing to do with the generator itself but with the data type of the column: creation is fast for scalar types (int, float, string) but really slow for list types (list[float]/list[str]).

Example

import polars as pl
import pandas as pd
import numpy as np

DATA = np.random.rand(1_000_000, 2)

def get_data():
    return (row for row in DATA)

# pandas ~0.2 sec
pd.DataFrame({"x": get_data()})

# polars ~ 4 sec
pl.DataFrame({"x": get_data()})

NOTE: this happens only for list[...] types. If you change the data to

  • DATA = np.random.rand(1_000_000)

then polars is again faster

Julian-J-S avatar Nov 29 '23 15:11 Julian-J-S

The issue here isn't that it's a generator, it's that the input is a sequence of np.ndarrays, causing numpy_to_pyseries to be called for each of the 1 million elements. The same slowdown occurs without a generator:

x = [*get_data()]
pl.Series(x)  # slow
pd.Series(x)  # fast

The reason why pandas is so much faster is because pandas simply assigns it the object dtype and performs no transformations on the data, whereas Polars has to internally convert the data to polars lists. If you do the same in polars and supply dtype=pl.Object, you'll see that polars vastly outperforms pandas:

import polars as pl
import pandas as pd
import numpy as np

DATA = np.random.rand(1_000_000, 2)

def get_data():
    return (row for row in DATA)

# pandas 11.6s
pd.Series(get_data())

# polars 4.0s
pl.DataFrame({"x": get_data()})

mcrumiller avatar Nov 29 '23 18:11 mcrumiller

As you say, the fact that it's a generator doesn't matter, it's just that it's a list

For me pl.DataFrame({"x": DATA}) takes 1m17s

with

--------Version info---------
Polars:              0.19.15
Index type:          UInt32
Platform:            Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:              3.10.13 (main, Nov  1 2023, 14:20:38) [GCC 10.2.1 20210110]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         3.0.0
connectorx:          0.3.1
deltalake:           <not installed>
fsspec:              2023.10.0
gevent:              <not installed>
matplotlib:          3.8.1
numpy:               1.26.1
openpyxl:            3.1.2
pandas:              2.1.2
pyarrow:             14.0.0
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              1.0.10
sqlalchemy:          2.0.23
xlsx2csv:            0.8.1
xlsxwriter:          3.1.9

When I update to 0.19.18 it took 7.5s the first time, but a couple more runs gave 21.6s and then 23.2s.

No idea why the huge variation in times.

deanm0000 avatar Nov 29 '23 18:11 deanm0000

@mcrumiller is on exactly the right track with why the initial pandas init is that much faster - it just reads out the generator as a list and directly assigns that list to the underlying BlockManager. However... you are almost certain to eat the consolidation cost of transforming this collection into a real 2D ndarray later if you do almost anything meaningful with the given DataFrame:

From Pandas' Internal Architecture^1 section:

When you would invoke an operation that benefited from a single consolidated 
2-dimensional ndarray of say float64 dtype (for example: using reindex or 
performing a row-oriented operation), the BlockManager would glue together 
its accumulated pieces to create a single 2D ndarray of each data type. 
This is called consolidation in the codebase.

Nevertheless, we can probably do better, so I will get profiling and see where we can speed things up.
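A small illustration (my own sketch) of that deferred cost: pandas accepts the per-row ndarrays as-is under object dtype, and the real 2-D consolidation only happens when you ask for it.

```python
import numpy as np
import pandas as pd

rows = [np.random.rand(2) for _ in range(1_000)]

# No transformation at init time: the column is object dtype, holding
# 1,000 separate ndarrays.
df = pd.DataFrame({"x": rows})
print(df["x"].dtype)

# Materialising a proper (1_000, 2) float64 array pays the cost that
# pandas skipped during construction.
arr = np.stack(df["x"].to_numpy())
print(arr.shape, arr.dtype)
```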

alexander-beedie avatar Nov 29 '23 19:11 alexander-beedie

Thanks for your ideas and insights :) You are right, simply storing a reference to the Python object in the column is always much easier/faster.

Just a side note: using the numpy data directly as input to polars, without the generator, is in my case 200x faster. So polars seems to apply a lot of optimisation there as well.

# 0.2 sec (numpy ndarray)
pl.DataFrame({"x": DATA})


# 40 sec (generator)
pl.DataFrame({"x": get_data()})

Julian-J-S avatar Nov 29 '23 21:11 Julian-J-S

Just a side note: using the numpy data directly as input in polars without the generator is in my case 200x faster. So this seems to do a lot of optimisation as well.

Sure, because there we only have one ndarray. With the generator, it's collected into a list of 1 million ndarrays.

mcrumiller avatar Nov 29 '23 21:11 mcrumiller

Got this case (init from list of numpy arrays) down to only a ~4x difference now, which feels pretty good considering that includes 100% of the consolidation costs on the Polars side (vs none for Pandas). PR incoming 🤔

using the numpy data directly as input in polars without the generator is in my case 200x faster.

Yes, if you can pass the ndarray straight in that will take advantage of a recent optimisation - which I'm also leaning on for the imminent PR, but initialising from a list of arrays requires a consolidation step so will always be slower.

alexander-beedie avatar Nov 29 '23 21:11 alexander-beedie

Wow amazing!

My real world use case (as explained at the top) is combining data of python lists inside different deeply nested json/dicts. Probably not much optimization potential here? 😄

Julian-J-S avatar Nov 29 '23 21:11 Julian-J-S

~~Note: I marked the linked PR as closing this~~ (updated)

FYI: with DATA set to (10_000_000, 2) the performance is nearly 10x what it was before; I suspect the degree of speedup will actually continue to increase with larger data sizes...

alexander-beedie avatar Nov 29 '23 21:11 alexander-beedie

@alexander-beedie Thank you so much for the great work, I really appreciate it.

However, I fear my "real world use case" will probably not change at all with the current optimizations 😞

As I said in my original post, I need to consolidate data from batches of json/dicts, so I am working with Python objects/lists and not numpy.

But I guess there must be optimization potential here too; the example below is much closer to a real scenario:

from random import randint

NUM_ITEMS = 1_000_000

# create fake json data
DATA = [
    dict(
        values=[i for i in range(randint(1, 3))],
        number=7,
    )
    for _ in range(NUM_ITEMS)
]

DATA[:4]
#[{'values': [0, 1], 'number': 7},
# {'values': [0, 1, 2], 'number': 7},
# {'values': [0], 'number': 7},
# {'values': [0], 'number': 7}]

gen_values = (d["values"] for d in DATA)
gen_number = (d["number"] for d in DATA)

pl.DataFrame(
    {
        "values": gen_values,   # ~4630 ms
        "number": gen_number,   #   ~85 ms (55x faster!!)
    }
)
shape: (1_000_000, 2)
┌───────────┬────────┐
│ values    ┆ number │
│ ---       ┆ ---    │
│ list[i64] ┆ i64    │
╞═══════════╪════════╡
│ [0, 1, 2] ┆ 7      │
│ [0]       ┆ 7      │
│ [0]       ┆ 7      │
│ [0]       ┆ 7      │
│ [0, 1]    ┆ 7      │
│ …         ┆ …      │
│ [0, 1, 2] ┆ 7      │
│ [0, 1]    ┆ 7      │
│ [0, 1]    ┆ 7      │
│ [0]       ┆ 7      │
│ [0]       ┆ 7      │
│ [0, 1, 2] ┆ 7      │
└───────────┴────────┘

There seems to be vast overhead in creating the list[...] column.

  • number column: 0.085s (85 ms)
  • list column: 4.63s (4630 ms)

The list column takes 55x longer to create. 😲

If you say there is nothing to optimize for this use case, I am fine with you closing this issue, but it would be awesome if you could look into it, because it likely covers many more real-world use cases! 😃

Julian-J-S avatar Nov 30 '23 07:11 Julian-J-S

However, with the current optimizations my "real world use case" probably will not change at all I fear 😞

Ahhh... gotcha; yes, this new optimisation only applies to lists of numpy arrays at the moment. Will change the "closes" tag to just "ref" in the linked PR and revisit the plain "list of lists" case - let's see if we can squeeze more juice from this lemon 😄🍋

alexander-beedie avatar Nov 30 '23 08:11 alexander-beedie

@alexander-beedie I listed this as P-low since you've already dipped your toe in the water and maybe have an idea how to get more juice out of the lemon

deanm0000 avatar Jan 10 '24 14:01 deanm0000