
polars cumsum over a group is slower than pandas groupby cumsum

Open ChristopherRussell opened this issue 2 years ago • 5 comments

Problem description

Hi, I'm wondering if there is room to improve this functionality?

import pandas as pd
import numpy as np
import polars as pl

df = pd.DataFrame({'val': np.random.randn(5_000_000), 'key': ['a', 'b', 'c', 'd', 'e'] * 1_000_000})
df.key = df.key.astype('category')

%%time
df.groupby('key').val.cumsum()
# takes 50ms

pdf = pl.from_pandas(df)

%%time
pdf.select(pl.col('val').cumsum().over('key'))
# takes 200ms

ChristopherRussell • Dec 13 '22 16:12

You are doing a window function, not a groupby?

ritchie46 • Dec 13 '22 16:12
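
(For readers newer to polars, a minimal sketch of the distinction being drawn here, assuming the pdf frame from the snippet above:)

# window function: evaluated per key group, but returns one value per input
# row, keeping the original row order
pdf.select(pl.col('val').cumsum().over('key'))

# groupby aggregation: returns one row per key, with the cumulative sums
# collected into a list column
pdf.groupby('key').agg(pl.col('val').cumsum())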

New here, so perhaps I don't understand how to achieve the same result as the pandas groupby cumsum in polars. The snippet df.groupby('key').val.cumsum() returns a 'transformed' result, i.e. it has one row per row of the dataframe rather than one row per group.

ChristopherRussell • Dec 13 '22 22:12

You can .groupby().agg().explode(), which gets closer to the pandas time.

>>> %timeit pdf.select(pl.col('val').cumsum().over('key'))
189 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pdf.groupby("key").agg(pl.col("val").cumsum()).explode("val")
117 ms ± 1.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.groupby('key').val.cumsum()
58.2 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

cmdlineluser • Dec 13 '22 23:12

Thanks for the alternative solution @cmdlineluser, useful as a polars newbie. It is a little different in that it changes the order of the output (which could be useful sometimes). I'm still a bit curious about the difference in performance, though. I've found polars to outperform pandas in all other areas so far!

ChristopherRussell • Dec 14 '22 08:12

groupby can maintain the original order if desired: .groupby("key", maintain_order=True). Hopefully others with more knowledge than me can comment on the performance.

cmdlineluser • Dec 14 '22 17:12
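
(A sketch combining the two suggestions so far, again assuming the pdf frame from the original snippet; maintain_order=True keeps groups in order of first appearance of each key:)

(
    pdf.groupby('key', maintain_order=True)
       .agg(pl.col('val').cumsum())
       .explode('val')
)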

> groupby can maintain the original order if desired: .groupby("key", maintain_order=True). Hopefully others with more knowledge than me can comment on the performance.

It seems that .group_by("key", maintain_order=True).agg(pl.col("val").cumsum()).explode("val") is still different from .select(pl.col('val').cumsum().over('key')), since the former regroups the rows by key. For example, if key=['a', 'b', 'a', 'b'], the former changes the order to ['a', 'a', 'b', 'b'] while the latter does not.
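
(A quick reproduction of that difference; the frame name small and its values are made up for illustration:)

import polars as pl

small = pl.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

# window function: original row order preserved -> val = [1, 2, 4, 6]
small.select(pl.col('val').cumsum().over('key'))

# group_by + explode: rows regrouped by key -> val = [1, 4, 2, 6]
small.group_by('key', maintain_order=True).agg(pl.col('val').cumsum()).explode('val')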

Is there a better equivalent expression?

desprado2 • Nov 16 '23 08:11
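
(One possible workaround, sketched here rather than proposed in the thread: carry a row index through the group_by and sort back afterwards to restore the original order; whether this beats .over('key') on speed is untested. Assumes the pdf frame from the original snippet:)

(
    pdf.with_row_count('idx')
       .group_by('key', maintain_order=True)
       .agg(pl.col('idx'), pl.col('val').cumsum())
       .explode('idx', 'val')
       .sort('idx')
       .select('val')
)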

@desprado2 Very true, I missed that at the time - my bad.

Are you having performance issues with .select(pl.col('val').cumsum().over('key'))?

Trying this now I get:

PANDAS     Elapsed time: 0.2451 seconds
PANDAS CAT Elapsed time: 0.2743 seconds
POLARS     Elapsed time: 0.2847 seconds
POLARS CAT Elapsed time: 0.1968 seconds

cmdlineluser • Nov 16 '23 09:11

> @desprado2 Very true, I missed that at the time - my bad.
>
> Are you having performance issues with .select(pl.col('val').cumsum().over('key'))?
>
> Trying this now I get:
>
> PANDAS     Elapsed time: 0.2451 seconds
> PANDAS CAT Elapsed time: 0.2743 seconds
> POLARS     Elapsed time: 0.2847 seconds
> POLARS CAT Elapsed time: 0.1968 seconds

It seems that polars is still slower:

import numpy as np
import pandas as pd
import polars as pl


large_pdf = pd.DataFrame({"val": np.random.randn(5_000_000), "key": np.random.randint(114, 514, (5_000_000))})
large_pdf.set_index("key", inplace=True)
large_plf = pl.from_pandas(large_pdf, include_index=True)

%%time
large_plf.lazy().select(pl.col("val").cumsum().over("key")).collect()
# CPU times: user 316 ms, sys: 145 ms, total: 461 ms
# Wall time: 168 ms

%%time
large_pdf.groupby("key")["val"].cumsum()
# CPU times: user 70.7 ms, sys: 3.75 ms, total: 74.5 ms
# Wall time: 73.5 ms

desprado2 • Nov 16 '23 09:11

Here's an attempted benchmark that runs everything in isolation:

Code
import time
import numpy as np
import pandas as pd
import polars as pl


def create_df():
    return pd.DataFrame({"val": np.random.randn(N), "key": np.random.randint(114, 514, (N))})


def pandas_with_index():
    df = create_df()
    df.set_index("key", inplace=True)

    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()

    return time.perf_counter() - start


def pandas_with_cat_index():
    df = create_df()
    df.key = df.key.astype("category")
    df.set_index("key", inplace=True)

    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()

    return time.perf_counter() - start


def pandas_without_index():
    df = create_df()

    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()

    return time.perf_counter() - start


def polars_with_index():
    df = create_df()
    df.set_index("key", inplace=True)
    df = pl.from_pandas(df, include_index=True)

    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))

    return time.perf_counter() - start

def polars_with_cat_index():
    df = create_df()
    df.key = df.key.astype("category")
    df.set_index("key", inplace=True)

    df = pl.from_pandas(df, include_index=True)

    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))

    return time.perf_counter() - start


def polars_without_index():
    df = create_df()
    df = pl.from_pandas(df, include_index=True)

    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))
    return time.perf_counter() - start


N = 20_000_000

pkgs = "pandas", "polars"

tests = [
    "with_index",
    "with_cat_index",
    "without_index"
]

results = []

for pkg in pkgs:
    for test in tests:
        name = f"{pkg}_{test}"
        print(name)
        try:
            took = globals()[name]()
            results.append(
                {"name": name, "time": took}
            )
        except pl.ComputeError:
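            # pl.from_pandas could not convert the categorical index
            # (polars_with_cat_index), so that test gets skipped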
            print(f"[WARNING]: skipping test `{name}`")

print(
    pl.DataFrame(results)
)

Results

shape: (5, 2)
┌───────────────────────┬──────────┐
│ name                  ┆ time     │
│ ---                   ┆ ---      │
│ str                   ┆ f64      │
╞═══════════════════════╪══════════╡
│ pandas_with_index     ┆ 0.165075 │
│ pandas_with_cat_index ┆ 0.079915 │
│ pandas_without_index  ┆ 0.162487 │
│ polars_with_index     ┆ 0.363737 │
│ polars_without_index  ┆ 0.34091  │
└───────────────────────┴──────────┘

The original approach @ChristopherRussell posted, with the category index df.key.astype('category'), seems to be special-cased by pandas. (I couldn't convert that one using pl.from_pandas, so I had to skip it.)
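
(For reference, a minimal repro of the conversion that had to be skipped; the frame name tiny and its values are made up for illustration:)

import pandas as pd
import polars as pl

tiny = pd.DataFrame({'val': [1.0, 2.0], 'key': [114, 514]})
tiny.key = tiny.key.astype('category')
tiny.set_index('key', inplace=True)

# as noted above, this conversion raised pl.ComputeError at the time
pl.from_pandas(tiny, include_index=True)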

Perhaps @ritchie46 could take a look at the performance side of things when he has time.

cmdlineluser • Nov 16 '23 10:11