
polars cumsum over a group is slower than pandas groupby cumsum

Open ChristopherRussell opened this issue 2 years ago • 5 comments

Problem description

Hi, I'm wondering if there is room to improve this functionality?

import pandas as pd
import numpy as np
import polars as pl

df = pd.DataFrame({'val': np.random.randn(5_000_000), 'key': ['a', 'b', 'c', 'd', 'e'] * 1_000_000})
df.key = df.key.astype('category')

%%time
df.groupby('key').val.cumsum()
# takes 50ms

pdf = pl.from_pandas(df)

%%time
pdf.select(pl.col('val').cumsum().over('key'))
# takes 200ms

ChristopherRussell • Dec 13 '22 16:12

You are doing a window function, not a groupby?

ritchie46 • Dec 13 '22 16:12
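
(For readers newer to polars, a minimal sketch of the distinction being drawn here, assuming the pdf frame from the snippet above:)

# window function: evaluated per key group, but returns one value per input
# row, keeping the original row order
pdf.select(pl.col('val').cumsum().over('key'))

# groupby aggregation: returns one row per key, with the cumulative sums
# collected into a list column
pdf.groupby('key').agg(pl.col('val').cumsum())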

New here, so perhaps I don't understand how to achieve the same result as the pandas groupby cumsum in polars. The snippet df.groupby('key').val.cumsum() returns a 'transformed' result, i.e. it has one row per row of the dataframe rather than one row per group.

ChristopherRussell • Dec 13 '22 22:12

You can .groupby().agg().explode(), which gets closer to the pandas time.

>>> %timeit pdf.select(pl.col('val').cumsum().over('key'))
189 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pdf.groupby("key").agg(pl.col("val").cumsum()).explode("val")
117 ms ± 1.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.groupby('key').val.cumsum()
58.2 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

cmdlineluser • Dec 13 '22 23:12

Thanks for the alternative solution @cmdlineluser, useful as a polars newbie. It is a little different in that it changes the order of the output (which could be useful sometimes). I'm still a bit curious about the difference in performance, though. I've found polars to outperform pandas in all other areas so far!

ChristopherRussell • Dec 14 '22 08:12

groupby can maintain the original order if desired: .groupby("key", maintain_order=True). Hopefully others with more knowledge than me can comment on the performance.

cmdlineluser • Dec 14 '22 17:12
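
(A sketch combining the two suggestions so far, again assuming the pdf frame from the original snippet; maintain_order=True keeps groups in order of first appearance of each key:)

(
    pdf.groupby('key', maintain_order=True)
       .agg(pl.col('val').cumsum())
       .explode('val')
)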

> groupby can maintain the original order if desired: .groupby("key", maintain_order=True). Hopefully others with more knowledge than me can comment on the performance.

It seems that .group_by("key", maintain_order=True).agg(pl.col("val").cumsum()).explode("val") is still different from .select(pl.col('val').cumsum().over('key')), since the former regroups the rows by key. For example, if key=['a', 'b', 'a', 'b'], the former changes the order to ['a', 'a', 'b', 'b'] while the latter does not.
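
(A quick reproduction of that difference; the frame name small and its values are made up for illustration:)

import polars as pl

small = pl.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

# window function: original row order preserved -> val = [1, 2, 4, 6]
small.select(pl.col('val').cumsum().over('key'))

# group_by + explode: rows regrouped by key -> val = [1, 4, 2, 6]
small.group_by('key', maintain_order=True).agg(pl.col('val').cumsum()).explode('val')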

Is there a better equivalent expression?

desprado2 • Nov 16 '23 08:11
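
(One possible workaround, sketched here rather than proposed in the thread: carry a row index through the group_by and sort back afterwards to restore the original order; whether this beats .over('key') on speed is untested. Assumes the pdf frame from the original snippet:)

(
    pdf.with_row_count('idx')
       .group_by('key', maintain_order=True)
       .agg(pl.col('idx'), pl.col('val').cumsum())
       .explode('idx', 'val')
       .sort('idx')
       .select('val')
)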

@desprado2 Very true, I missed that at the time - my bad.

Are you having performance issues with .select(pl.col('val').cumsum().over('key'))?

Trying this now I get:

PANDAS     Elapsed time: 0.2451 seconds
PANDAS CAT Elapsed time: 0.2743 seconds
POLARS     Elapsed time: 0.2847 seconds
POLARS CAT Elapsed time: 0.1968 seconds

cmdlineluser • Nov 16 '23 09:11

> @desprado2 Very true, I missed that at the time - my bad.
>
> Are you having performance issues with .select(pl.col('val').cumsum().over('key'))?
>
> Trying this now I get:
>
> PANDAS     Elapsed time: 0.2451 seconds
> PANDAS CAT Elapsed time: 0.2743 seconds
> POLARS     Elapsed time: 0.2847 seconds
> POLARS CAT Elapsed time: 0.1968 seconds

It seems that polars is still slower:

import numpy as np
import pandas as pd
import polars as pl


large_pdf = pd.DataFrame({"val": np.random.randn(5_000_000), "key": np.random.randint(114, 514, (5_000_000))})
large_pdf.set_index("key", inplace=True)
large_plf = pl.from_pandas(large_pdf, include_index=True)

%%time
large_plf.lazy().select(pl.col("val").cumsum().over("key")).collect()
# CPU times: user 316 ms, sys: 145 ms, total: 461 ms
# Wall time: 168 ms

%%time
large_pdf.groupby("key")["val"].cumsum()
# CPU times: user 70.7 ms, sys: 3.75 ms, total: 74.5 ms
# Wall time: 73.5 ms

desprado2 • Nov 16 '23 09:11

Here's an attempted benchmark that runs everything in isolation:

Code
import time
import numpy as np
import pandas as pd
import polars as pl


def create_df():
    return pd.DataFrame({"val": np.random.randn(N), "key": np.random.randint(114, 514, (N))})


def pandas_with_index():
    df = create_df()
    df.set_index("key", inplace=True)

    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()

    return time.perf_counter() - start


def pandas_with_cat_index():
    df = create_df()
    df.key = df.key.astype("category")
    df.set_index("key", inplace=True)

    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()

    return time.perf_counter() - start


def pandas_without_index():
    df = create_df()

    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()

    return time.perf_counter() - start


def polars_with_index():
    df = create_df()
    df.set_index("key", inplace=True)
    df = pl.from_pandas(df, include_index=True)

    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))

    return time.perf_counter() - start

def polars_with_cat_index():
    df = create_df()
    df.key = df.key.astype("category")
    df.set_index("key", inplace=True)

    df = pl.from_pandas(df, include_index=True)

    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))

    return time.perf_counter() - start


def polars_without_index():
    df = create_df()
    df = pl.from_pandas(df, include_index=True)

    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))
    return time.perf_counter() - start


N = 20_000_000

pkgs = "pandas", "polars"

tests = [
    "with_index",
    "with_cat_index",
    "without_index"
]

results = []

for pkg in pkgs:
    for test in tests:
        name = f"{pkg}_{test}"
        print(name)
        try:
            took = globals()[name]()
            results.append(
                {"name": name, "time": took}
            )
        except pl.ComputeError:
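            # pl.from_pandas could not convert the categorical index
            # (polars_with_cat_index), so that test gets skipped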
            print(f"[WARNING]: skipping test `{name}`")

print(
    pl.DataFrame(results)
)

Results

shape: (5, 2)
┌───────────────────────┬──────────┐
│ name                  ┆ time     │
│ ---                   ┆ ---      │
│ str                   ┆ f64      │
╞═══════════════════════╪══════════╡
│ pandas_with_index     ┆ 0.165075 │
│ pandas_with_cat_index ┆ 0.079915 │
│ pandas_without_index  ┆ 0.162487 │
│ polars_with_index     ┆ 0.363737 │
│ polars_without_index  ┆ 0.34091  │
└───────────────────────┴──────────┘

The original approach @ChristopherRussell posted, with the category index df.key.astype('category'), seems to be special-cased by pandas. (I couldn't convert that one using pl.from_pandas, so I had to skip it.)
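
(For reference, a minimal repro of the conversion that had to be skipped; the frame name tiny and its values are made up for illustration:)

import pandas as pd
import polars as pl

tiny = pd.DataFrame({'val': [1.0, 2.0], 'key': [114, 514]})
tiny.key = tiny.key.astype('category')
tiny.set_index('key', inplace=True)

# as noted above, this conversion raised pl.ComputeError at the time
pl.from_pandas(tiny, include_index=True)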

Perhaps @ritchie46 could take a look at the performance side of things when he has time.

cmdlineluser • Nov 16 '23 10:11