polars cumsum over a group is slower than pandas groupby cumsum
Problem description
Hi, I'm wondering if there is room to improve this functionality?
import pandas as pd
import numpy as np
import polars as pl
df = pd.DataFrame({'val': np.random.randn(5_000_000), 'key': ['a', 'b', 'c', 'd', 'e'] * 1_000_000})
df.key = df.key.astype('category')
%%time
df.groupby('key').val.cumsum()
# takes 50ms
pdf = pl.from_pandas(df)
%%time
pdf.select(pl.col('val').cumsum().over('key'))
# takes 200ms
You are doing a window function, not a groupby?
New here - perhaps I don't understand how to achieve the same result as the pandas groupby cumsum in Polars. This snippet: df.groupby('key').val.cumsum() returns a 'transformed' groupby, so it has the same length as the dataframe has rows.
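(For what it's worth, the two do produce identical output; a minimal check, a sketch assuming the df/pdf frames and imports from the snippet above:)

import numpy as np

# Sanity-check sketch: .over() is a window function, so it keeps the frame's
# row order and should match the pandas "transformed" groupby element-for-element.
expected = df.groupby('key').val.cumsum().to_numpy()
actual = pdf.select(pl.col('val').cumsum().over('key')).to_series().to_numpy()
assert np.allclose(expected, actual)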
You can use .groupby().agg().explode(), which gets closer to the pandas time.
>>> %timeit pdf.select(pl.col('val').cumsum().over('key'))
189 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pdf.groupby("key").agg(pl.col("val").cumsum()).explode("val")
117 ms ± 1.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.groupby('key').val.cumsum()
58.2 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Thanks for the alternative solution @cmdlineluser - useful for a Polars newbie. It is a little different in that it changes the order of the output (which could be useful sometimes). Still a bit curious about the difference in performance. I've found Polars to outperform in all other areas so far!
groupby can maintain the original order if desired: .groupby("key", maintain_order=True) - hopefully others with more knowledge than myself can comment on the performance.
It seems that .group_by("key", maintain_order=True).agg(pl.col("val").cumsum()).explode("val") is still different from .select(pl.col('val').cumsum().over('key')), since the former groups the keys together. For example, if key=['a', 'b', 'a', 'b'], the former will change the order into ['a', 'a', 'b', 'b'] while the latter will not.
Is there a better equivalent expression?
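(One way to keep the group_by/explode approach while restoring the original row order is to carry a row index along. A sketch - with_row_index assumes a recent polars version; older releases call it with_row_count:)

# Sketch: tag each row with its position, aggregate, explode, then sort on
# the index to restore the original order. "idx" is just an arbitrary name.
out = (
    pdf.with_row_index("idx")
    .group_by("key", maintain_order=True)
    .agg(pl.col("val").cumsum(), pl.col("idx"))
    .explode("val", "idx")
    .sort("idx")
    .drop("idx")
)

Whether this still beats .over() once the extra sort is paid for would need measuring.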
@desprado2 Very true, I missed that at the time - my bad.
Are you having performance issues with .select(pl.col('val').cumsum().over('key')) ?
Trying this now I get:
PANDAS Elapsed time: 0.2451 seconds
PANDAS CAT Elapsed time: 0.2743 seconds
POLARS Elapsed time: 0.2847 seconds
POLARS CAT Elapsed time: 0.1968 seconds
It seems that polars is still slower:
import numpy as np
import pandas as pd
import polars as pl
large_pdf = pd.DataFrame({"val": np.random.randn(5_000_000), "key": np.random.randint(114, 514, (5_000_000))})
large_pdf.set_index("key", inplace=True)
large_plf = pl.from_pandas(large_pdf, include_index=True)
%%time
large_plf.lazy().select(pl.col("val").cumsum().over("key")).collect()
# CPU times: user 316 ms, sys: 145 ms, total: 461 ms
# Wall time: 168 ms
%%time
large_pdf.groupby("key")["val"].cumsum()
# CPU times: user 70.7 ms, sys: 3.75 ms, total: 74.5 ms
# Wall time: 73.5 ms
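(If a sorted output order is acceptable, or the data already arrives sorted by key, it may be worth checking whether polars' sorted fast paths help here. A hedged sketch, assuming the large_plf frame above; whether .over() actually benefits depends on the polars version and should be timed rather than assumed:)

# Sketch: sorting by the key (which also marks that column as sorted) may let
# the window function take a cheaper code path. Worth timing both variants.
sorted_plf = large_plf.sort("key")
sorted_plf.lazy().select(pl.col("val").cumsum().over("key")).collect()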
Here's an attempted benchmark that runs everything in isolation:
Code
import time
import numpy as np
import pandas as pd
import polars as pl
def create_df():
    # N rows of random floats plus integer keys drawn from [114, 514).
    return pd.DataFrame({"val": np.random.randn(N), "key": np.random.randint(114, 514, (N))})

def pandas_with_index():
    df = create_df()
    df.set_index("key", inplace=True)
    # Only the cumsum itself is timed, never the setup/conversion.
    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()
    return time.perf_counter() - start

def pandas_with_cat_index():
    df = create_df()
    df.key = df.key.astype("category")
    df.set_index("key", inplace=True)
    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()
    return time.perf_counter() - start

def pandas_without_index():
    df = create_df()
    start = time.perf_counter()
    df.groupby("key")["val"].cumsum()
    return time.perf_counter() - start

def polars_with_index():
    df = create_df()
    df.set_index("key", inplace=True)
    df = pl.from_pandas(df, include_index=True)
    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))
    return time.perf_counter() - start

def polars_with_cat_index():
    df = create_df()
    df.key = df.key.astype("category")
    df.set_index("key", inplace=True)
    df = pl.from_pandas(df, include_index=True)
    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))
    return time.perf_counter() - start

def polars_without_index():
    df = create_df()
    df = pl.from_pandas(df, include_index=True)
    start = time.perf_counter()
    df.select(pl.col("val").cumsum().over("key"))
    return time.perf_counter() - start

N = 20_000_000
pkgs = "pandas", "polars"
tests = [
    "with_index",
    "with_cat_index",
    "without_index",
]

results = []
for pkg in pkgs:
    for test in tests:
        name = f"{pkg}_{test}"
        print(name)
        try:
            took = globals()[name]()
            results.append(
                {"name": name, "time": took}
            )
        except pl.ComputeError:
            # pl.from_pandas fails on the categorical-index frame, hence the skip.
            print(f"[WARNING]: skipping test `{name}`")

print(
    pl.DataFrame(results)
)
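(One caveat before reading the numbers: each variant is timed with a single pass, so the results can be noisy. A small refinement, a sketch using a hypothetical best_of helper:)

def best_of(fn, repeats=5):
    # Each call rebuilds the data and times only the cumsum; taking the
    # minimum over several runs is less sensitive to one-off noise.
    return min(fn() for _ in range(repeats))

# e.g. inside the loop above:
# took = best_of(globals()[name])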
Results
shape: (5, 2)
┌───────────────────────┬──────────┐
│ name ┆ time │
│ --- ┆ --- │
│ str ┆ f64 │
╞═══════════════════════╪══════════╡
│ pandas_with_index ┆ 0.165075 │
│ pandas_with_cat_index ┆ 0.079915 │
│ pandas_without_index ┆ 0.162487 │
│ polars_with_index ┆ 0.363737 │
│ polars_without_index ┆ 0.34091 │
└───────────────────────┴──────────┘
The original approach @ChristopherRussell posted with the category index (df.key.astype('category')) seems to be special-cased by pandas. (I couldn't convert that frame using pl.from_pandas, so the polars_with_cat_index test had to be skipped.)
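(A possible workaround for the skipped case, untimed and only a sketch: keep key as a regular column and cast it to Categorical on the polars side; integer keys need a string cast first.)

# Sketch: sidestep the categorical pandas *index* by converting first and
# then casting "key" to Categorical inside polars.
df = create_df()
plf = pl.from_pandas(df).with_columns(
    pl.col("key").cast(pl.Utf8).cast(pl.Categorical)
)
plf.select(pl.col("val").cumsum().over("key"))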
Perhaps @ritchie46 could take a look at the performance side of things when he has time.