cudf [BUG] Python groupby rolling aggregations return index inconsistent with pandas

Python groupby rolling aggregations return a single Index that corresponds to the original row position of the element, but in pandas return a MultiIndex that includes both the groupby key(s) and original row position.

This is not currently blocking any behavior with Dask + cuDF, as grouped rolling operations are blocked by #10173

import pandas as pd
import cudf
import numpy as np

df = cudf.datasets.randomdata(nrows=100000)
pdf = df.to_pandas()

print(pdf.groupby(['id']).rolling(window=3).x.mean().head())
print(df.groupby(['id']).rolling(window=3).x.mean().head())
id        
879  43605   NaN
881  3941    NaN
882  29855   NaN
884  14616   NaN
     70864   NaN
Name: x, dtype: float64
43605    <NA>
3941     <NA>
29855    <NA>
14616    <NA>
70864    <NA>
Name: x, dtype: float64

Feb 08 '22 17:02 beckernick

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Mar 10 '22 18:03 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Jun 08 '22 19:06 github-actions[bot]

Took a quick look at this issue again. I suspect that the reason this occurs is due to the __getitem__ call (.x). RollingGroupby inherits from Rolling, which has the getitem implementation.

This implementation returns a standard Rolling object:

https://github.com/rapidsai/cudf/blob/3bacb12deca71667646057d1790f0786c20c2d53/python/cudf/cudf/core/window/rolling.py#L204-L211

So the call to mean() goes down the standard Rolling._apply_agg path rather than the RollingGroupby._apply_agg. This path doesn't create a MultiIndex:

https://github.com/rapidsai/cudf/blob/3bacb12deca71667646057d1790f0786c20c2d53/python/cudf/cudf/core/window/rolling.py#L264-L284

Unlike the RollingGroupby._apply_agg path:

https://github.com/rapidsai/cudf/blob/3bacb12deca71667646057d1790f0786c20c2d53/python/cudf/cudf/core/window/rolling.py#L545-L559

Jul 13 '23 16:07 beckernick

To make this concrete:

import cudf

df = cudf.datasets.randomdata()

print("getitem call on the RollingGroupby object")
print(df.groupby(["id"]).rolling(window=3).x.mean().head())

print("getitem call before the RollingGroupby is created")
print(df.groupby(["id"]).x.rolling(window=3).mean().head())
getitem call on the RollingGroupby object
3    <NA>
8    <NA>
6    <NA>
7    <NA>
9    <NA>
Name: x, dtype: float64
getitem call before the RollingGroupby is created
id    
918  3    <NA>
963  8    <NA>
967  6    <NA>
990  7    <NA>
     9    <NA>
Name: x, dtype: float64

Jul 13 '23 17:07 beckernick

Ran into this today while using cudf.pandas.

cc @galipremsagar @vyasr , would this be a good first issue for a new contributor?

May 06 '24 18:05 beckernick

Yes, I think so!

May 13 '24 23:05 vyasr

cudf cudf copied to clipboard

[BUG] Python groupby rolling aggregations return index inconsistent with pandas

cudf
cudf copied to clipboard