cudf
cudf copied to clipboard
[BUG] Python groupby rolling aggregations return index inconsistent with pandas
Python groupby rolling aggregations return a single Index that corresponds to the original row position of the element, but in pandas return a MultiIndex that includes both the groupby key(s) and original row position.
This is not currently blocking any behavior with Dask + cuDF, as grouped rolling operations are blocked by #10173
import pandas as pd
import cudf
import numpy as np
df = cudf.datasets.randomdata(nrows=100000)
pdf = df.to_pandas()
print(pdf.groupby(['id']).rolling(window=3).x.mean().head())
print(df.groupby(['id']).rolling(window=3).x.mean().head())
id
879 43605 NaN
881 3941 NaN
882 29855 NaN
884 14616 NaN
70864 NaN
Name: x, dtype: float64
43605 <NA>
3941 <NA>
29855 <NA>
14616 <NA>
70864 <NA>
Name: x, dtype: float64
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Took a quick look at this issue again. I suspect that the reason this occurs is due to the __getitem__ call (.x). RollingGroupby inherits from Rolling, which has the getitem implementation.
This implementation returns a standard Rolling object:
https://github.com/rapidsai/cudf/blob/3bacb12deca71667646057d1790f0786c20c2d53/python/cudf/cudf/core/window/rolling.py#L204-L211
So the call to mean() goes down the standard Rolling._apply_agg path rather than the RollingGroupby._apply_agg. This path doesn't create a MultiIndex:
https://github.com/rapidsai/cudf/blob/3bacb12deca71667646057d1790f0786c20c2d53/python/cudf/cudf/core/window/rolling.py#L264-L284
Unlike the RollingGroupby._apply_agg path:
https://github.com/rapidsai/cudf/blob/3bacb12deca71667646057d1790f0786c20c2d53/python/cudf/cudf/core/window/rolling.py#L545-L559
To make this concrete:
import cudf
df = cudf.datasets.randomdata()
print("getitem call on the RollingGroupby object")
print(df.groupby(["id"]).rolling(window=3).x.mean().head())
print("getitem call before the RollingGroupby is created")
print(df.groupby(["id"]).x.rolling(window=3).mean().head())
getitem call on the RollingGroupby object
3 <NA>
8 <NA>
6 <NA>
7 <NA>
9 <NA>
Name: x, dtype: float64
getitem call before the RollingGroupby is created
id
918 3 <NA>
963 8 <NA>
967 6 <NA>
990 7 <NA>
9 <NA>
Name: x, dtype: float64
Ran into this today while using cudf.pandas.
cc @galipremsagar @vyasr , would this be a good first issue for a new contributor?
Yes, I think so!