pandas icon indicating copy to clipboard operation
pandas copied to clipboard

ENH: Support rolling over one level of a MultiIndex

Open miccoli opened this issue 5 years ago • 13 comments

  • [x] I have searched the [pandas] tag on StackOverflow for similar questions.

  • [ ] I have asked my usage related question on StackOverflow.


I had a hard time understanding how df.rolling works when df is indexed by a MultiIndex

This is an example data frame:

import pandas as pd
idx = pd.MultiIndex.from_product(
    [pd.date_range("2020-01-01", "2020-1-10"), ["a", "b"]], names=["date", "obs"],
)
df = pd.DataFrame(index=idx)
df['c1'] = range(len(df))

print(df)

which outputs

                c1
date       obs    
2020-01-01 a     0
           b     1
2020-01-02 a     2
           b     3
2020-01-03 a     4
           b     5
2020-01-04 a     6
           b     7
2020-01-05 a     8
           b     9
2020-01-06 a    10
           b    11
2020-01-07 a    12
           b    13
2020-01-08 a    14
           b    15
2020-01-09 a    16
           b    17
2020-01-10 a    18
           b    19

Now I want to apply a rolling window on the date level, keeping the obs level separate.

I tried with no success obvious and simple (least surprise) commands like

  • df.rolling("7d", index="date") or
  • df.rolling("7d", on="date")

but finally the desired result is obtained by

df_r = df.groupby(by="obs", group_keys=False).rolling(
    "7d", on=df.index.levels[0]
).mean().sort_index()

print(df_r)

which gives me the correct result:

                  c1
date       obs      
2020-01-01 a     0.0
           b     1.0
2020-01-02 a     1.0
           b     2.0
2020-01-03 a     2.0
           b     3.0
2020-01-04 a     3.0
           b     4.0
2020-01-05 a     4.0
           b     5.0
2020-01-06 a     5.0
           b     6.0
2020-01-07 a     6.0
           b     7.0
2020-01-08 a     8.0
           b     9.0
2020-01-09 a    10.0
           b    11.0
2020-01-10 a    12.0
           b    13.0

It seams to me that this should be a quite common situation, so I was wondering if there is a simpler way to obtain the same results. By the way my solution is not very robust, because there are hidden assumptions on how the objects returned by groupby are indexed, which do not necessarily hold for a generic data frame.

Moreover the doc of the on parameter in rolling was almost incomprehensible to me: I'm still wondering if my usage rolling( "7d", on=df.index.levels[0]) is the intended one or not.

miccoli avatar Jun 08 '20 12:06 miccoli

Probably, there is a related issue #34617. By the ways, if data frame is indexed with integer index then one can use level keyword of .rolling(). This example is not consistent and just shows API features.

import pandas as pd
idx = pd.MultiIndex.from_product(
    [range(7), ["a", "b"]], names=["date", "obs"],
)
df = pd.DataFrame(index=idx)
df['c1'] = range(len(df))

print(df)

df_r = df.groupby(by="obs", group_keys=False).rolling(
    7, level=1
).mean().sort_index()

print(df_r)

daskol avatar Jun 08 '20 15:06 daskol

In @daskol example the level keyword seems a noop, as any other arg passed to rolling.

from pandas.testing import assert_frame_equal

assert_frame_equal(
    df.groupby(by="obs", group_keys=False).rolling(7).mean(),
    df.groupby(by="obs", group_keys=False).rolling(7, level=1).mean(),
)
assert_frame_equal(
    df.groupby(by="obs", group_keys=False).rolling(7).mean(),
    df.groupby(by="obs", group_keys=False).rolling(7, foo=8, bar='8').mean(),
)

I'm still very confused.

miccoli avatar Jun 08 '20 20:06 miccoli

A brief update. After updating pandas to v1.1.1 the above walk around fails with SEGFAULT!

>>> df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0])
RollingGroupby [window=604800000000000,min_periods=1,center=False,win_type=freq,axis=0,on=DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', name='date', freq=None)]

but

>>> df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]).mean()
Segmentation fault: 11

It seems to me that this issue should be labeled as BUG

pandas version info
INSTALLED VERSIONS
------------------
commit           : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python           : 3.8.2.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.1.1
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.2
setuptools       : 49.6.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.17.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : 0.8.0
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : 1.0.1
pytables         : None
pyxlsb           : None
s3fs             : 0.5.0
scipy            : 1.5.2
sqlalchemy       : None
tables           : 3.6.1
tabulate         : 0.8.7
xarray           : 0.16.0
xlrd             : 1.2.0
xlwt             : None
numba            : 0.51.1

miccoli avatar Sep 07 '20 07:09 miccoli

SEGFAULT possibly linked to #36018

miccoli avatar Sep 07 '20 07:09 miccoli

hi, I wonder if there were any update on this issue, in my situation, I am also unable to do rolling with non-integer windows when there is multi-index in the dataframe. With the simple example that was used in this thread, I would get different error like the following:

df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]) Out[21]: RollingGroupby [window=7d,min_periods=1,center=False,win_type=freq,axis=0,on=DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09', '2020-01-10'], dtype='datetime64[ns]', name='date', freq='D'),method=single]

df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]).mean() Traceback (most recent call last):

File "C:\Users\c_yy\AppData\Local\Temp/ipykernel_15476/3165508401.py", line 1, in df.groupby(by="obs", group_keys=False).rolling("7d", on=df.index.levels[0]).mean()

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 1855, in mean return super().mean(*args, engine=engine, engine_kwargs=engine_kwargs, **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 1309, in mean return self._apply(window_func, name="mean", **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 594, in _apply result = super()._apply(

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 545, in _apply return self._apply_blockwise(homogeneous_func, name)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 441, in _apply_blockwise new_mgr = mgr.apply(hfunc, ignore_failures=True)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 325, in apply applied = b.apply(f, **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\internals\blocks.py", line 381, in apply result = func(self.values, **kwargs)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 431, in hfunc res_values = homogeneous_func(values)

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 535, in homogeneous_func result = np.apply_along_axis(calc, self.axis, values)

File "<array_function internals>", line 5, in apply_along_axis

File "C:\Users\c_yy\anaconda3\lib\site-packages\numpy\lib\shape_base.py", line 379, in apply_along_axis res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\rolling.py", line 521, in calc start, end = window_indexer.get_window_bounds(

File "C:\Users\c_yy\anaconda3\lib\site-packages\pandas\core\window\indexers.py", line 333, in get_window_bounds index_array = self.index_array.take(ensure_platform_int(indices))

IndexError: index 10 is out of bounds for axis 0 with size 10

When using my own real data set (see below), I get different error:

rxx.loc['2015-12-01':'2017-12-29',:] TotRet SpecRet ... volumedays IsTrading date order_book_id ...
2015-12-01 000155.XSHE 0.014580 0.007206 ... 53.0 1.0 2015-12-02 000155.XSHE -0.004785 0.002132 ... 53.0 1.0 2015-12-03 000155.XSHE 0.034894 0.006537 ... 53.0 1.0 2015-12-04 000155.XSHE 0.050001 0.021692 ... 53.0 1.0 2015-12-07 000155.XSHE 0.049832 0.026999 ... 53.0 1.0 ... ... ... ... ... 2017-12-25 000155.XSHE -0.049707 -0.053313 ... 5.0 1.0 2017-12-26 000155.XSHE -0.023771 -0.013894 ... 6.0 1.0 2017-12-27 000155.XSHE 0.006497 0.014989 ... 7.0 1.0 2017-12-28 000155.XSHE -0.030648 -0.026356 ... 8.0 1.0 2017-12-29 000155.XSHE 0.003335 -0.005979 ... 9.0 1.0

[104 rows x 112 columns]

rxx.groupby(level='order_book_id',group_keys=False).rolling('63d',on=rxx.index.levels[0]).mean() ***** ValueError: Length of values (567) does not match length of index (150)**

pd.show_versions()

INSTALLED VERSIONS

commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5 python : 3.8.8.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.22000 machine : AMD64 processor : AMD64 Family 25 Model 33 Stepping 0, AuthenticAMD byteorder : little LC_ALL : None LANG : en LOCALE : English_United States.1252

pandas : 1.3.4 numpy : 1.21.2 pytz : 2021.3 dateutil : 2.8.2 pip : 21.2.2 setuptools : 58.0.4 Cython : 0.29.24 pytest : 6.2.4 hypothesis : None sphinx : 4.2.0 blosc : None feather : None xlsxwriter : 3.0.1 lxml.etree : 4.6.3 html5lib : 1.1 pymysql : 1.0.2 psycopg2 : None jinja2 : 2.11.3 IPython : 7.29.0 pandas_datareader: None bs4 : 4.10.0 bottleneck : 1.3.2 fsspec : 2021.08.1 fastparquet : None gcsfs : None matplotlib : 3.4.3 numexpr : 2.7.3 odfpy : None openpyxl : 3.0.9 pandas_gbq : None pyarrow : 6.0.0 pyxlsb : None s3fs : None scipy : 1.7.1 sqlalchemy : 1.4.22 tables : 3.6.1 tabulate : 0.8.9 xarray : None xlrd : 2.0.1 xlwt : 1.3.0 numba : 0.53.1

drlakecoder avatar Nov 23 '21 04:11 drlakecoder

I have to concur. This seems like a useful thing to have. Or if there is an alternative way to handle this use case it could be described in docs.

jpodivin avatar Jul 30 '22 11:07 jpodivin

I've run into this issue as well.

The documentation for df.rolling() states on= should be: "a column label or Index level on which to calculate the rolling window". My expectation was that I could pass the name of a multiindex level and .rolling() would group rows by unique index level values. This all might be better handled by .groupby(), but I'd love to see more integrated multiindex handling where convenient.

d-otto avatar Dec 06 '22 18:12 d-otto

I have a similar issue, which again involves performing rolling actions on a dataframe with duplicate datetimes. I've gotten a complicated workaround using groupby and some other hacks, but a more native support would be greatly appreciated.

pollackscience avatar Dec 08 '22 14:12 pollackscience

The groupby transform method has this exact functionality. The docs explain it pretty well. Although I agree that the windowing API should definitely natively have this functionality.

Kalaweksh avatar Jan 20 '23 19:01 Kalaweksh

This is the solution wanted https://www.appsloveworld.com/pandas/100/251/pandas-dataframe-rolling-window-with-multiindex

claudia-buss avatar Feb 09 '23 15:02 claudia-buss

The doc still mentions this as funtctionality but it doesn't work:

src: Rolling doc

For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.

meta-ks avatar Nov 09 '23 04:11 meta-ks

The doc still mentions this as funtctionality but it doesn't work:

src: Rolling doc

For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.

This is still an issue

jdkworld avatar Jan 04 '24 16:01 jdkworld

The doc still mentions this as functionality but it doesn't work:

src: Rolling doc

For a DataFrame, a column label or **Index level** on which to calculate the rolling window, rather than the DataFrame’s index.

This is still an issue however there is a hacky(?) work around.

Consider the following multi indexed series

 test.info()
<class 'pandas.core.series.Series'>
MultiIndex: 5442161 entries, ('1A1', Timestamp('2024-04-07 17:00:00.431000+0000', tz='UTC')) to ('3B9', Timestamp('2024-04-> 28 16:59:58.468000+0000', tz='UTC'))
Series name: supplytemp
Non-Null Count    Dtype  
--------------    -----  
5442159 non-null  float64
dtypes: float64(1)

test.index.names
FrozenList(['site', 'TimeStamp'])

Test['rolling_avg_supply_temp'] = Test.groupby(level='site').rolling('1h',on=Test.index.get_level_values(level='TimeStamp'))['supplytemp'].mean()

While this is a tad more verbose then pandas just acknowledging the closest index this is a workable replacement. The descriptive statistics for the series line up with this method and the non indexed method.

zdwhite avatar May 09 '24 18:05 zdwhite