datatable
datatable copied to clipboard
Rolling aggregate support based on windows within a DT
I'd like to see the ability to get different rolling aggregations of my dataset based on order and grouping columns. Pandas has robust support for these type of actions. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html.
It would also be nice to easily assign these to new columns without having to make nested for loops.
Below is a function I use in Pandas to achieve this functionality
def create_roll_columns (x,g_c,roll,roll_cols,roll_types):
#roll types [sum,mean,min]
#g_c needs to be a list
#lag will set the number of lag cols
for i in roll_cols:
for j in range(2,roll+1):
for aggregation in roll_types:
if aggregation == 'sum':
nm = '{0}_sum_{1}'.format(i, str(j))
x[nm] = x.groupby(g_c)[i].rolling(j, min_periods=1).sum().reset_index(0,drop=True)
if aggregation == 'mean':
nm = '{0}_mean_{1}'.format(i, str(j))
x[nm] = x.groupby(g_c)[i].rolling(j, min_periods=1).mean().reset_index(0,drop=True)
if aggregation == 'min':
nm = '{0}_min_{1}'.format(i, str(j))
x[nm] = x.groupby(g_c)[i].rolling(j, min_periods=1).min().reset_index(0,drop=True)
return(x)
Example code could be something like - df(select=mean(f.x), group="y",window = 3, align='r')
-group = windowing columns for the select statement
-window = many rows is looks back or forward in a given group
- Align is right, left, center and coordinates where the window happens relative to the given record.
I'm going to remove lead/lag from here, since it is a separate feature, and deserves its own issue.
The rolling functions are an important feature to add; there is a similar work being done in R data.table as well: https://github.com/Rdatatable/data.table/issues/2778
The R data.table implementation (as of now rollmean) was designed to be R agnostic, so we should be able to reduce maintanence effort by reusing that code.
@st-pasha Do you want me to raise a ticket about lead/lag?
@atroiano Yes, please do. I guess the function name will be shift(n)
, in concord with R data.table, and also because it nicely allows both positive and negative n
s.
In R we use zoo or own implementation of rolling function with next syntax:
DT[, ref_price := roll(price, fill = T, width = 16, fun = max), by = .(banner, region, sku)]
Is there way to implement with similar syntax?