Add `apply` func to groupby_rolling, groupby_dynamic for a dataframe context

Open yutiansut opened this issue 2 years ago • 3 comments

Describe your feature request

Currently, groupby_rolling and groupby_dynamic can only aggregate in a pl.col expression context, but sometimes we need to apply a function over the whole dataframe of each window for analysis. We can of course use pl.apply([columns], func(pl.DataFrame(columns))) to reconstruct that. I remember that in a very old version there was an API like df.rolling(windows).apply. Can this be added to the current API?

yutiansut avatar May 08 '22 12:05 yutiansut

expected:

df.groupby_rolling(.....).apply( 

df.groupby_dynamic(.....).apply( 

Maybe an __iter__ for the groupby class?

yutiansut avatar May 08 '22 12:05 yutiansut

Do you mean the same functionality as the ordinary groupby has?

This is what it does:

    def apply(
        self,
        func: Callable[[Any], Any],
        return_dtype: Optional[Type[DataType]] = None,
    ) -> DF:
        """
        Apply a function over the groups.
        """
        df = self.agg_list()
        if self.selection is None:
            raise TypeError(
                "apply not available for Groupby.select_all(). Use select() instead."
            )
        for name in self.selection:
            s = df.drop_in_place(name + "_agg_list").apply(func, return_dtype)
            s.rename(name, in_place=True)
            df[name] = s

        return df

ritchie46 avatar May 11 '22 11:05 ritchie46

+1

Yes, I believe this is what I'm after also.

I want to do the following steps:

  1. Apply a groupby_rolling() to a data frame
  2. then, for each group/window produced by the groupby_rolling() I want to groupby()/groupby_dynamic() that
  3. so that I can apply various aggregation functions to those sub-groups and return as new columns for each row

To put that in more of a plain-English use case:

  • Suppose a data frame with a timeseries and float columns
  • For each row, I want to take a rolling window of the past 15min of data
  • For each rolling window, I want to sub-group that 15min into chunks of 5min
  • For each of those 5min chunks I want to calculate (for the float column):
      • the mean of that chunk, returned as new columns (think col names like "5min_ago_mean", "10min_ago_mean"; or this could be one column containing a list etc; same-same)
      • the diff between the first and last value in each chunk, returned as new columns (e.g. "5_min_ago_diff")

My mental model is that, to do the sub-group aggregation, I 'want' to treat each window from the groupby_rolling() like a dataframe (as opposed to single columns within a normal aggregation context). I need to be able to sub-group based on the timestamps within the rolling window before I can apply any aggregations to the float column.

Please let me know if you want a more comprehensive example or if there's another approach/API method I should be looking at to achieve this.
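
For reference, the expected result of the use case above can be pinned down in plain Python, independent of any Polars API. This is only a sketch of the semantics under some assumptions not stated in the thread: rows are already sorted by timestamp, each chunk is the half-open interval (end - 5min, end], and the chunk ending at the current row is the one labelled "5min_ago". The function name `rolling_chunk_stats` and the toy data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def rolling_chunk_stats(rows, window=timedelta(minutes=15), chunk=timedelta(minutes=5)):
    """For each (timestamp, value) row, split the trailing `window` into
    `chunk`-sized pieces and compute the mean and first-to-last diff of each.

    Assumes `rows` is sorted by timestamp (an assumption of this sketch).
    """
    n_chunks = window // chunk
    chunk_min = int(chunk.total_seconds() // 60)
    out = []
    for ts, _value in rows:
        stats = {"ts": ts}
        for k in range(n_chunks):
            # Chunk k covers (hi - chunk, hi], with the newest chunk first.
            hi = ts - k * chunk
            lo = hi - chunk
            vals = [v for t, v in rows if lo < t <= hi]
            label = f"{(k + 1) * chunk_min}min_ago"
            stats[f"{label}_mean"] = mean(vals) if vals else None
            stats[f"{label}_diff"] = vals[-1] - vals[0] if vals else None
        out.append(stats)
    return out


# Toy data: one reading per minute, value equal to the minute index.
start = datetime(2023, 1, 2, 11, 0)
rows = [(start + timedelta(minutes=i), float(i)) for i in range(16)]
result = rolling_chunk_stats(rows)
```

Chunks with no data are reported as None here; whether they should be null, zero, or dropped is a design choice the thread leaves open.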

Dermotholmes avatar Jan 02 '23 11:01 Dermotholmes

This exists as map_groups. There is also an __iter__ method for all group by types.

stinodego avatar Sep 08 '23 15:09 stinodego