polars icon indicating copy to clipboard operation
polars copied to clipboard

window expression not allowed in aggregation: allow chained `.over()` and `.rolling()` aggregations

Open raoulj opened this issue 1 year ago • 11 comments

Description

Consider the following DataFrame:

dates = [
    "2020-01-01 13:45:48",
    "2020-01-01 16:42:13",
    "2020-01-01 16:45:09",
    "2020-01-02 18:12:48",
    "2020-01-03 19:45:32",
    "2020-01-08 23:16:43",
]
df = pl.DataFrame({"dt": dates, 'train_line': ['a', 'b', 'a', 'a', 'b', 'a'], "num_passengers": [3, 7, 5, 9, 2, 1]}).with_columns(
    pl.col("dt").str.strptime(pl.Datetime).set_sorted()
)
print(df)
shape: (6, 3)
┌─────────────────────┬────────────┬────────────────┐
│ departure_time      ┆ train_line ┆ num_passengers │
│ ---                 ┆ ---        ┆ ---            │
│ datetime[μs]        ┆ str        ┆ i64            │
╞═════════════════════╪════════════╪════════════════╡
│ 2020-01-01 13:45:48 ┆ a          ┆ 3              │
│ 2020-01-01 16:42:13 ┆ b          ┆ 7              │
│ 2020-01-01 16:45:09 ┆ a          ┆ 5              │
│ 2020-01-02 18:12:48 ┆ a          ┆ 9              │
│ 2020-01-03 19:45:32 ┆ b          ┆ 2              │
│ 2020-01-08 23:16:43 ┆ a          ┆ 1              │
└─────────────────────┴────────────┴────────────────┘

If I want to get the rolling average, at each departure, of the last 2 days worth of departures for each train line. Here's how I would think I would do that:

df.with_columns(pl.col('num_passengers').mean().over('train_line').rolling(index_column='departure_time', period='2d'))

But, doing this, I currently get InvalidOperationError: window expression not allowed in aggregation

raoulj avatar Oct 26 '23 14:10 raoulj

There is a dedicated Expr.rolling_mean() but there is an issue with it currently: https://github.com/pola-rs/polars/issues/11225

df.rolling(by="train_line", index_column="dt", period="2d").agg(
   pl.col("num_passengers").mean()
)

# shape: (6, 3)
# ┌────────────┬─────────────────────┬────────────────┐
# │ train_line ┆ dt                  ┆ num_passengers │
# │ ---        ┆ ---                 ┆ ---            │
# │ str        ┆ datetime[μs]        ┆ f64            │
# ╞════════════╪═════════════════════╪════════════════╡
# │ a          ┆ 2020-01-01 13:45:48 ┆ 3.0            │
# │ a          ┆ 2020-01-01 16:45:09 ┆ 4.0            │
# │ a          ┆ 2020-01-02 18:12:48 ┆ 5.666667       │
# │ a          ┆ 2020-01-08 23:16:43 ┆ 1.0            │
# │ b          ┆ 2020-01-01 16:42:13 ┆ 7.0            │
# │ b          ┆ 2020-01-03 19:45:32 ┆ 2.0            │
# └────────────┴─────────────────────┴────────────────┘

cmdlineluser avatar Oct 26 '23 15:10 cmdlineluser

@cmdlineluser appreciate the pointer! I may have oversimplified my example. The motivating case I encountered is a dynamic threshold with an .all() aggregation. In the spirit on the provided example:

Say we have:

┌─────────────────────┬────────────┬────────────────┬───────────┐
│ departure_time      ┆ train_line ┆ num_passengers ┆ threshold │
│ ---                 ┆ ---        ┆ ---            ┆ ---       │
│ datetime[μs]        ┆ str        ┆ i64            ┆ i64       │
╞═════════════════════╪════════════╪════════════════╪═══════════╡
│ 2020-01-01 13:45:48 ┆ a          ┆ 3              ┆ 2         │
│ 2020-01-01 16:42:13 ┆ b          ┆ 7              ┆ 4         │
│ 2020-01-01 16:45:09 ┆ a          ┆ 5              ┆ 1         │
│ 2020-01-02 18:12:48 ┆ a          ┆ 9              ┆ 3         │
│ 2020-01-03 19:45:32 ┆ b          ┆ 2              ┆ 2         │
│ 2020-01-08 23:16:43 ┆ a          ┆ 1              ┆ 4         │
└─────────────────────┴────────────┴────────────────┴───────────┘

How would I filter to train departures that had more than the threshold number of riders for every ride in the last 2 days? I can't use rolling_min because the threshold changes every departure.

raoulj avatar Oct 26 '23 15:10 raoulj

That would be especially helpful if we can use expressions like that

pl.col().any_method().rolling().over()

xyk2000 avatar Oct 26 '23 16:10 xyk2000

@xyk2000 messaging here to not clutter that other thread

I'm not sure if https://github.com/pola-rs/polars/issues/12049 would help this? That's talking about breaking up the rolling api, which is different than allowing nested window functions like this is.

raoulj avatar Oct 26 '23 18:10 raoulj

Do you mean something like this @raoulj ?

df.rolling("departure_time", by="train_line", period="2d").agg(
   pl.exclude("departure_time"),
   all = (pl.col("num_passengers") > pl.col("threshold")).all()
)

# shape: (6, 5)
# ┌────────────┬─────────────────────┬────────────────┬───────────┬───────┐
# │ train_line ┆ departure_time      ┆ num_passengers ┆ threshold ┆ all   │
# │ ---        ┆ ---                 ┆ ---            ┆ ---       ┆ ---   │
# │ str        ┆ datetime[μs]        ┆ list[i64]      ┆ list[i64] ┆ bool  │
# ╞════════════╪═════════════════════╪════════════════╪═══════════╪═══════╡
# │ a          ┆ 2020-01-01 13:45:48 ┆ [3]            ┆ [2]       ┆ true  │
# │ a          ┆ 2020-01-01 16:45:09 ┆ [3, 5]         ┆ [2, 1]    ┆ true  │
# │ a          ┆ 2020-01-02 18:12:48 ┆ [3, 5, 9]      ┆ [2, 1, 3] ┆ true  │
# │ a          ┆ 2020-01-08 23:16:43 ┆ [1]            ┆ [4]       ┆ false │
# │ b          ┆ 2020-01-01 16:42:13 ┆ [7]            ┆ [4]       ┆ true  │
# │ b          ┆ 2020-01-03 19:45:32 ┆ [2]            ┆ [2]       ┆ false │
# └────────────┴─────────────────────┴────────────────┴───────────┴───────┘

Apologies if I've misunderstood.

cmdlineluser avatar Oct 26 '23 22:10 cmdlineluser

No that is exactly what I wanted! Didn't know about DataFrame.rolling(). Thank you for pointing me in the right direction.

Is this by= property available on Expr.rolling()? I ask because the property is currently mentioned in the check_sorted property. Looking at the src (which I am looking at for the first time) it looks like there's no by implementation even though check_sorted is an arg.

raoulj avatar Oct 26 '23 22:10 raoulj

https://github.com/pola-rs/polars/pull/11445#discussion_r1342120726

No.. because the by argument would need to reorder the other columns or this output. For the by argument case we need groupby_rolling (soon the rolling) context.

(DataFrame.group_by_rolling() was recently renamed to DataFrame.rolling())

cmdlineluser avatar Oct 26 '23 22:10 cmdlineluser

Okay. So the Expr.rolling() docs are temporarily incorrect while the new rolling context is developed. Thanks for the context.

I do like the .rolling().over() syntax. Unsure if that's in scope for the rolling context mentioned in the linked PR. Is there a public facing roadmap anywhere where I could understand this effort?

raoulj avatar Oct 27 '23 17:10 raoulj

Not sure if related, but today I was trying do complex opeation:

pl.col('value').pct_change().over('entityId').rank().over('date')

On a table a bit like:

date entityId value
2020-01-01 "K" 7
2020-01-02 "K" 8
2020-01-03 "K" 9
2020-01-01 "G" 5
2020-01-02 "G" 12
2020-01-03 "G" 7

The expression is invalid due to

InvalidOperationError: window expression not allowed in aggregation

But not clear how to do the same at the pl.Expr level.

mkleinbort-ic avatar Nov 22 '23 07:11 mkleinbort-ic

I've run into this issue as well, as a workaround, what I do is ensure my express that does the rolling happens in an earlier with_columns and instead of directly depending on that expression, I reference the output of the first rolling express by name/alias.

It's a bit of a footgun, would love to see this limitation removed.

kszlim avatar Jan 11 '24 08:01 kszlim

I would like to add some traffic to this feature request. It would be very convenient to have possibility either for pl.Expr().rolling().over() or group_by parameter within the pl.Expr().rolling() function Currently, the pl.DataFrame.rolling() is the only way to get the desired output with the group_by parameter for ops such as n_unique. This, however, forces user to then join the result(s) back to the original frame in case a new column is desired (and possibly raises the need to keep some form of index on which to join back in specific cases). The inconvenience is even more prevalent in cases when one would want to create multiple rolling-window-based features either for different groupings or for different time settings of the rolling window.

t-ded avatar May 16 '24 12:05 t-ded