window expression not allowed in aggregation: allow chained `.over()` and `.rolling()` aggregations
Description
Consider the following DataFrame:
import polars as pl

dates = [
    "2020-01-01 13:45:48",
    "2020-01-01 16:42:13",
    "2020-01-01 16:45:09",
    "2020-01-02 18:12:48",
    "2020-01-03 19:45:32",
    "2020-01-08 23:16:43",
]
df = pl.DataFrame(
    {
        "departure_time": dates,
        "train_line": ["a", "b", "a", "a", "b", "a"],
        "num_passengers": [3, 7, 5, 9, 2, 1],
    }
).with_columns(
    pl.col("departure_time").str.strptime(pl.Datetime).set_sorted()
)
print(df)
shape: (6, 3)
┌─────────────────────┬────────────┬────────────────┐
│ departure_time      ┆ train_line ┆ num_passengers │
│ ---                 ┆ ---        ┆ ---            │
│ datetime[μs]        ┆ str        ┆ i64            │
╞═════════════════════╪════════════╪════════════════╡
│ 2020-01-01 13:45:48 ┆ a          ┆ 3              │
│ 2020-01-01 16:42:13 ┆ b          ┆ 7              │
│ 2020-01-01 16:45:09 ┆ a          ┆ 5              │
│ 2020-01-02 18:12:48 ┆ a          ┆ 9              │
│ 2020-01-03 19:45:32 ┆ b          ┆ 2              │
│ 2020-01-08 23:16:43 ┆ a          ┆ 1              │
└─────────────────────┴────────────┴────────────────┘
Say I want to get the rolling average, at each departure, of the last 2 days' worth of departures for each train line. Here's how I would expect to do that:
df.with_columns(pl.col('num_passengers').mean().over('train_line').rolling(index_column='departure_time', period='2d'))
But doing this, I currently get InvalidOperationError: window expression not allowed in aggregation.
There is a dedicated Expr.rolling_mean(), but it currently has an issue: https://github.com/pola-rs/polars/issues/11225
df.rolling(index_column="departure_time", by="train_line", period="2d").agg(
    pl.col("num_passengers").mean()
)
# shape: (6, 3)
# ┌────────────┬─────────────────────┬────────────────┐
# │ train_line ┆ departure_time      ┆ num_passengers │
# │ ---        ┆ ---                 ┆ ---            │
# │ str        ┆ datetime[μs]        ┆ f64            │
# ╞════════════╪═════════════════════╪════════════════╡
# │ a          ┆ 2020-01-01 13:45:48 ┆ 3.0            │
# │ a          ┆ 2020-01-01 16:45:09 ┆ 4.0            │
# │ a          ┆ 2020-01-02 18:12:48 ┆ 5.666667       │
# │ a          ┆ 2020-01-08 23:16:43 ┆ 1.0            │
# │ b          ┆ 2020-01-01 16:42:13 ┆ 7.0            │
# │ b          ┆ 2020-01-03 19:45:32 ┆ 2.0            │
# └────────────┴─────────────────────┴────────────────┘
@cmdlineluser appreciate the pointer! I may have oversimplified my example. The motivating case I encountered is a dynamic threshold with an .all() aggregation. In the spirit of the provided example:
Say we have:
┌─────────────────────┬────────────┬────────────────┬───────────┐
│ departure_time      ┆ train_line ┆ num_passengers ┆ threshold │
│ ---                 ┆ ---        ┆ ---            ┆ ---       │
│ datetime[μs]        ┆ str        ┆ i64            ┆ i64       │
╞═════════════════════╪════════════╪════════════════╪═══════════╡
│ 2020-01-01 13:45:48 ┆ a          ┆ 3              ┆ 2         │
│ 2020-01-01 16:42:13 ┆ b          ┆ 7              ┆ 4         │
│ 2020-01-01 16:45:09 ┆ a          ┆ 5              ┆ 1         │
│ 2020-01-02 18:12:48 ┆ a          ┆ 9              ┆ 3         │
│ 2020-01-03 19:45:32 ┆ b          ┆ 2              ┆ 2         │
│ 2020-01-08 23:16:43 ┆ a          ┆ 1              ┆ 4         │
└─────────────────────┴────────────┴────────────────┴───────────┘
How would I filter to train departures that had more than the threshold number of riders for every ride in the last 2 days? I can't use rolling_min because the threshold changes with every departure.
It would be especially helpful if we could use expressions like:
pl.col().any_method().rolling().over()
@xyk2000 messaging here so as not to clutter that other thread.
I'm not sure if https://github.com/pola-rs/polars/issues/12049 would help this? That issue is about breaking up the rolling API, which is different from allowing nested window functions like this.
Do you mean something like this, @raoulj?
df.rolling("departure_time", by="train_line", period="2d").agg(
pl.exclude("departure_time"),
all = (pl.col("num_passengers") > pl.col("threshold")).all()
)
# shape: (6, 5)
# ┌────────────┬─────────────────────┬────────────────┬───────────┬───────┐
# │ train_line ┆ departure_time ┆ num_passengers ┆ threshold ┆ all │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ datetime[μs] ┆ list[i64] ┆ list[i64] ┆ bool │
# ╞════════════╪═════════════════════╪════════════════╪═══════════╪═══════╡
# │ a ┆ 2020-01-01 13:45:48 ┆ [3] ┆ [2] ┆ true │
# │ a ┆ 2020-01-01 16:45:09 ┆ [3, 5] ┆ [2, 1] ┆ true │
# │ a ┆ 2020-01-02 18:12:48 ┆ [3, 5, 9] ┆ [2, 1, 3] ┆ true │
# │ a ┆ 2020-01-08 23:16:43 ┆ [1] ┆ [4] ┆ false │
# │ b ┆ 2020-01-01 16:42:13 ┆ [7] ┆ [4] ┆ true │
# │ b ┆ 2020-01-03 19:45:32 ┆ [2] ┆ [2] ┆ false │
# └────────────┴─────────────────────┴────────────────┴───────────┴───────┘
Apologies if I've misunderstood.
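For completeness, the filter asked about above could then reuse the all column from this output; a minimal sketch, assuming the same df with the threshold column:

flagged = df.rolling("departure_time", by="train_line", period="2d").agg(
    pl.exclude("departure_time"),
    all=(pl.col("num_passengers") > pl.col("threshold")).all(),
)
# keep only departures where every ride in the 2-day window beat its threshold
flagged.filter(pl.col("all"))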
No, that is exactly what I wanted! I didn't know about DataFrame.rolling(). Thank you for pointing me in the right direction.
Is this by= parameter available on Expr.rolling()? I ask because by is currently mentioned in the check_sorted docs. Looking at the src (which I am looking at for the first time), it looks like there's no by implementation even though check_sorted is an arg.
https://github.com/pola-rs/polars/pull/11445#discussion_r1342120726
No, because the by argument would need to reorder the other columns of this output. For the by argument case we need the groupby_rolling (soon the rolling) context.
(DataFrame.group_by_rolling() was recently renamed to DataFrame.rolling().)
Okay, so the Expr.rolling() docs are temporarily incorrect while the new rolling context is developed. Thanks for the context.
I do like the .rolling().over() syntax. I'm unsure whether that's in scope for the rolling context mentioned in the linked PR. Is there a public-facing roadmap anywhere where I could follow this effort?
Not sure if related, but today I was trying to do a complex operation:
pl.col('value').pct_change().over('entityId').rank().over('date')
On a table a bit like:

| date       | entityId | value |
|------------|----------|-------|
| 2020-01-01 | "K"      | 7     |
| 2020-01-02 | "K"      | 8     |
| 2020-01-03 | "K"      | 9     |
| 2020-01-01 | "G"      | 5     |
| 2020-01-02 | "G"      | 12    |
| 2020-01-03 | "G"      | 7     |
The expression is invalid due to InvalidOperationError: window expression not allowed in aggregation, but it's not clear how to do the same at the pl.Expr level.
I've run into this issue as well. As a workaround, I make sure the expression that does the rolling happens in an earlier with_columns, and instead of depending on that expression directly, I reference the output of the first rolling expression by name/alias.
It's a bit of a footgun; I would love to see this limitation removed.
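Applied to the pct_change/rank example above, that workaround might look like this (a minimal sketch; the pct and rank_by_date aliases are illustrative):

import polars as pl

df = pl.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"] * 2,
    "entityId": ["K", "K", "K", "G", "G", "G"],
    "value": [7, 8, 9, 5, 12, 7],
})

result = (
    df
    # step 1: materialize the inner window expression under an alias
    .with_columns(pl.col("value").pct_change().over("entityId").alias("pct"))
    # step 2: refer to the result by name in the outer window expression
    .with_columns(pl.col("pct").rank().over("date").alias("rank_by_date"))
)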
I would like to add some traffic to this feature request. It would be very convenient to have the possibility of either pl.Expr().rolling().over() or a group_by parameter within the pl.Expr().rolling() function.
Currently, pl.DataFrame.rolling() is the only way to get the desired output with the group_by parameter for ops such as n_unique. This, however, forces the user to join the result(s) back to the original frame whenever a new column is desired (and possibly raises the need to keep some form of index on which to join back in specific cases). The inconvenience is even more pronounced when one wants to create multiple rolling-window-based features, either for different groupings or for different time settings of the rolling window. A sketch of the join-back pattern follows below.
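For illustration, the join-back pattern might look like this (a sketch; depending on your polars version the grouping parameter is spelled by or group_by, and a row index may be needed when the key/time pairs are not unique):

import polars as pl
from datetime import datetime

df = pl.DataFrame({
    "departure_time": [
        datetime(2020, 1, 1, 13), datetime(2020, 1, 1, 16),
        datetime(2020, 1, 2, 18), datetime(2020, 1, 3, 19),
    ],
    "train_line": ["a", "a", "a", "b"],
    "num_passengers": [3, 5, 9, 2],
}).sort("departure_time")

# compute the rolling feature in a separate frame...
rolled = df.rolling("departure_time", by="train_line", period="2d").agg(
    pl.col("num_passengers").n_unique().alias("n_unique_2d")
)

# ...then join it back on the group key plus the index column
out = df.join(rolled, on=["train_line", "departure_time"], how="left")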