feat: add required `keep` argument to `Expr.mode`
We currently have Expr.mode and Series.mode, like Polars does. What's not great about them is that they don't aggregate
I'd like to suggest a departure from Polars here. We introduce Expr.mode_any and Series.mode_any, which actually aggregate, but make no guarantees about which mode will be returned. For example, if a column contains values [1, 2, 2, 3, 3, 4], then it may return 2 or it may return 3.
EDIT
Concrete suggestion:
We could add a (required!, as this would be a break from the Polars API) argument keep to it, which could be:
- 'first'
- 'last'
- 'all': keep all values. Only supported for eager backends. This would not return an aggregation, so applying
nw.col('a', 'b').mode()may raiseShapeErrorif 'a' and 'b' have a different number of modes. But, I think that's OK - 'any'
'any' could be supported by all backends, the others only by eager ones
Then, for all keep values except 'all', mode would always return an aggregation, and this would solve a few problems
One annoyance is that there doesn't seem to be a groupby-mode function in pandas 😩 🤦 But, there is a nice workaround in https://github.com/pandas-dev/pandas/issues/19254 which we might be able to use?
Alternatively, per https://github.com/narwhals-dev/narwhals/discussions/1657, if we disallow length-changing-non-aggregating expressions which aren't followed by aggregations, then we might be able to get away with just Expr.mode, and just require that it be followed by an aggregation, .e. nw.col('a').mode().first(order_by='id'). We could even introduce Expr.any_value and allow nw.col('a').mode().any_value()
Returning a list would definitly be consistent with polars mode behavior in a group by context. We might always return a list type (and enhance a bit the list namespace with .first, .last, .__getitem__)
I'm not sold on returning lists here for expressions which don't aggregate, and I think Polars is a bit internally inconsistent here. .list is also a bit problematic for some backends, like pandas non-pyarrow-backed, whereas with nw.col('a').mode().any_value() / nw.col('a').mode().first(order_by='id') we can bypass the list and have a return value which works well for all implementations
https://github.com/narwhals-dev/narwhals/issues/2273#issuecomment-2747898004
A few months later, I think I'm leaning towards a require keep argument to mode, so that if keep='any' then mode always aggregates, and that solves a few issues
A few months later, I think I'm leaning towards a require
keepargument tomode, so that ifkeep='any'thenmodealways aggregates, and that solves a few issues
I am in favor of it but, ideally, i would like to see a solution when keep='all' (for eager ones) and columns have a different number of modes. what if in that case we give a struct back?
@MarcoGorelli have you thought about adding Expr.(first|last) and requiring that after instead?
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.first.html#polars.Expr.first
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.last.html#polars.Expr.last
That way we don't diverge from polars, and I imagine we could use that in other places to resolve ambiguity?
Edit
Turns out first|last aren't stable enough, but requiring some aggregation - like min|max could work?
import polars as pl
from polars import col
df = pl.DataFrame({"a": [1, 2, 2, 3, 3, 4], "b": [1, 2, 3, 4, 5, 6]})
mode = col("a").mode()
>>> df.select(
mode.first().alias("first"),
mode.last().alias("last"),
mode.min().alias("min"),
mode.max().alias("max"),
)
shape: (1, 4)
┌───────┬──────┬─────┬─────┐
│ first ┆ last ┆ min ┆ max │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪══════╪═════╪═════╡
│ 2 ┆ 2 ┆ 2 ┆ 3 │
└───────┴──────┴─────┴─────┘
shape: (1, 4)
┌───────┬──────┬─────┬─────┐
│ first ┆ last ┆ min ┆ max │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪══════╪═════╪═════╡
│ 3 ┆ 2 ┆ 2 ┆ 3 │
└───────┴──────┴─────┴─────┘
Thanks both!
I am in favor of it but, ideally, i would like to see a solution when keep='all' (for eager ones) and columns have a different number of modes
in the eager case, Series is supported, so people can get their own Series out, which don't have to have the same length
struct still requires the length lengths
have you thought about adding Expr.(first|last) and requiring that after instead?
yeah but i think we may need some divergence from polars anyway, because when order isn't defined, we'd want to require something like .mode().any_value(), and Polars doesn't have any_value either
have you thought about adding Expr.(first|last) and requiring that after instead?
yeah but i think we may need some divergence from polars anyway, because when order isn't defined, we'd want to require something like
.mode().any_value(), and Polars doesn't haveany_valueeither
@MarcoGorelli we could do:
pl.col("a").mode().sample(1)
true, but then Expr.sample(n) with n>1 is something that we couldn't support for any non-polars lazy backend, so we'd be adding it so it can only be called as sample(1)
true, but then
Expr.sample(n)withn>1is something that we couldn't support for any non-polars lazy backend, so we'd be adding it so it can only be called assample(1)
Also true 🤔
Have there been any attempts upstream to get polars to extend Expr.mode?
Related
- [ ] https://github.com/pola-rs/polars/issues/19653
- [ ] https://github.com/pola-rs/polars/issues/14363
i don't think so, but polars makes more ordering guarantees than sql so perhaps there just having .first is enough
I think we can do this, but keep doesn't need to be a required argument, so that nw.col('a').mode() does the same thing as pl.col('a').mode()
In which case, this would be a backwards-compatible change
I think we can do this, but
keepdoesn't need to be a required argument, so thatnw.col('a').mode()does the same thing aspl.col('a').mode()In which case, this would be a backwards-compatible change
@MarcoGorelli Is the reason we'd want this as a single method, so that it could be used in group_by().agg(...) - without triggering a complex aggregation?
I feel like if that wasn't a factor, any of the options in stable would be fine (providing we implement them 😄):
Code block
import polars as pl
from polars import col
data = {"a": [1, 2, 2, 3, 3, 4], "b": [1, 2, 3, 4, 5, 6]}
df = pl.DataFrame(data)
a = col("a")
b = col("b")
mode = col("a").mode()
stable = (
mode.min().alias("mode->min"),
a.sort().mode().first().alias("sort->mode->first"),
a.sort().mode().get(0).alias("sort->mode->get(0)"),
mode.max().alias("mode->max"),
a.sort().mode().last().alias("sort->mode->last"),
a.sort().mode().get(-1).alias("sort->mode->get(-1)"),
)
unstable = (
mode.first().alias("mode->first"),
mode.last().alias("mode->last"),
mode.get(0).alias("mode->get(0)"),
a.sort_by(a).mode().first().alias("sort_by(self)->mode->first"),
a.sort_by(b).mode().first().alias("sort_by(other)->mode->first"),
)
with pl.Config(tbl_width_chars=200):
print("Stable", df.select(stable), sep="\n")
print("Unstable", df.select(unstable), sep="\n")
Outputs
Stable
shape: (1, 6)
┌───────────┬───────────────────┬────────────────────┬───────────┬──────────────────┬─────────────────────┐
│ mode->min ┆ sort->mode->first ┆ sort->mode->get(0) ┆ mode->max ┆ sort->mode->last ┆ sort->mode->get(-1) │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═══════════════════╪════════════════════╪═══════════╪══════════════════╪═════════════════════╡
│ 2 ┆ 2 ┆ 2 ┆ 3 ┆ 3 ┆ 3 │
└───────────┴───────────────────┴────────────────────┴───────────┴──────────────────┴─────────────────────┘
Unstable
shape: (1, 5)
┌─────────────┬────────────┬──────────────┬────────────────────────────┬─────────────────────────────┐
│ mode->first ┆ mode->last ┆ mode->get(0) ┆ sort_by(self)->mode->first ┆ sort_by(other)->mode->first │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════════╪════════════╪══════════════╪════════════════════════════╪═════════════════════════════╡
│ 2 ┆ 2 ┆ 3 ┆ 2 ┆ 3 │
└─────────────┴────────────┴──────────────┴────────────────────────────┴─────────────────────────────┘
FWIW, pd.Series.mode always returns a pd.Series - with no keep behavior
import pandas as pd
data = {"a": [1, 2, 2, 3, 3, 4], "b": [1, 2, 3, 4, 5, 6]}
>>> pd.Series(data["a"]).mode()
0 2
1 3
dtype: int64
I do still think raising/adding to an upstream issue (https://github.com/narwhals-dev/narwhals/issues/1793#issuecomment-2776634007) would be helpful 🙏
Even if it just leads to a clarification on why the current pl.Expr.mode behavior is desirable upstream
If we had .any_value(), then nw.col('a').mode().any_value() would do, but that would still be a deviation from Polars
.mode().first() suffices in Polars, but I don't think it really suffices for us (for a start, for lazy frames we shouldn't be making guarantees about what comes first)
If we had
.any_value(), thennw.col('a').mode().any_value()would do, but that would still be a deviation from Polars
.mode().first()suffices in Polars, but I don't think it really suffices for us (for a start, for lazy frames we shouldn't be making guarantees about what comes first)
@MarcoGorelli did you see this? https://github.com/narwhals-dev/narwhals/pull/2528#discussion_r2083300327
duckdb behaves the sames as polars for first without an explicit sort/order by
- https://duckdb.org/docs/stable/sql/functions/aggregates#arbitraryarg
If we didn't restrict first to needing an order (and just documented it's behavior) then we'd already have an any_value wouldn't we? 🤔
If we didn't restrict first to needing an order (and just documented it's behavior) then we'd already have an any_value wouldn't we? 🤔
i'm not really keen on this, assuming order when it's not there is a common source of errors
If we didn't restrict first to needing an order (and just documented it's behavior) then we'd already have an any_value wouldn't we? 🤔
i'm not really keen on this, assuming order when it's not there is a common source of errors
Wouldn't nw.col("a").mode(keep="first") also require assuming an order?
In polars, you'd still need to do pl.col("a").sort().mode().first() to get the first value in a predictable way
Yes but mode(keep='any') wouldn't
https://github.com/narwhals-dev/narwhals/issues/1793#event-18368157011
😄
~~feat: add ~~required~~ keep argument to Expr.mode~~
feat: add optional keep argument to Expr.mode