narwhals icon indicating copy to clipboard operation
narwhals copied to clipboard

feat: add required `keep` argument to `Expr.mode`

Open MarcoGorelli opened this issue 11 months ago • 18 comments

We currently have Expr.mode and Series.mode, like Polars does. What's not great about them is that they don't aggregate

I'd like to suggest a departure from Polars here. We introduce Expr.mode_any and Series.mode_any, which actually aggregate, but make no guarantees about which mode will be returned. For example, if a column contains values [1, 2, 2, 3, 3, 4], then it may return 2 or it may return 3.

EDIT

Concrete suggestion:

We could add a (required!, as this would be a break from the Polars API) argument keep to it, which could be:

  • 'first'
  • 'last'
  • 'all': keep all values. Only supported for eager backends. This would not return an aggregation, so applying nw.col('a', 'b').mode() may raise ShapeError if 'a' and 'b' have a different number of modes. But, I think that's OK
  • 'any'

'any' could be supported by all backends, the others only by eager ones

Then, for all keep values except 'all', mode would always return an aggregation, and this would solve a few problems

One annoyance is that there doesn't seem to be a groupby-mode function in pandas 😩 🤦 But, there is a nice workaround in https://github.com/pandas-dev/pandas/issues/19254 which we might be able to use?

MarcoGorelli avatar Jan 11 '25 15:01 MarcoGorelli

Alternatively, per https://github.com/narwhals-dev/narwhals/discussions/1657, if we disallow length-changing-non-aggregating expressions which aren't followed by aggregations, then we might be able to get away with just Expr.mode, and just require that it be followed by an aggregation, .e. nw.col('a').mode().first(order_by='id'). We could even introduce Expr.any_value and allow nw.col('a').mode().any_value()

MarcoGorelli avatar Jan 17 '25 10:01 MarcoGorelli

Returning a list would definitly be consistent with polars mode behavior in a group by context. We might always return a list type (and enhance a bit the list namespace with .first, .last, .__getitem__)

FBruzzesi avatar Jan 18 '25 09:01 FBruzzesi

I'm not sold on returning lists here for expressions which don't aggregate, and I think Polars is a bit internally inconsistent here. .list is also a bit problematic for some backends, like pandas non-pyarrow-backed, whereas with nw.col('a').mode().any_value() / nw.col('a').mode().first(order_by='id') we can bypass the list and have a return value which works well for all implementations

MarcoGorelli avatar Jan 18 '25 10:01 MarcoGorelli

https://github.com/narwhals-dev/narwhals/issues/2273#issuecomment-2747898004

A few months later, I think I'm leaning towards a require keep argument to mode, so that if keep='any' then mode always aggregates, and that solves a few issues

MarcoGorelli avatar Mar 24 '25 12:03 MarcoGorelli

#2273 (comment)

A few months later, I think I'm leaning towards a require keep argument to mode, so that if keep='any' then mode always aggregates, and that solves a few issues

I am in favor of it but, ideally, i would like to see a solution when keep='all' (for eager ones) and columns have a different number of modes. what if in that case we give a struct back?

skritsotalakis avatar Apr 03 '25 17:04 skritsotalakis

@MarcoGorelli have you thought about adding Expr.(first|last) and requiring that after instead?

  • https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.first.html#polars.Expr.first
  • https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.last.html#polars.Expr.last

That way we don't diverge from polars, and I imagine we could use that in other places to resolve ambiguity?

Edit

Turns out first|last aren't stable enough, but requiring some aggregation - like min|max could work?

import polars as pl
from polars import col

df = pl.DataFrame({"a": [1, 2, 2, 3, 3, 4], "b": [1, 2, 3, 4, 5, 6]})
mode = col("a").mode()

>>> df.select(
    mode.first().alias("first"),
    mode.last().alias("last"),
    mode.min().alias("min"),
    mode.max().alias("max"),
)
shape: (1, 4)
┌───────┬──────┬─────┬─────┐
│ first ┆ last ┆ min ┆ max │
│ ---   ┆ ---  ┆ --- ┆ --- │
│ i64   ┆ i64  ┆ i64 ┆ i64 │
╞═══════╪══════╪═════╪═════╡
│ 2     ┆ 2    ┆ 2   ┆ 3   │
└───────┴──────┴─────┴─────┘

shape: (1, 4)
┌───────┬──────┬─────┬─────┐
│ first ┆ last ┆ min ┆ max │
│ ---   ┆ ---  ┆ --- ┆ --- │
│ i64   ┆ i64  ┆ i64 ┆ i64 │
╞═══════╪══════╪═════╪═════╡
│ 3     ┆ 2    ┆ 2   ┆ 3   │
└───────┴──────┴─────┴─────┘

dangotbanned avatar Apr 03 '25 18:04 dangotbanned

Thanks both!

I am in favor of it but, ideally, i would like to see a solution when keep='all' (for eager ones) and columns have a different number of modes

in the eager case, Series is supported, so people can get their own Series out, which don't have to have the same length

struct still requires the length lengths

have you thought about adding Expr.(first|last) and requiring that after instead?

yeah but i think we may need some divergence from polars anyway, because when order isn't defined, we'd want to require something like .mode().any_value(), and Polars doesn't have any_value either

MarcoGorelli avatar Apr 03 '25 18:04 MarcoGorelli

have you thought about adding Expr.(first|last) and requiring that after instead?

yeah but i think we may need some divergence from polars anyway, because when order isn't defined, we'd want to require something like .mode().any_value(), and Polars doesn't have any_value either

@MarcoGorelli we could do:

pl.col("a").mode().sample(1)

dangotbanned avatar Apr 03 '25 18:04 dangotbanned

true, but then Expr.sample(n) with n>1 is something that we couldn't support for any non-polars lazy backend, so we'd be adding it so it can only be called as sample(1)

MarcoGorelli avatar Apr 03 '25 18:04 MarcoGorelli

true, but then Expr.sample(n) with n>1 is something that we couldn't support for any non-polars lazy backend, so we'd be adding it so it can only be called as sample(1)

Also true 🤔

Have there been any attempts upstream to get polars to extend Expr.mode?

Related

  • [ ] https://github.com/pola-rs/polars/issues/19653
  • [ ] https://github.com/pola-rs/polars/issues/14363

dangotbanned avatar Apr 03 '25 18:04 dangotbanned

i don't think so, but polars makes more ordering guarantees than sql so perhaps there just having .first is enough

MarcoGorelli avatar Apr 03 '25 20:04 MarcoGorelli

I think we can do this, but keep doesn't need to be a required argument, so that nw.col('a').mode() does the same thing as pl.col('a').mode()

In which case, this would be a backwards-compatible change

MarcoGorelli avatar May 20 '25 10:05 MarcoGorelli

I think we can do this, but keep doesn't need to be a required argument, so that nw.col('a').mode() does the same thing as pl.col('a').mode()

In which case, this would be a backwards-compatible change

@MarcoGorelli Is the reason we'd want this as a single method, so that it could be used in group_by().agg(...) - without triggering a complex aggregation?

I feel like if that wasn't a factor, any of the options in stable would be fine (providing we implement them 😄):

Code block

import polars as pl
from polars import col

data = {"a": [1, 2, 2, 3, 3, 4], "b": [1, 2, 3, 4, 5, 6]}
df = pl.DataFrame(data)

a = col("a")
b = col("b")
mode = col("a").mode()

stable = (
    mode.min().alias("mode->min"),
    a.sort().mode().first().alias("sort->mode->first"),
    a.sort().mode().get(0).alias("sort->mode->get(0)"),
    mode.max().alias("mode->max"),
    a.sort().mode().last().alias("sort->mode->last"),
    a.sort().mode().get(-1).alias("sort->mode->get(-1)"),
)

unstable = (
    mode.first().alias("mode->first"),
    mode.last().alias("mode->last"),
    mode.get(0).alias("mode->get(0)"),
    a.sort_by(a).mode().first().alias("sort_by(self)->mode->first"),
    a.sort_by(b).mode().first().alias("sort_by(other)->mode->first"),
)

with pl.Config(tbl_width_chars=200):
    print("Stable", df.select(stable), sep="\n")
    print("Unstable", df.select(unstable), sep="\n")
Outputs

Stable
shape: (1, 6)
┌───────────┬───────────────────┬────────────────────┬───────────┬──────────────────┬─────────────────────┐
│ mode->min ┆ sort->mode->first ┆ sort->mode->get(0) ┆ mode->max ┆ sort->mode->last ┆ sort->mode->get(-1) │
│ ---       ┆ ---               ┆ ---                ┆ ---       ┆ ---              ┆ ---                 │
│ i64       ┆ i64               ┆ i64                ┆ i64       ┆ i64              ┆ i64                 │
╞═══════════╪═══════════════════╪════════════════════╪═══════════╪══════════════════╪═════════════════════╡
│ 2         ┆ 2                 ┆ 2                  ┆ 3         ┆ 3                ┆ 3                   │
└───────────┴───────────────────┴────────────────────┴───────────┴──────────────────┴─────────────────────┘
Unstable
shape: (1, 5)
┌─────────────┬────────────┬──────────────┬────────────────────────────┬─────────────────────────────┐
│ mode->first ┆ mode->last ┆ mode->get(0) ┆ sort_by(self)->mode->first ┆ sort_by(other)->mode->first │
│ ---         ┆ ---        ┆ ---          ┆ ---                        ┆ ---                         │
│ i64         ┆ i64        ┆ i64          ┆ i64                        ┆ i64                         │
╞═════════════╪════════════╪══════════════╪════════════════════════════╪═════════════════════════════╡
│ 2           ┆ 2          ┆ 3            ┆ 2                          ┆ 3                           │
└─────────────┴────────────┴──────────────┴────────────────────────────┴─────────────────────────────┘

FWIW, pd.Series.mode always returns a pd.Series - with no keep behavior

import pandas as pd

data = {"a": [1, 2, 2, 3, 3, 4], "b": [1, 2, 3, 4, 5, 6]}

>>> pd.Series(data["a"]).mode()
0    2
1    3
dtype: int64

I do still think raising/adding to an upstream issue (https://github.com/narwhals-dev/narwhals/issues/1793#issuecomment-2776634007) would be helpful 🙏 Even if it just leads to a clarification on why the current pl.Expr.mode behavior is desirable upstream

dangotbanned avatar May 20 '25 16:05 dangotbanned

If we had .any_value(), then nw.col('a').mode().any_value() would do, but that would still be a deviation from Polars

.mode().first() suffices in Polars, but I don't think it really suffices for us (for a start, for lazy frames we shouldn't be making guarantees about what comes first)

MarcoGorelli avatar May 20 '25 16:05 MarcoGorelli

If we had .any_value(), then nw.col('a').mode().any_value() would do, but that would still be a deviation from Polars

.mode().first() suffices in Polars, but I don't think it really suffices for us (for a start, for lazy frames we shouldn't be making guarantees about what comes first)

@MarcoGorelli did you see this? https://github.com/narwhals-dev/narwhals/pull/2528#discussion_r2083300327

duckdb behaves the sames as polars for first without an explicit sort/order by

  • https://duckdb.org/docs/stable/sql/functions/aggregates#arbitraryarg

If we didn't restrict first to needing an order (and just documented it's behavior) then we'd already have an any_value wouldn't we? 🤔

dangotbanned avatar May 20 '25 16:05 dangotbanned

If we didn't restrict first to needing an order (and just documented it's behavior) then we'd already have an any_value wouldn't we? 🤔

i'm not really keen on this, assuming order when it's not there is a common source of errors

MarcoGorelli avatar May 20 '25 16:05 MarcoGorelli

If we didn't restrict first to needing an order (and just documented it's behavior) then we'd already have an any_value wouldn't we? 🤔

i'm not really keen on this, assuming order when it's not there is a common source of errors

Wouldn't nw.col("a").mode(keep="first") also require assuming an order?

In polars, you'd still need to do pl.col("a").sort().mode().first() to get the first value in a predictable way

dangotbanned avatar May 20 '25 17:05 dangotbanned

Yes but mode(keep='any') wouldn't

MarcoGorelli avatar May 20 '25 17:05 MarcoGorelli

https://github.com/narwhals-dev/narwhals/issues/1793#event-18368157011

😄

~~feat: add ~~required~~ keep argument to Expr.mode~~ feat: add optional keep argument to Expr.mode

dangotbanned avatar Jun 28 '25 10:06 dangotbanned