
Add `top_k` in the GroupBy namespace

Open tzeitim opened this issue 1 year ago • 3 comments

Problem description

It would be useful to add a top_k function to the GroupBy namespace.

Toy data frame

import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 2, 3, 4, 5],
        "b": [5.5, 0.5, 4, 10, 13, 17],
        "c": [True, True, True, False, False, True],
        "d": ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"],
    }
)
shape: (6, 4)
┌─────┬──────┬───────┬────────┐
│ a   ┆ b    ┆ c     ┆ d      │
│ --- ┆ ---  ┆ ---   ┆ ---    │
│ i64 ┆ f64  ┆ bool  ┆ str    │
╞═════╪══════╪═══════╪════════╡
│ 1   ┆ 5.5  ┆ true  ┆ Apple  │
│ 2   ┆ 0.5  ┆ true  ┆ Orange │
│ 2   ┆ 4.0  ┆ true  ┆ Apple  │
│ 3   ┆ 10.0 ┆ false ┆ Apple  │
│ 4   ┆ 13.0 ┆ false ┆ Banana │
│ 5   ┆ 17.0 ┆ true  ┆ Banana │
└─────┴──────┴───────┴────────┘

To achieve this, one currently needs to do the following:

df.sort('b', descending=True).groupby('d', maintain_order=True).head(1)
shape: (3, 4)
┌────────┬─────┬──────┬───────┐
│ d      ┆ a   ┆ b    ┆ c     │
│ ---    ┆ --- ┆ ---  ┆ ---   │
│ str    ┆ i64 ┆ f64  ┆ bool  │
╞════════╪═════╪══════╪═══════╡
│ Banana ┆ 5   ┆ 17.0 ┆ true  │
│ Apple  ┆ 3   ┆ 10.0 ┆ false │
│ Orange ┆ 2   ┆ 0.5  ┆ true  │
└────────┴─────┴──────┴───────┘

The suggested feature would be:

df.groupby('d').top_k(k=1, by='b')

shape: (3, 4)
┌────────┬─────┬──────┬───────┐
│ d      ┆ a   ┆ b    ┆ c     │
│ ---    ┆ --- ┆ ---  ┆ ---   │
│ str    ┆ i64 ┆ f64  ┆ bool  │
╞════════╪═════╪══════╪═══════╡
│ Banana ┆ 5   ┆ 17.0 ┆ true  │
│ Apple  ┆ 3   ┆ 10.0 ┆ false │
│ Orange ┆ 2   ┆ 0.5  ┆ true  │
└────────┴─────┴──────┴───────┘

tzeitim avatar Jul 24 '23 11:07 tzeitim

Just adding for reference:

We need to add support for top_k of struct dtypes.

https://stackoverflow.com/questions/76596952/

cmdlineluser avatar Jul 25 '23 10:07 cmdlineluser

This could replace the first and last methods in the GroupBy namespace to align with the DataFrame and Expr namespaces.

rjthoen avatar Mar 21 '24 21:03 rjthoen

We would appreciate a PR for this, but no core developer effort will go toward it right now.

In principle, almost every operation would benefit from being available on the GroupBy namespace; after all, any operation could be executed over groups. Any core developer effort in this area would go towards enabling this more general use in a generic grouping context, instead of adding more specific cases one by one.

orlp avatar Mar 22 '24 13:03 orlp

You can write

df.groupby('d').agg(pl.all().top_k(k=1, by='b'))

We favor expressions over explicit functions on the GroupBy state.

Sorry for the confusion. We also don't want to add this functionality, as the desugared syntax gives you all the power you need.

ritchie46 avatar Mar 28 '24 13:03 ritchie46

Are you sure? Expr.top_k doesn't have a by argument

MarcoGorelli avatar Mar 28 '24 13:03 MarcoGorelli

Ah, then the request should be adding a by argument to top_k.

ritchie46 avatar Mar 28 '24 13:03 ritchie46

Sure, thanks for reopening

One point to note is that, to do what @tzeitim wants, there should be a stable ordering guarantee (or at least an option). E.g. if you start with

In [40]: df = pl.DataFrame({
    ...:     'name': ['a', 'a', 'b'],
    ...:     'score': [21, 21, 22],
    ...:     'attempt': [1, 2, 1],
    ...:     'day': [6, 7, 8],
    ...: })

In [41]: df
Out[41]:
shape: (3, 4)
┌──────┬───────┬─────────┬─────┐
│ name ┆ score ┆ attempt ┆ day │
│ ---  ┆ ---   ┆ ---     ┆ --- │
│ str  ┆ i64   ┆ i64     ┆ i64 │
╞══════╪═══════╪═════════╪═════╡
│ a    ┆ 21    ┆ 1       ┆ 6   │
│ a    ┆ 21    ┆ 2       ┆ 7   │
│ b    ┆ 22    ┆ 1       ┆ 8   │
└──────┴───────┴─────────┴─────┘

and you do df.group_by('name').agg(pl.all().top_k(1, by='score')), then you expect to get

shape: (2, 4)
┌──────┬───────┬─────────┬─────┐
│ name ┆ score ┆ attempt ┆ day │
│ ---  ┆ ---   ┆ ---     ┆ --- │
│ str  ┆ i64   ┆ i64     ┆ i64 │
╞══════╪═══════╪═════════╪═════╡
│ a    ┆ 21    ┆ 1       ┆ 6   │
│ b    ┆ 22    ┆ 1       ┆ 8   │
└──────┴───────┴─────────┴─────┘

and not

shape: (2, 4)
┌──────┬───────┬─────────┬─────┐
│ name ┆ score ┆ attempt ┆ day │
│ ---  ┆ ---   ┆ ---     ┆ --- │
│ str  ┆ i64   ┆ i64     ┆ i64 │
╞══════╪═══════╪═════════╪═════╡
│ a    ┆ 21    ┆ 2       ┆ 7   │
│ b    ┆ 22    ┆ 1       ┆ 8   │
└──────┴───────┴─────────┴─────┘

which I think would also be a valid output of Expr.top_k(1, by='score')

MarcoGorelli avatar Mar 28 '24 14:03 MarcoGorelli