polars
Add `top_k` in the GroupBy namespace
Problem description
It would be useful to add a top_k function to the GroupBy namespace.
Toy data frame
import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 2, 3, 4, 5],
        "b": [5.5, 0.5, 4, 10, 13, 17],
        "c": [True, True, True, False, False, True],
        "d": ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"],
    }
)
shape: (6, 4)
┌─────┬──────┬───────┬────────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str │
╞═════╪══════╪═══════╪════════╡
│ 1 ┆ 5.5 ┆ true ┆ Apple │
│ 2 ┆ 0.5 ┆ true ┆ Orange │
│ 2 ┆ 4.0 ┆ true ┆ Apple │
│ 3 ┆ 10.0 ┆ false ┆ Apple │
│ 4 ┆ 13.0 ┆ false ┆ Banana │
│ 5 ┆ 17.0 ┆ true ┆ Banana │
└─────┴──────┴───────┴────────┘
To get the row with the largest b in each group of d, one currently needs to sort first and then take the head of each group:
df.sort('b', descending=True).groupby('d', maintain_order=True).head(1)
shape: (3, 4)
┌────────┬─────┬──────┬───────┐
│ d ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ bool │
╞════════╪═════╪══════╪═══════╡
│ Banana ┆ 5 ┆ 17.0 ┆ true │
│ Apple ┆ 3 ┆ 10.0 ┆ false │
│ Orange ┆ 2 ┆ 0.5 ┆ true │
└────────┴─────┴──────┴───────┘
The suggested feature would be:
df.groupby('d').top_k(k=1, by='b')
shape: (3, 4)
┌────────┬─────┬──────┬───────┐
│ d ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ bool │
╞════════╪═════╪══════╪═══════╡
│ Banana ┆ 5 ┆ 17.0 ┆ true │
│ Apple ┆ 3 ┆ 10.0 ┆ false │
│ Orange ┆ 2 ┆ 0.5 ┆ true │
└────────┴─────┴──────┴───────┘
Just adding for reference:
We need to add support for top_k of struct dtypes.
https://stackoverflow.com/questions/76596952/
This could replace the first and last methods in the GroupBy namespace to align with the DataFrame and Expr namespaces.
We would appreciate a PR for this, but no core developer effort will go into it right now.
In principle, almost every operation would benefit from being available on the GroupBy namespace; after all, any operation can be executed over groups. Any core developer effort in this area would go towards enabling this more general use in a generic grouping context, instead of adding specific cases one by one.
You can write
df.groupby('d').agg(pl.all().top_k(k=1, by='b'))
We favor expressions over explicit functions on the GroupBy state.
Sorry for the confusion. We also don't want to add this functionality, as the desugared syntax gives all the power you need.
Are you sure? Expr.top_k doesn't have a by argument.
Ah, then the request should be to add a by argument to top_k.
Sure, thanks for reopening
One point to note is that, to do what @tzeitim wants, there should be a stable ordering guarantee (or at least an option for one). E.g. if you start with
In [40]: df = pl.DataFrame({
...: 'name': ['a', 'a', 'b'],
...: 'score': [21, 21, 22],
...: 'attempt': [1, 2, 1],
...: 'day': [6, 7, 8],
...: })
In [41]: df
Out[41]:
shape: (3, 4)
┌──────┬───────┬─────────┬─────┐
│ name ┆ score ┆ attempt ┆ day │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞══════╪═══════╪═════════╪═════╡
│ a ┆ 21 ┆ 1 ┆ 6 │
│ a ┆ 21 ┆ 2 ┆ 7 │
│ b ┆ 22 ┆ 1 ┆ 8 │
└──────┴───────┴─────────┴─────┘
and you do df.group_by('name').agg(pl.all().top_k(1, by='score')), then you expect to get
shape: (2, 4)
┌──────┬───────┬─────────┬─────┐
│ name ┆ score ┆ attempt ┆ day │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞══════╪═══════╪═════════╪═════╡
│ a ┆ 21 ┆ 1 ┆ 6 │
│ b ┆ 22 ┆ 1 ┆ 8 │
└──────┴───────┴─────────┴─────┘
and not
shape: (2, 4)
┌──────┬───────┬─────────┬─────┐
│ name ┆ score ┆ attempt ┆ day │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞══════╪═══════╪═════════╪═════╡
│ a ┆ 21 ┆ 2 ┆ 6 │
│ b ┆ 22 ┆ 1 ┆ 8 │
└──────┴───────┴─────────┴─────┘
which I think would also be a valid output of Expr.top_k(1, by='score') if there is no stable ordering guarantee.