polars
Ability to set masked values within each group in groupby context
Since rank does not handle null values, I want to write a rank function that can handle them.
import numpy as np
import polars as pl
df = pl.DataFrame({
'group': ['a'] * 3 + ['b'] * 3,
'value': [2, 1, None, 4, 5, 6],
})
df
shape: (6, 2)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════╪═══════╡
│ a ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 6 │
└───────┴───────┘
It works well as long as I don't use groupby, since I can use when-then-otherwise to set the values.
def valid_rank(expr: pl.Expr, reverse=False):
"""handle null values when rank"""
FLOAT_MAX, FLOAT_MIN = np.finfo(float).max, np.finfo(float).min
mask = expr.is_null()
expr = expr.fill_null(FLOAT_MIN) if reverse else expr.fill_null(FLOAT_MAX)
return pl.when(~mask).then(expr.rank(reverse=reverse)).otherwise(None)
df.with_column(valid_rank(pl.col('value')))
shape: (6, 2)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ str ┆ f32 │
╞═══════╪═══════╡
│ a ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ 1.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 3.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 4.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 5.0 │
└───────┴───────┘
However, in the groupby context, the predicate col("value").is_not_null() in when->then->otherwise is not an aggregation, so I get:
ComputeError: the predicate 'not(col("value").is_null())' in 'when->then->otherwise' is not a valid aggregation and might produce a different number of rows than the groupby operation would
Usually I have to do some calculations within each group after ranking, and I am worried about performance if I use partition_by to split the DataFrame. So I hope that Polars can provide an expression like np.putmask, or a similar function, that can set masked values within each group.
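For reference, np.putmask overwrites array elements in place wherever a mask is true; a minimal NumPy illustration of the semantics being asked for:

```python
import numpy as np

# np.putmask(a, mask, values) overwrites a[i] with values
# (broadcast as needed) wherever mask[i] is True, in place.
a = np.array([3.0, 1.0, 2.0])
mask = np.array([False, True, False])
np.putmask(a, mask, np.nan)
# a is now [3.0, nan, 2.0]
```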
def valid_rank(expr: pl.Expr, reverse=False):
"""handle null values when rank"""
FLOAT_MAX, FLOAT_MIN = np.finfo(float).max, np.finfo(float).min
mask = expr.is_null()
expr = expr.fill_null(FLOAT_MIN) if reverse else expr.fill_null(FLOAT_MAX)
# return pl.putmask(expr.rank(reverse=reverse), mask, None) # hope
# return expr.rank(reverse=reverse).set(mask, None) # hope
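Until such an expression exists, the intended semantics can be sketched in plain NumPy. This is a hypothetical helper of my own (not a Polars API), and it uses the 'ordinal' rank method for simplicity rather than polars' default 'average':

```python
import numpy as np

def masked_rank(values, reverse=False):
    """Rank non-null entries (ordinal method), leaving nulls as NaN.

    A pure-NumPy sketch of the desired putmask-style behavior.
    """
    arr = np.asarray(values, dtype=float)  # None becomes NaN
    mask = np.isnan(arr)
    valid = arr[~mask]
    # argsort gives the positions in rank order; invert it to get ranks
    order = np.argsort(-valid if reverse else valid)
    ranks = np.empty(len(valid))
    ranks[order] = np.arange(1, len(valid) + 1)
    # scatter the ranks back, keeping NaN where the input was null
    out = np.full(arr.shape, np.nan)
    out[~mask] = ranks
    return out

# masked_rank([2, 1, None]) -> [2.0, 1.0, nan]
```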
Could you post this as a Stack Overflow question? Then others can learn from the answer.
Sorry to bother you, I'm confused about two things:
(1) Ignoring the issue of null values: I also want to get the rank of each value within its group. The only way I can see to do this is by partitioning the dataframe by group and applying .rank() separately on each piece. Can you explain how this can be done with .groupby()? I thought groupby always aggregates each group to a single row.
(2) Since the rank of a null value is clearly supposed to be null, shouldn't it be regarded as a bug that rank fills nulls without raising a value error?
(1) Use over or (groupby + explode).
import numpy as np
import polars as pl
df = pl.DataFrame({
'group': ['a'] * 3 + ['b'] * 3,
'value': [2, 1, None, 4, 5, 6],
})
df.with_column(pl.col('value').rank().over('group').alias('rank'))
shape: (6, 3)
┌───────┬───────┬──────┐
│ group ┆ value ┆ rank │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f32 │
╞═══════╪═══════╪══════╡
│ a ┆ 2 ┆ 3.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ a ┆ 1 ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ a ┆ null ┆ 1.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 4 ┆ 1.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 5 ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 6 ┆ 3.0 │
└───────┴───────┴──────┘
(2) It should, but if there were an expression that could set masked values within each group in the groupby context, we could easily handle NaN values in other functions as well.
On Stack Overflow, I've prototyped an approach (with working code) that will hopefully strike an acceptable balance among:
- ease-of-use (including convenience methods)
- impact to code (very little)
- required changes to the Polars API (none)
- performance
It's a long post. I would have included it here, but the code requires no changes to the Polars API.
And thank you, @ritchie46, for tolerating all the issues I logged recently as I developed this (#4492, #4444, #4360, #4433 ... to name a few).