polars icon indicating copy to clipboard operation
polars copied to clipboard

Ability to set masked values within each group in groupby context

Open Teamon9161 opened this issue 2 years ago • 5 comments

Since rank does not handle null values, I want to write a rank function that can handle null values.

import numpy as np
import polars as pl

df = pl.DataFrame({
    'group': ['a'] * 3 + ['b'] * 3,
    'value': [2, 1, None, 4, 5, 6],
})
df
shape: (6, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ str   ┆ i64   │
╞═══════╪═══════╡
│ a     ┆ 2     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a     ┆ 1     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a     ┆ null  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 4     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 5     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 6     │
└───────┴───────┘

It works well if I didn't use groupby since I can use when-then-otherwise to set values.

def valid_rank(expr: pl.Expr, reverse=False):
    """handle null values when rank"""
    FLOAT_MAX, FLOAT_MIN = np.finfo(float).max, np.finfo(float).min
    mask = expr.is_null()
    expr = expr.fill_null(FLOAT_MIN) if reverse else expr.fill_null(FLOAT_MAX) 
    return pl.when(~mask).then(expr.rank(reverse=reverse)).otherwise(None)

df.with_column(valid_rank(pl.col('value')))
shape: (6, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ str   ┆ f32   │
╞═══════╪═══════╡
│ a     ┆ 2.0   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a     ┆ 1.0   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a     ┆ null  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 3.0   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 4.0   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b     ┆ 5.0   │
└───────┴───────┘

However, in groupby context, the predicate col("value").is_not_null() in when->then->otherwise is not an aggregation so I will get

ComputeError: the predicate 'not(col("value").is_null())' in 'when->then->otherwise' is not a valid aggregation and might produce a different number of rows than the groupby operation would

Usually I have to make some calculations within each group after rank and I am worried about performance if I use partition_by to split the DataFrame. So I hope that Polars can have expressions like np.putmask or similar functions that can set values within each group.

def valid_rank(expr: pl.Expr, reverse=False):
    """handle null values when rank"""
    FLOAT_MAX, FLOAT_MIN = np.finfo(float).max, np.finfo(float).min
    mask = expr.is_null()
    expr = expr.fill_null(FLOAT_MIN) if reverse else expr.fill_null(FLOAT_MAX)
    # return pl.putmask(expr.rank(reverse=reverse), mask, None)  # hope
    # return expr.rank(reverse=reverse).set(mask, None)  # hope

Teamon9161 avatar Aug 05 '22 16:08 Teamon9161

Can you make a stackoverlfow question? Then others can learn from the answer.

ritchie46 avatar Aug 05 '22 16:08 ritchie46

Sorry to bother you, I'm confused about two things:

(1) Ignoring the issue of null values: I also want to get the rank of each value within its group. The only way I can see to do this is by partitioning the dataframe by group and applying .rank() separately on each piece. Can you explain how this can be done with .groupby()? I thought groupby always aggregates each group to a single row.

(2) Since the rank of a null value is clearly supposed to be null, shouldn't it be regarded as a bug that rank fills nulls without raising a value error?

s-banach avatar Aug 20 '22 16:08 s-banach

(1) use over or (groupby + explode).

import numpy as np
import polars as pl

df = pl.DataFrame({
    'group': ['a'] * 3 + ['b'] * 3,
    'value': [2, 1, None, 4, 5, 6],
})
df.select(pl.col('value').rank().over('group'))

┌───────┬───────┬──────┐
│ group ┆ value ┆ rank │
│ ---   ┆ ---   ┆ ---  │
│ str   ┆ i64   ┆ f32  │
╞═══════╪═══════╪══════╡
│ a     ┆ 2     ┆ 3.0  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ a     ┆ 1     ┆ 2.0  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ a     ┆ null  ┆ 1.0  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b     ┆ 4     ┆ 1.0  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b     ┆ 5     ┆ 2.0  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b     ┆ 6     ┆ 3.0  │
└───────┴───────┴──────┘

(2) It should, but if there is an expression that can set masked values within each group in groupby context, we can easily handle NaN values in other functions.

Teamon9161 avatar Aug 20 '22 16:08 Teamon9161

On Stack Overflow, I've prototyped an approach (with working code) that will hopefully strike an acceptable balance among:

  • ease-of-use (including convenience methods)
  • impact to code (very little)
  • required changes to the Polars API (none)
  • performance

It's a long post. I would have included it here, but the code requires no changes to the Polars API.

And thank you, @ritchie46 for tolerating all the issues I logged recently as I developed this. (#4492,#4444, #4360, #4433 ... to name a few)

cbilot avatar Aug 20 '22 21:08 cbilot