polars icon indicating copy to clipboard operation
polars copied to clipboard

Should `clip` bounds propagate nulls?

Open stinodego opened this issue 1 year ago • 4 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"a": [0, 1, 2], "min": [1, None, 1]})
result = df.select(pl.col("a").clip(pl.col("min")))
print(result)

Log output

shape: (3, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ null │
│ 2    │
└──────┘

Issue description

If the bound contains a null value, it is currently treated as "bound is unknown, so the result of the operation is unknown".

You could also argue that a null value means "no bound is specified, so the original value must be retained".

I'm not sure about this one - should this be changed?

Expected behavior

Preserve original value?

shape: (3, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 1    │
│ 2    │
└──────┘

Installed versions

main

stinodego avatar Jan 14 '24 07:01 stinodego

Strong opinion that None/null bounds should NOT propagate!

Given [1, 2, 3, 4]

  • clip(lower_bound=2, upper_bound=None) -> [2, 2, 3, 4] fixed lower & don't care about upper bound
  • clip(lower_bound=None, upper_bound=3) -> [1, 2, 3, 3] fixed upper & don't care about lower bound
  • clip(lower_bound=None, upper_bound=None) -> [null, null, null, null] I expect don't care about bounds but "drops" all values?

Feels suuuuper strange and unexpected.

Also just tried some example and there seems to be quite some inconsistencies between clip(min, max) and clip(pl.col('min'), pl.col('max')). They should probably lead to the same results?:

pl.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "min": None,
        "max": None,
    }
).with_columns(
    clip_none_none=pl.col("a").clip(None, None),
    clip_col_none_none=pl.col("a").clip(
        pl.col("min"),
        pl.col("max"),
    ),

    clip_min_none=pl.col("a").clip(2, None),
    clip_col_min_none=pl.col("a").clip(
        pl.col("min").fill_null(2),
        pl.col("max"),
    ),

    clip_none_max=pl.col("a").clip(None, 3),
    clip_col_none_max=pl.col("a").clip(
        pl.col("min"),
        pl.col("max").fill_null(3),
    ),

    clip_min_max=pl.col("a").clip(2, 3),
    clip_col_min_max=pl.col("a").clip(
        pl.col("min").fill_null(2),
        pl.col("max").fill_null(3),
    ),
)

shape: (4, 11)
┌─────┬──────┬──────┬───────────┬───────────────┬──────────┬──────────────┬──────────┬──────────────┬─────────┬─────────────┐
│ a   ┆ min  ┆ max  ┆ none_none ┆ col_none_none ┆ min_none ┆ col_min_none ┆ none_max ┆ col_none_max ┆ min_max ┆ col_min_max │
│ --- ┆ ---  ┆ ---  ┆ ---       ┆ ---           ┆ ---      ┆ ---          ┆ ---      ┆ ---          ┆ ---     ┆ ---         │
│ i64 ┆ null ┆ null ┆ i64       ┆ i64           ┆ i64      ┆ i64          ┆ i64      ┆ i64          ┆ i64     ┆ i64         │
╞═════╪══════╪══════╪═══════════╪═══════════════╪══════════╪══════════════╪══════════╪══════════════╪═════════╪═════════════╡
│ 1   ┆ null ┆ null ┆ 1         ┆ null          ┆ 2        ┆ null         ┆ 1        ┆ null         ┆ 2       ┆ 2           │
│ 2   ┆ null ┆ null ┆ 2         ┆ null          ┆ 2        ┆ null         ┆ 2        ┆ null         ┆ 2       ┆ 2           │
│ 3   ┆ null ┆ null ┆ 3         ┆ null          ┆ 3        ┆ null         ┆ 3        ┆ null         ┆ 3       ┆ 3           │
│ 4   ┆ null ┆ null ┆ 4         ┆ null          ┆ 4        ┆ null         ┆ 3        ┆ null         ┆ 3       ┆ 3           │
└─────┴──────┴──────┴───────────┴───────────────┴──────────┴──────────────┴──────────┴──────────────┴─────────┴─────────────┘

Julian-J-S avatar Jan 14 '24 12:01 Julian-J-S

Also just tried some example and there seems to be quite some inconsistencies between clip(min, max) and clip(pl.col('min'), pl.col('max')). They should probably lead to the same results?:

Not setting a bound is (currently) different from encountering null in a bound column. clip(None, None) is a no-op and will give you back the original column.

stinodego avatar Jan 14 '24 12:01 stinodego

My point is that there are inconsistencies when using a literal vs polars literal vs a column in the bounds

pl.DataFrame({"x": [0], "min": None}).with_columns(
    clip_col=pl.col("x").clip(lower_bound=pl.col("min")),
    clip_lit=pl.col("x").clip(lower_bound=None),
    clip_li2=pl.col("x").clip(lower_bound=pl.lit(None)),
)

shape: (1, 5)
┌─────┬──────┬──────────┬──────────┬───────────┐
│ x   ┆ min  ┆ clip_col ┆ clip_lit ┆ clip_lit2 │
│ --- ┆ ---  ┆ ---      ┆ ---      ┆ ---       │
│ i64 ┆ null ┆ i64      ┆ i64      ┆ i64       │
╞═════╪══════╪══════════╪══════════╪═══════════╡
│ 0   ┆ null ┆ null     ┆ 0        ┆ null      │
└─────┴──────┴──────────┴──────────┴───────────┘

The three columns should imo definitely return the same value! None, pl.lit(None) and pl.col(x) which evaluates to None should lead to the same output imo. All are using an "empty" lower bound

Julian-J-S avatar Jan 14 '24 20:01 Julian-J-S

I would expect lower_bound = None means the same thing as lower_bound = -Inf and upper_bound = None means upper_bound = Inf, just from the math point of view, the bound doesn't present means the variable is unbounded.

Sage0614 avatar Jan 15 '24 22:01 Sage0614

Can we please continue this discussion or get more people involved if necessary? I see 2 problems here:

  1. Inconsistent
  2. Logical

1) Inconsistent

pl.DataFrame({"x": [1, 2, 3]}).with_columns(
    plexpr_none=pl.col("x").clip(lower_bound=pl.lit(None)),
    python_none=pl.col("x").clip(lower_bound=None),
)
shape: (3, 3)
┌─────┬─────────────┬─────────────┐
│ x   ┆ plexpr_none ┆ python_none │
│ --- ┆ ---         ┆ ---         │
│ i64 ┆ i64         ┆ i64         │
╞═════╪═════════════╪═════════════╡
│ 1   ┆ null        ┆ 1           │
│ 2   ┆ null        ┆ 2           │
│ 3   ┆ null        ┆ 3           │
└─────┴─────────────┴─────────────┘

These two variants should imo clearly lead to the same result, but they treat the input completely differently.

  • expressions: None lower/upper bound -> result = null
  • python vars: None lower/upper bound -> result = no bound

This is a big problem because it is inconsistent and super unintuitive from a user perspective, and is different from any other polars function I can think of!

2) Logical

  • non-existent (null/none) bound should mean "open" / infinite. Why should a null bound "delete" the values?
  • there is no good solution if nulls propagate but you do not want that .clip(lower_bound=pl.col('min').fill_null(-999999)) which number to choose? python numbers have no bounds.

Julian-J-S avatar Jan 18 '24 08:01 Julian-J-S

min(null,1) should not be 1.

clip bounds should propagate nulls, otherwise it is performing actions which other users may not have asked for. clip should only deal with integers and floats. You can use other functions to handle nulls before or after clipping.

However, I do understand the sentiment. clip can have a strict flag which throws an error if there are nulls or nans in the column prompting the user to handle the situation.

arnabanimesh avatar Jan 18 '24 09:01 arnabanimesh

min(null,1) should not be 1.

@arnabanimesh In SQL you have LEAST and GREATEST functions, and they do in fact behave as LEAST(null, 1) == 1.


Our consistent principle is that null indicates that the data is simply missing. It should be treated as if it doesn't exist. I would thus argue that a null argument to clip indicates no bound. Thus I would say, clip bounds should not propagate nulls, only the main argument should.

orlp avatar Jan 18 '24 11:01 orlp

@arnabanimesh In SQL you have LEAST and GREATEST functions, and they do in fact behave as LEAST(null, 1) == 1.

It seems it is not the case in mysql atleast: Link

If we are doing this, we can have a clip_nulls flag in the clip function and have it default to False so that it is not a breaking change.

arnabanimesh avatar Jan 18 '24 12:01 arnabanimesh

Our consistent principle is that null indicates that the data is simply missing. It should be treated as if it doesn't exist. I would thus argue that a null argument to clip indicates no bound. Thus I would say, clip bounds should not propagate nulls, only the main argument should.

yes! 😃 Pandas also does it like this https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.clip.html image

df = pd.DataFrame({'x': [5, 5, 5, 5]})
s = pd.Series([2, None, 10, None])
df.assign(clipped=df.clip(lower=s, axis=0))
   x  clipped
0  5        5
1  5        5
2  5       10
3  5        5

Julian-J-S avatar Jan 18 '24 12:01 Julian-J-S

It seems it is not the case in mysql atleast: Link

It is in Postgres and DuckDB. It does appear to have been standardized to propagate nulls in SQL 2023, but I doubt the wisdom of that.

If we are doing this, we can have a clip_nulls flag in the clip function and have it default to False so that it is not a breaking change.

I would not be opposed to this, although with a better name.

orlp avatar Jan 18 '24 12:01 orlp

If we are doing this, we can have a clip_nulls flag in the clip function and have it default to False so that it is not a breaking change.

I would not be opposed to this, although with a better name

I honestly don't understand how this is consistent and how to explain this to users.

If we have .clip(lower_bound=...)

  • None -> ignore the bound
  • pl.lit(None) or pl.col("min") -> propagate null

I would argue the user expects None & pl.lit(None) & columns with null to behave the same like in (all?!) other polars functions. If I pass in the value 4 somewhere I expect the same if I enter pl.lit(4) or pl.col("filled_with_4")

Not sure if I am weird here but I seriously don't understand 😨 😆

Julian-J-S avatar Jan 18 '24 12:01 Julian-J-S

Not sure if I am weird here but I seriously don't understand

In Python, we often parse None input as 'input was not specified'. It is definitely not always the same as lit(None) which is an expression of data type Null.

We do have a specialized no_default input type which is used when None is a valid input value. Maybe it should be used here.

stinodego avatar Jan 18 '24 12:01 stinodego

In Python, we often parse None input as 'input was not specified'. It is definitely not always the same as lit(None) which is an expression of data type Null.

If you were to do clip(None, upper, propagate_nulls=True), then the output would be full-null. I do think indeed as @stinodego mentions we need the no_default argument here if we were to add propagate_nulls.

Overall I'm a bit more tempted to not add propagate_nulls than to add the no_default interface. Users can always use pl.when if necessary.

orlp avatar Jan 18 '24 12:01 orlp

In Python, we often parse None input as 'input was not specified'. It is definitely not always the same as lit(None) which is an expression of data type Null.

We do have a specialized no_default input type which is used when None is a valid input value. Maybe it should be used here.

thanks for the explanation, good to know 😄 👍

Julian-J-S avatar Jan 18 '24 13:01 Julian-J-S

I also don't think we should propagate nulls. And I also don't think it's worth a parameter. If you want to propagate null bounds you could use when like:

pl.when(pl.col("min").is_not_null() | pl.col("max").is_not_null()).then(pl.col("value").clip("min", "max"))

So we should change the default behaviour as a breaking change. This will also make None input behave the same as pl.lit(None) so we don't have to change the input parsing.

stinodego avatar Jan 18 '24 17:01 stinodego