Should `clip` bounds propagate nulls?
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
df = pl.DataFrame({"a": [0, 1, 2], "min": [1, None, 1]})
result = df.select(pl.col("a").clip(pl.col("min")))
print(result)
Log output
shape: (3, 1)
┌──────┐
│ a │
│ --- │
│ i64 │
╞══════╡
│ 1 │
│ null │
│ 2 │
└──────┘
Issue description
If the bound contains a null value, it is currently treated as "bound is unknown, so the result of the operation is unknown".
You could also argue that a null value means "no bound is specified, so the original value must be retained".
I'm not sure about this one - should this be changed?
Expected behavior
Preserve original value?
shape: (3, 1)
┌──────┐
│ a │
│ --- │
│ i64 │
╞══════╡
│ 1 │
│ 1 │
│ 2 │
└──────┘
Installed versions
main
Strong opinion that None/null bounds should NOT propagate!
Given `[1, 2, 3, 4]`:
- `clip(lower_bound=2, upper_bound=None)` -> `[2, 2, 3, 4]` (fixed lower & don't care about upper bound)
- `clip(lower_bound=None, upper_bound=3)` -> `[1, 2, 3, 3]` (fixed upper & don't care about lower bound)
- `clip(lower_bound=None, upper_bound=None)` -> `[null, null, null, null]` (I expect "don't care about bounds", but it "drops" all values?)
Feels suuuuper strange and unexpected.
Also just tried some examples and there seem to be quite some inconsistencies between `clip(min, max)` and `clip(pl.col('min'), pl.col('max'))`. They should probably lead to the same results?
pl.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "min": None,
        "max": None,
    }
).with_columns(
    clip_none_none=pl.col("a").clip(None, None),
    clip_col_none_none=pl.col("a").clip(
        pl.col("min"),
        pl.col("max"),
    ),
    clip_min_none=pl.col("a").clip(2, None),
    clip_col_min_none=pl.col("a").clip(
        pl.col("min").fill_null(2),
        pl.col("max"),
    ),
    clip_none_max=pl.col("a").clip(None, 3),
    clip_col_none_max=pl.col("a").clip(
        pl.col("min"),
        pl.col("max").fill_null(3),
    ),
    clip_min_max=pl.col("a").clip(2, 3),
    clip_col_min_max=pl.col("a").clip(
        pl.col("min").fill_null(2),
        pl.col("max").fill_null(3),
    ),
)
shape: (4, 11)
┌─────┬──────┬──────┬───────────┬───────────────┬──────────┬──────────────┬──────────┬──────────────┬─────────┬─────────────┐
│ a ┆ min ┆ max ┆ none_none ┆ col_none_none ┆ min_none ┆ col_min_none ┆ none_max ┆ col_none_max ┆ min_max ┆ col_min_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ null ┆ null ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╪═══════════╪═══════════════╪══════════╪══════════════╪══════════╪══════════════╪═════════╪═════════════╡
│ 1 ┆ null ┆ null ┆ 1 ┆ null ┆ 2 ┆ null ┆ 1 ┆ null ┆ 2 ┆ 2 │
│ 2 ┆ null ┆ null ┆ 2 ┆ null ┆ 2 ┆ null ┆ 2 ┆ null ┆ 2 ┆ 2 │
│ 3 ┆ null ┆ null ┆ 3 ┆ null ┆ 3 ┆ null ┆ 3 ┆ null ┆ 3 ┆ 3 │
│ 4 ┆ null ┆ null ┆ 4 ┆ null ┆ 4 ┆ null ┆ 3 ┆ null ┆ 3 ┆ 3 │
└─────┴──────┴──────┴───────────┴───────────────┴──────────┴──────────────┴──────────┴──────────────┴─────────┴─────────────┘
> Also just tried some examples and there seem to be quite some inconsistencies between `clip(min, max)` and `clip(pl.col('min'), pl.col('max'))`. They should probably lead to the same results?
Not setting a bound is (currently) different from encountering a null in a bound column. `clip(None, None)` is a no-op and will give you back the original column.
My point is that there are inconsistencies when using a Python literal vs. a Polars literal vs. a column in the bounds:
pl.DataFrame({"x": [0], "min": None}).with_columns(
clip_col=pl.col("x").clip(lower_bound=pl.col("min")),
clip_lit=pl.col("x").clip(lower_bound=None),
clip_li2=pl.col("x").clip(lower_bound=pl.lit(None)),
)
shape: (1, 5)
┌─────┬──────┬──────────┬──────────┬───────────┐
│ x ┆ min ┆ clip_col ┆ clip_lit ┆ clip_lit2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ null ┆ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════════╪══════════╪═══════════╡
│ 0 ┆ null ┆ null ┆ 0 ┆ null │
└─────┴──────┴──────────┴──────────┴───────────┘
The three columns should imo definitely return the same value! `None`, `pl.lit(None)` and `pl.col(x)` which evaluates to None should lead to the same output imo. All are using an "empty" lower bound.
I would expect `lower_bound=None` to mean the same thing as `lower_bound = -inf`, and `upper_bound=None` to mean `upper_bound = inf`. From a math point of view, an absent bound means the variable is unbounded.
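For a float column, this interpretation can be emulated today by filling null bounds with infinities (a sketch illustrating the point, not current `clip` semantics for null bounds):

import polars as pl

df = pl.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "min": [2.0, None, 2.0, None]})
# Filling a null lower bound with -inf behaves like "no bound" for that row.
print(
    df.with_columns(
        clipped=pl.col("a").clip(lower_bound=pl.col("min").fill_null(float("-inf")))
    )
)

This only works for floating-point columns; integer columns have no infinity value, which is exactly the "which number to choose?" problem raised further down.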
Can we please continue this discussion or get more people involved if necessary? I see 2 problems here:
- Inconsistent
- Logical
1) Inconsistent
pl.DataFrame({"x": [1, 2, 3]}).with_columns(
plexpr_none=pl.col("x").clip(lower_bound=pl.lit(None)),
python_none=pl.col("x").clip(lower_bound=None),
)
shape: (3, 3)
┌─────┬─────────────┬─────────────┐
│ x ┆ plexpr_none ┆ python_none │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════════════╪═════════════╡
│ 1 ┆ null ┆ 1 │
│ 2 ┆ null ┆ 2 │
│ 3 ┆ null ┆ 3 │
└─────┴─────────────┴─────────────┘
These two variants should imo clearly lead to the same result, but they treat the input completely differently:
- expressions: `None` lower/upper bound -> result = `null`
- Python vars: `None` lower/upper bound -> result = no bound
This is a big problem because it is inconsistent and super unintuitive from a user perspective, and is different from any other polars function I can think of!
2) Logical
- a non-existent (null/None) bound should mean "open" / infinite. Why should a null bound "delete" the values?
- there is no good solution if nulls propagate but you do not want that: `.clip(lower_bound=pl.col('min').fill_null(-999999))`. Which number to choose? Python numbers have no bounds.
`min(null, 1)` should not be `1`.
`clip` bounds should propagate `null`s, otherwise it is performing actions which other users may not have asked for. `clip` should only deal with integers and floats. You can use other functions to handle nulls before or after clipping.
However, I do understand the sentiment. `clip` can have a `strict` flag which throws an error if there are `null`s or `nan`s in the column, prompting the user to handle the situation.
> `min(null, 1)` should not be `1`.
@arnabanimesh In SQL you have `LEAST` and `GREATEST` functions, and they do in fact behave as `LEAST(null, 1) == 1`.
Our consistent principle is that `null` indicates that the data is simply missing. It should be treated as if it doesn't exist. I would thus argue that a null argument to `clip` indicates no bound. Thus I would say, clip bounds should not propagate nulls, only the main argument should.
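Polars' own aggregations already follow this "missing data is skipped" convention, which parallels `LEAST(null, 1)` returning `1` in Postgres/DuckDB (a quick illustration):

import polars as pl

# The null is treated as missing and skipped, so the minimum of [null, 1] is 1.
s = pl.Series("vals", [None, 1])
print(s.min())  # 1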
> @arnabanimesh In SQL you have `LEAST` and `GREATEST` functions, and they do in fact behave as `LEAST(null, 1) == 1`.
It seems this is not the case in MySQL at least: Link
If we are doing this, we can have a `clip_nulls` flag in the `clip` function and have it default to `False` so that it is not a breaking change.
> Our consistent principle is that `null` indicates that the data is simply missing. It should be treated as if it doesn't exist. I would thus argue that a null argument to `clip` indicates no bound. Thus I would say, clip bounds should not propagate nulls, only the main argument should.
yes! 😃
Pandas also does it like this: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.clip.html
import pandas as pd

df = pd.DataFrame({'x': [5, 5, 5, 5]})
s = pd.Series([2, None, 10, None])
df.assign(clipped=df.clip(lower=s, axis=0))
x clipped
0 5 5
1 5 5
2 5 10
3 5 5
> It seems this is not the case in MySQL at least: Link
It is in Postgres and DuckDB. It does appear to have been standardized to propagate nulls in SQL 2023, but I doubt the wisdom of that.
> If we are doing this, we can have a `clip_nulls` flag in the `clip` function and have it default to `False` so that it is not a breaking change.
I would not be opposed to this, although with a better name.
> If we are doing this, we can have a `clip_nulls` flag in the `clip` function and have it default to `False` so that it is not a breaking change.
> I would not be opposed to this, although with a better name.
I honestly don't understand how this is consistent and how to explain this to users.
If we have `.clip(lower_bound=...)`:
- `None` -> ignore the bound
- `pl.lit(None)` or `pl.col("min")` -> propagate `null`

I would argue the user expects `None` & `pl.lit(None)` & columns with `null` to behave the same, like in (all?!) other Polars functions. If I pass in the value `4` somewhere, I expect the same if I enter `pl.lit(4)` or `pl.col("filled_with_4")`.
Not sure if I am weird here but I seriously don't understand 😨 😆
> Not sure if I am weird here but I seriously don't understand
In Python, we often parse `None` input as 'input was not specified'. It is definitely not always the same as `lit(None)`, which is an expression of data type `Null`.
We do have a specialized `no_default` input type which is used when `None` is a valid input value. Maybe it should be used here.
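For readers unfamiliar with the pattern, this is roughly how a sentinel like `no_default` lets a Python API tell "argument not passed" apart from an explicit `None` (an illustrative sketch with hypothetical names, not the actual Polars internals):

# Hypothetical sentinel sketch; only the pattern itself is the point.
class _NoDefault:
    def __repr__(self) -> str:
        return "<no_default>"

no_default = _NoDefault()

def clip(expr, lower_bound=no_default):
    if lower_bound is no_default:
        return "argument not passed: no lower bound"
    if lower_bound is None:
        return "explicit None: could map to a null bound"
    return f"lower bound = {lower_bound!r}"

print(clip("x"))        # argument not passed: no lower bound
print(clip("x", None))  # explicit None: could map to a null bound
print(clip("x", 2))     # lower bound = 2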
> In Python, we often parse `None` input as 'input was not specified'. It is definitely not always the same as `lit(None)`, which is an expression of data type `Null`.
If you were to do `clip(None, upper, propagate_nulls=True)`, then the output would be full-null. I do think indeed, as @stinodego mentions, we need the `no_default` argument here if we were to add `propagate_nulls`.
Overall I'm a bit more tempted to not add `propagate_nulls` than to add the `no_default` interface. Users can always use `pl.when` if necessary.
> In Python, we often parse `None` input as 'input was not specified'. It is definitely not always the same as `lit(None)`, which is an expression of data type `Null`.
> We do have a specialized `no_default` input type which is used when `None` is a valid input value. Maybe it should be used here.
thanks for the explanation, good to know 😄 👍
I also don't think we should propagate nulls. And I also don't think it's worth a parameter. If you want to propagate null bounds you could use `when` like:
pl.when(pl.col("min").is_not_null() | pl.col("max").is_not_null()).then(pl.col("value").clip("min", "max"))
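For completeness, that workaround in a runnable form (column names borrowed from the earlier examples):

import polars as pl

df = pl.DataFrame(
    {"value": [1, 2, 3, 4], "min": [2, None, None, 2], "max": [3, 3, None, None]}
)
# Clip only where at least one bound is present; rows where both bounds are
# null fall through the `when` and come out as null.
print(
    df.with_columns(
        clipped=pl.when(pl.col("min").is_not_null() | pl.col("max").is_not_null())
        .then(pl.col("value").clip("min", "max"))
    )
)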
So we should change the default behaviour as a breaking change. This will also make `None` input behave the same as `pl.lit(None)`, so we don't have to change the input parsing.