`fill_null` resulting data type is incorrect in some cases
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
lf = pl.LazyFrame({"a": [1, 2, None]}, schema={"a": pl.UInt32})
result = lf.select(pl.col("a").fill_null(999))
print(result.schema) # {'a': Int64}
print(result.collect().schema) # {'a': UInt32}
Log output
No response
Issue description
The result should be the supertype of the original column and the fill value. In this case, the fill value is of type `Int32`, so the supertype is `Int64`. The lazy schema is correct, but the collected result is not.
Here, the fill value happens to fit in the original data type (`UInt32`), but we don't want the result type to depend on the data.
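For reference, a quick way to confirm the inferred literal dtype driving the supertype here (a minimal check; default inference of bare integer literals is assumed):
import polars as pl
# The bare fill value 999 is inferred as an Int32 literal by default,
# so the supertype of UInt32 (column) and Int32 (fill value) is Int64.
print(pl.select(pl.lit(999)).schema)  # {'literal': Int32}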
Expected behavior
Result after collecting should be `Int64`.
Installed versions
main
If you do `fill_null(x)` with `x=999.0` or `x="999"`, evidently the dtype of the output depends on the value of `x`. So I don't understand what it means to say "we don't want the dtype of the result to depend on the value of `x`".
If you go this way, I hope we can at least still specify `fill_null(pl.lit(999, dtype=pl.UInt32))` to get the old behavior.
I wonder why `999` isn't automatically interpreted as a literal of the smallest integer type that holds 999.
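For comparison, a minimal sketch of the explicitly typed literal suggested above (assuming the fill value fits the column dtype, so no promotion is needed):
import polars as pl

s = pl.Series("a", [1, 2, None], dtype=pl.UInt32)
# An explicitly typed literal keeps the fill value at UInt32,
# so the column dtype is unchanged.
print(s.fill_null(pl.lit(999, dtype=pl.UInt32)).dtype)  # UInt32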
> If you do `fill_null(x)` with `x=999.0` or `x="999"`, evidently the dtype of the output depends on the value of `x`. So I don't understand what it means to say "we don't want the dtype of the result to depend on the value of `x`".
The input gets converted to an expression. Obviously, `lit("a")` leads to a different type than `lit(10)`. The type of that expression determines the output type.
> If you go this way, I hope we can at least still specify `fill_null(pl.lit(999, dtype=pl.UInt32))` to get the old behavior.
`fill_null` takes expression input, so you can.
> I wonder why `999` isn't automatically interpreted as a literal of the smallest integer type that holds 999.
When calling `lit(int)`, it is interpreted as an `Int32`, or if that doesn't fit, `Int64` or `UInt64`. We could do `Int8` or `Int16`, but those are prone to overflow, so we default to `Int32`. In some cases, such as this specific example, smaller inferred types would be better. You can always control it yourself.
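A quick check of the inference tiers described above (a sketch; the exact thresholds assume the current defaults):
import polars as pl

# Integer literals are inferred as Int32 when they fit,
# otherwise Int64, otherwise UInt64.
for value in [999, 2**40, 2**63]:
    print(value, pl.select(pl.lit(value)).dtypes[0])
# 999 Int32
# 1099511627776 Int64
# 9223372036854775808 UInt64

# Controlling the dtype yourself:
print(pl.select(pl.lit(999, dtype=pl.Int16)).dtypes[0])  # Int16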
Rather than the supertype, would it be better to maintain the column type and raise an error if the literal can't be interpreted as that type? I think it's better to have the user explicitly cast rather than implicitly doing it for them during the `fill_null`, and that they should be responsible for ensuring the value they are filling with fits in the dtype.
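A user-side sketch of that proposal with today's API (a hypothetical workaround, not current `fill_null` behavior): strict-cast the fill value to the column dtype, so an incompatible value raises instead of changing the column type.
import polars as pl

s = pl.Series([1, None, 2], dtype=pl.UInt32)

# 999 fits in UInt32: the column dtype is maintained.
print(s.fill_null(pl.lit(999).cast(pl.UInt32, strict=True)).dtype)  # UInt32

# -1 does not fit in UInt32: the strict cast raises
# InvalidOperationError instead of silently upcasting the column.
# s.fill_null(pl.lit(-1).cast(pl.UInt32, strict=True))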
That sounds reasonable, but it is going to cause unexpected behavior. It's very similar to https://github.com/pola-rs/polars/issues/8090.
import polars as pl
s = pl.Series([1, None, 2])
result = s.fill_null(1.5) # 1.5 is silently cast to an integer
print(result)
shape: (3,)
Series: '' [i64]
[
1
1 <- oops
2
]
It's better to upcast the original column to float.
Are you sure it's better to upcast? Why not raise if the value isn't compatible with the column's dtype?
Related, pandas is going in the opposite direction here (as in, moving away from upcasting on the user's behalf) https://pandas.pydata.org/pdeps/0006-ban-upcasting.html
As I have written in the other issue, which was closed:
Not so long ago, the tidyverse team in R had similar issues, and they came up with the vctrs package to solve them and to provide some solid theory as well. Important concepts are type and size stability, and prototypes and sizes. Another interesting idea is the difference between casting and coercion. Introducing this solved a lot of issues and made all of the tidyverse (in particular dplyr and tidyr) much more predictable and also much easier to understand.
> I think I'd expect `fill_null` to only accept
> - a value of the same type
> - `null`
>
> and to raise on anything else
Looking at this with the theory from the vctrs package in mind I would argue that this is not quite correct. Instead, it should try to find a common type and cast the values to this common type. So, one could still fill Int64 with Int32 and I think also Int64 with Float64:
df = pl.DataFrame({"x": [0, None]})
df.with_columns(pl.col("x").fill_null(2.5))
shape: (2, 1)
┌─────┐
│ x │
│ --- │
│ f64 │
╞═════╡
│ 0.0 │
│ 2.5 │
└─────┘
Originally posted by @mgirlich in https://github.com/pola-rs/polars/issues/13411#issuecomment-1900129031
And inserting a string into an integer series should error.
Upcasting is one of the worst parts of pandas, so it's surprising to see polars going in that direction!
Just add `strict=True` to `fill_null`, like with `cast`:
pl.Series([1, None, 2]).fill_null(1.5) # error
pl.Series([1, None, 2]).fill_null(1.5, strict=False) # upcasts
Also, should the upcast be based just on the dtype, or also on the value? I would intuitively expect this to upcast to `Int32`, because `-1` is representable as an `Int32`. However, it actually upcasts to `Int64`, because that's the supertype of `Int32` and `UInt32`. This seems overly conservative.
>>> pl.Series([1, None], dtype=pl.UInt32).fill_null(pl.lit(-1, dtype=pl.Int32))
shape: (2,)
Series: '' [i64]
[
1
-1
]
Another benefit of basing the target dtype on the value you're filling nulls with, not just on the value's dtype, is that you can do:
pl.Series([1, None], dtype=pl.UInt32).fill_null(-1, strict=False)
and automatically recognize that the `-1` is representable as an `Int32` even though it's an `Int64`, allowing you to cast to `Int32`.
> Are you sure it's better to upcast? Why not raise if the value isn't compatible with the column's dtype?
We should actually raise, but we currently do not have the infrastructure to do so, which is why I initially leaned toward upcasting to the supertype.
Internally, we have cast/strict cast, but neither of these can be made to throw an error on conversion of float to integer. We actually need some type of coerce, as @mgirlich mentioned. I'll make an issue for that so we can properly address this.
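A minimal illustration of that gap, assuming the cast semantics described above:
import polars as pl

# Strict cast does catch out-of-range values:
# pl.Series([-1]).cast(pl.UInt32, strict=True)  # InvalidOperationError

# ...but float -> integer conversion truncates without error, so neither
# cast nor strict cast can reject a lossy fill value like 1.5:
print(pl.Series([1.5]).cast(pl.Int64, strict=True))  # [1]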
To be clear: the initial data type should be preserved after `fill_null`.
I just got bitten by this and want to mention that, in some cases, using `replace()` might be better to ensure that the datatype of the filled column doesn't change:
import polars as pl
df = pl.DataFrame(
{
"a": [1, 2, None],
"b": [4, None, 6],
}
)
df.with_columns(pl.col("b").fill_null("a"))
# shape: (3, 2)
# ┌──────┬─────┐
# │ a ┆ b │
# │ --- ┆ --- │
# │ i64 ┆ str │ # <-------------------- datatype changed from i64 to str
# ╞══════╪═════╡
# │ 1 ┆ 4 │
# │ 2 ┆ a │
# │ null ┆ 6 │
# └──────┴─────┘
df.with_columns(pl.col("b").replace(None, "a"))
# polars.exceptions.InvalidOperationError: conversion from `str` to `i64` failed in
# column 'literal' for 1 out of 1 values: ["a"]
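As a follow-up, a sketch of `replace()` with a value of the matching dtype (assuming `None` is accepted as the value to match, as in the failing call above):
df.with_columns(pl.col("b").replace(None, 0))
# shape: (3, 2)
# ┌──────┬─────┐
# │ a    ┆ b   │
# │ ---  ┆ --- │
# │ i64  ┆ i64 │  # <-- datatype preserved
# ╞══════╪═════╡
# │ 1    ┆ 4   │
# │ 2    ┆ 0   │
# │ null ┆ 6   │
# └──────┴─────┘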