
`fill_null` resulting data type is incorrect in some cases

Open stinodego opened this issue 1 year ago • 10 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

lf = pl.LazyFrame({"a": [1, 2, None]}, schema={"a": pl.UInt32})

result = lf.select(pl.col("a").fill_null(999))

print(result.schema)  # {'a': Int64}
print(result.collect().schema)  # {'a': UInt32}

Log output

No response

Issue description

The result should be the supertype of the original column and the fill value. In this case, the fill value is of type Int32, so the supertype is Int64. The lazy schema is correct, but the result is not.

In this case, the fill value fits in the original data type (UInt32), but we don't want the result to be dependent on the data.

Expected behavior

Result after collecting should be Int64.

Installed versions

main

stinodego avatar Jan 17 '24 13:01 stinodego

If you do fill_null(x) with x=999.0 or x="999", evidently the dtype of the output depends on the value of x. So I don't understand what it means to say "we don't want the dtype of the result to depend on the value of x".

If you go this way, I hope we can at least still specify fill_null(pl.lit(999, dtype=pl.UInt32)) to get the old behavior. I wonder why 999 isn't automatically interpreted as a literal of the smallest integer type that holds 999.

s-banach avatar Jan 17 '24 14:01 s-banach

> If you do fill_null(x) with x=999.0 or x="999", evidently the dtype of the output depends on the value of x. So I don't understand what it means to say "we don't want the dtype of the result to depend on the value of x".

The input gets converted to an expression. Obviously lit("a") leads to a different type than lit(10). The type of that expression determines the output type.

> If you go this way, I hope we can at least still specify fill_null(pl.lit(999, dtype=pl.UInt32)) to get the old behavior.

fill_null takes expression input, so you can.

> I wonder why 999 isn't automatically interpreted as a literal of the smallest integer type that holds 999.

When calling lit(int), it is interpreted as an Int32, or if that doesn't fit, Int64 or UInt64. We could do Int8 or Int16 but those are prone to overflow so we default to Int32. In some cases, such as this specific example, smaller inferred types would be better. You can always control it yourself.

stinodego avatar Jan 17 '24 14:01 stinodego

Rather than the supertype, would it be better to maintain the column type, and raise an error if the literal can't be interpreted as that type? I think it's better to have the user explicitly cast rather than implicitly doing it for them during the fill_null, and that they should be responsible for ensuring the value they are filling with fits in the dtype.

nameexhaustion avatar Jan 17 '24 14:01 nameexhaustion

That sounds reasonable but is going to cause unexpected behavior. It's very similar to https://github.com/pola-rs/polars/issues/8090.

import polars as pl

s = pl.Series([1, None, 2])
result = s.fill_null(1.5)  # 1.5 is silently cast to an integer
print(result)
shape: (3,)
Series: '' [i64]
[
        1
        1  <- oops
        2
]

It's better to upcast the original column to float.

stinodego avatar Jan 18 '24 10:01 stinodego

Are you sure it's better to upcast? Why not raise if the value isn't compatible with the column's dtype?

Related, pandas is going in the opposite direction here (as in, moving away from upcasting on the user's behalf) https://pandas.pydata.org/pdeps/0006-ban-upcasting.html

MarcoGorelli avatar Jan 18 '24 23:01 MarcoGorelli

As I have written in the other issue which was closed:

Not so long ago in the R language the tidyverse team had similar issues and they came up with the vctrs package to solve this and provide some solid theory to it as well. Important concepts are type and size stability and prototypes and sizes. Another interesting idea is the difference between casting and coercion. Introducing this solved a lot of issues and made all of the tidyverse (so in particular dplyr and tidyr) so much more predictable and also much easier to understand.

> I think I'd expect fill_null to only accept
>
>   • a value of the same type
>   • null
>
> and to raise on anything else

Looking at this with the theory from the vctrs package in mind I would argue that this is not quite correct. Instead, it should try to find a common type and cast the values to this common type. So, one could still fill Int64 with Int32 and I think also Int64 with Float64:

df = pl.DataFrame({"x": [0, None]})
df.with_columns(pl.col("x").fill_null(2.5))
shape: (2, 1)
┌─────┐
│ x   │
│ --- │
│ f64 │
╞═════╡
│ 0.0 │
│ 2.5 │
└─────┘

Originally posted by @mgirlich in https://github.com/pola-rs/polars/issues/13411#issuecomment-1900129031

And inserting a string into an integer series should error.

mgirlich avatar Jan 22 '24 05:01 mgirlich

Upcasting is one of the worst parts of pandas, so it's surprising to see polars going in that direction!

Just add strict=True to fill_null, like with cast:

pl.Series([1, None, 2]).fill_null(1.5)  # error
pl.Series([1, None, 2]).fill_null(1.5, strict=False)  # upcasts

Wainberg avatar Feb 01 '24 23:02 Wainberg

Also, should the upcast be based just on the dtype, or also on the value? I would intuitively expect this to upcast to Int32 because -1 is representable as an Int32. However, it actually upcasts to Int64 because that's the supertype of Int32 and UInt32. This seems overly conservative.

>>> pl.Series([1, None], dtype=pl.UInt32).fill_null(pl.lit(-1, dtype=pl.Int32))
shape: (2,)
Series: '' [i64]
[
        1
        -1
]

Wainberg avatar Feb 01 '24 23:02 Wainberg

Another benefit of basing the target dtype on the value you're filling nulls with, not just on the value's dtype, is that you can do:

pl.Series([1, None], dtype=pl.UInt32).fill_null(-1, strict=False)

and automatically recognize that the -1 is representable as an Int32 even though it's an Int64, allowing you to cast to Int32.

Wainberg avatar Feb 02 '24 00:02 Wainberg

> Are you sure it's better to upcast? Why not raise if the value isn't compatible with the column's dtype?

We should actually raise, but we do not yet have the infrastructure to do so. That is why I initially leaned toward upcasting to the supertype.

Internally, we have cast/strict cast, but neither of these can be made to throw an error on conversion of float to integer. We actually need some type of coerce as @mgirlich mentioned. I'll make an issue for that so we can properly address this.

To be clear, the initial data type should be preserved after fill_null.

stinodego avatar Feb 11 '24 10:02 stinodego

I just got bitten by this and want to mention that in some cases using replace() might be better to ensure that the datatype of the filled column doesn't change:

import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, None],
        "b": [4, None, 6],
    }
)

df.with_columns(pl.col("b").fill_null("a"))

# shape: (3, 2)
# ┌──────┬─────┐
# │ a    ┆ b   │
# │ ---  ┆ --- │
# │ i64  ┆ str │ # <-------------------- datatype changed from i64 to str
# ╞══════╪═════╡
# │ 1    ┆ 4   │
# │ 2    ┆ a   │
# │ null ┆ 6   │
# └──────┴─────┘

df.with_columns(pl.col("b").replace(None, "a"))
# polars.exceptions.InvalidOperationError: conversion from `str` to `i64` failed in 
# column 'literal' for 1 out of 1 values: ["a"]

etiennebacher avatar Aug 16 '24 14:08 etiennebacher