polars icon indicating copy to clipboard operation
polars copied to clipboard

No warning thrown when using pl.clip on int column

Open nicolas-mng opened this issue 1 year ago • 7 comments

Description

Hi, I have this situation occurring to me and I think that polars throwing an error would have helped me debug.

# Case 1 with an int64 polars DataFrame
# Trying to clip a column between min_clip and max_clip, both being float values
import polars as pl
import numpy as np

df = pl.DataFrame({"column": np.arange(100, dtype=np.int64)})
min_clip, max_clip = 0.2, 20
df = df.with_columns(
                [
                    pl.lit(min_clip).alias("min_clip"),  # just here to show the actual values and types
                    pl.lit(max_clip).alias("max_clip"),
                    (pl.col("column").clip(min_clip, max_clip)).alias("column_clipped"),
                ])

┌───────────┬───────────────────┬───────────────────┬───────────────────┐
│ column      ┆ min_clip       ┆ max_clip      ┆ column_clipped│
│ ---       ┆ ---               ┆ ---               ┆ ---               │
│ i64       ┆ f64               ┆ i32               ┆ i64               │
╞═══════════╪═══════════════════╪═══════════════════╪═══════════════════╡
│ 0         ┆ 0.2               ┆ 20                ┆ 0 <- the min clipped value here is wrong, it should be 0.2
│ 1         ┆ 0.2               ┆ 20                ┆ 1                 │
│ 2         ┆ 0.2               ┆ 20                ┆ 2                 │
│ 3         ┆ 0.2               ┆ 20                ┆ 3                 │
│ …         ┆ …                 ┆ …                 ┆ …                 │
│ 96        ┆ 0.2               ┆ 20                ┆ 20                │
│ 97        ┆ 0.2               ┆ 20                ┆ 20                │
│ 98        ┆ 0.2               ┆ 20                ┆ 20                │
│ 99        ┆ 0.2               ┆ 20                ┆ 20                │
└───────────┴───────────────────┴───────────────────┴───────────────────┘

# Case 2 with a float32 polars DataFrame
df = pl.DataFrame({"column": np.arange(100, dtype=np.float32)})
┌───────────┬───────────────────┬───────────────────┬───────────────────┐
│ column       ┆ min_clip      ┆ max_clip      ┆ column_clipped│
│ ---       ┆ ---               ┆ ---               ┆ ---               │
│ f32       ┆ f64               ┆ i32               ┆ f32               │
╞═══════════╪═══════════════════╪═══════════════════╪═══════════════════╡
│ 0.0       ┆ 0.2               ┆ 20                ┆ 0.2  <- it now works│
│ 1.0       ┆ 0.2               ┆ 20                ┆ 1.0               │
│ 2.0       ┆ 0.2               ┆ 20                ┆ 2.0               │
│ 3.0       ┆ 0.2               ┆ 20                ┆ 3.0               │
│ …         ┆ …                 ┆ …                 ┆ …                 │
│ 96.0      ┆ 0.2               ┆ 20                ┆ 20.0              │
│ 97.0      ┆ 0.2               ┆ 20                ┆ 20.0              │
│ 98.0      ┆ 0.2               ┆ 20                ┆ 20.0              │
│ 99.0      ┆ 0.2               ┆ 20                ┆ 20.0              │
└───────────┴───────────────────┴───────────────────┴───────────────────┘

The error must be happening because of various casting of ints and floats and adding a .cast(pl.Float32) seems to fix the issue. But in my actual use case, this operation was hidden within a more convoluted DataFrame operation so it was harder to track the origin of my bug. Maybe having polars throw a warning here would help? Something about boundaries not respected because of type mismatch?

Thanks for the work on polars!

nicolas-mng avatar Sep 28 '23 13:09 nicolas-mng

Could you fix the test case? Currently doesn't run ;)

alexander-beedie avatar Sep 28 '23 14:09 alexander-beedie

Oops sorry, it now works :)

nicolas-mng avatar Sep 28 '23 14:09 nicolas-mng

Not sure what the right way to go is here. We could have clip always return a float, but that feels inefficient...

stinodego avatar Oct 02 '23 07:10 stinodego

I guess, either autocasting to the higher of the precisions or throwing a type mismatch error would be two legitimate options. I'm worried now that a simple warning would be too quiet and I cannot think of too many use-cases where having values wrongly clipped is ok?

nicolas-mng avatar Oct 04 '23 08:10 nicolas-mng

I would assume x.clip(0.2, 20) clips all integers below 0.2 and above 20. So I would assume 0 gets clipped up to 1 in an integer column. Just got caught by surprise with this one today.

s-banach avatar Jan 11 '24 19:01 s-banach

Similar issue, no warning with Series.scatter(), scattering float values into an int column.

s-banach avatar Jan 12 '24 18:01 s-banach

I guess we should strict cast the inputs to the type of the column, rather than regular cast. That would raise an error with bad inputs and require you to cast the column manually first.

EDIT: Actually, we should use supertypes here.

EDIT2: Actually, we should not upcast the original column...

stinodego avatar Jan 12 '24 20:01 stinodego