polars
polars copied to clipboard
No warning thrown when using pl.clip on int column
Description
Hi, I have this situation occurring to me and I think that polars throwing an error would have helped me debug.
# Case 1 with an int64 polars DataFrame
# Trying to clip a column between min_clip and max_clip, both being float values
import polars as pl
import numpy as np
df = pl.DataFrame({"column": np.arange(100, dtype=np.int64)})
min_clip, max_clip = 0.2, 20
df = df.with_columns(
[
pl.lit(min_clip).alias("min_clip"), # just here to show the actual values and types
pl.lit(max_clip).alias("max_clip"),
(pl.col("column").clip(min_clip, max_clip)).alias("column_clipped"),
])
┌───────────┬───────────────────┬───────────────────┬───────────────────┐
│ column ┆ min_clip ┆ max_clip ┆ column_clipped│
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ i32 ┆ i64 │
╞═══════════╪═══════════════════╪═══════════════════╪═══════════════════╡
│ 0 ┆ 0.2 ┆ 20 ┆ 0 <- the min clipped value here is wrong, it should be 0.2
│ 1 ┆ 0.2 ┆ 20 ┆ 1 │
│ 2 ┆ 0.2 ┆ 20 ┆ 2 │
│ 3 ┆ 0.2 ┆ 20 ┆ 3 │
│ … ┆ … ┆ … ┆ … │
│ 96 ┆ 0.2 ┆ 20 ┆ 20 │
│ 97 ┆ 0.2 ┆ 20 ┆ 20 │
│ 98 ┆ 0.2 ┆ 20 ┆ 20 │
│ 99 ┆ 0.2 ┆ 20 ┆ 20 │
└───────────┴───────────────────┴───────────────────┴───────────────────┘
# Case 2 with a float32 polars DataFrame
df = pl.DataFrame({"column": np.arange(100, dtype=np.float32)})
┌───────────┬───────────────────┬───────────────────┬───────────────────┐
│ column ┆ min_clip ┆ max_clip ┆ column_clipped│
│ --- ┆ --- ┆ --- ┆ --- │
│ f32 ┆ f64 ┆ i32 ┆ f32 │
╞═══════════╪═══════════════════╪═══════════════════╪═══════════════════╡
│ 0.0 ┆ 0.2 ┆ 20 ┆ 0.2 <- it now works│
│ 1.0 ┆ 0.2 ┆ 20 ┆ 1.0 │
│ 2.0 ┆ 0.2 ┆ 20 ┆ 2.0 │
│ 3.0 ┆ 0.2 ┆ 20 ┆ 3.0 │
│ … ┆ … ┆ … ┆ … │
│ 96.0 ┆ 0.2 ┆ 20 ┆ 20.0 │
│ 97.0 ┆ 0.2 ┆ 20 ┆ 20.0 │
│ 98.0 ┆ 0.2 ┆ 20 ┆ 20.0 │
│ 99.0 ┆ 0.2 ┆ 20 ┆ 20.0 │
└───────────┴───────────────────┴───────────────────┴───────────────────┘
The error must be happening because of various casting of ints and floats and adding a .cast(pl.Float32)
seems to fix the issue. But in my actual use case, this operation was hidden within a more convoluted DataFrame operation so it was harder to track the origin of my bug. Maybe having polars throw a warning here would help? Something about boundaries not respected because of type mismatch?
Thanks for the work on polars!
Could you fix the test case? Currently doesn't run ;)
Oops sorry, it now works :)
Not sure what the right way to go is here. We could have clip
always return a float, but that feels inefficient...
I guess, either autocasting to the higher of the precisions or throwing a type mismatch error would be two legitimate options. I'm worried now that a simple warning would be too quiet and I cannot think of too many use-cases where having values wrongly clipped is ok?
I would assume x.clip(0.2, 20)
clips all integers below 0.2
and above 20
.
So I would assume 0
gets clipped up to 1
in an integer column.
Just got caught by surprise with this one today.
Similar issue, no warning with Series.scatter(), scattering float values into an int column.
I guess we should strict cast the inputs to the type of the column, rather than regular cast. That would raise an error with bad inputs and require you to cast the column manually first.
EDIT: Actually, we should use supertypes here.
EDIT2: Actually, we should not upcast the original column...