Unexpected integer casts in LazyFrame leading to overflows
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
When taking a cumulative sum of a boolean expression over groups and then aggregating by the resulting counter, integer casting behaves inconsistently and produces what look like overflows. In particular, even when the expression works correctly on its own, subtracting 0 from it produces overflowed values in the groupby result, yet passing "maintain_order=True" to groupby prevents the overflow even in that case.
I've seen a number of issues related to integer casting, but none specific to this problem as far as I can tell. Older versions of Polars would raise an error on my boolean cumsum, and it's a common enough operation in NumPy that I would expect many users to try it.
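For context, the NumPy pattern referred to above is a cumulative sum over a boolean mask, which gives a running count of the True values. A minimal sketch on a hypothetical four-element mask (the array and the variable names are made up for illustration and are not part of the reproduction):

```python
import numpy as np

# Hypothetical mask, standing in for `~is_null()` applied to a column.
mask = np.array([True, True, False, True])

# Summing booleans promotes them to integers, so the result is a
# running count of the True entries seen so far.
counter = np.cumsum(mask)
print(counter.tolist())  # [1, 2, 2, 3]
```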
Reproducible example
import polars as pl
import numpy as np

np.random.seed(1)

# 12686 cases with 24 rows each; the first half of the rows has race 0, the second half race 1.
df = pl.DataFrame(
    {
        "caseid": np.repeat(np.arange(12686), 24),
        "race": [0] * 12686 * 12 + [1] * 12686 * 12,
        "total_earnings": np.random.random(12686 * 24),
    }
).lazy()

# Running count of non-null rows within each caseid; the two columns differ only by the "- 0".
df = df.with_columns(
    t_since_entry=(~pl.col("caseid").is_null()).cumsum().over("caseid") - 0,
    t_since_entry2=(~pl.col("caseid").is_null()).cumsum().over("caseid"),
)

out0 = df.groupby(["race", "t_since_entry"]).agg([pl.mean("total_earnings")])
out1 = df.groupby(["race", "t_since_entry"], maintain_order=True).agg([pl.mean("total_earnings")])
out2 = df.groupby(["race", "t_since_entry2"]).agg([pl.mean("total_earnings")])

print(out0.collect())  # t_since_entry is i64 with overflowed values
print(out1.collect())  # t_since_entry is u32 with values 1..24
print(out2.collect())  # t_since_entry2 is u32 with values 1..24
shape: (1, 1)
┌───────┐
│ count │
│ ---   │
│ u32   │
╞═══════╡
│ 24    │
└───────┘
shape: (48, 3)
┌──────┬───────────────┬────────────────┐
│ race ┆ t_since_entry ┆ total_earnings │
│ ---  ┆ ---           ┆ ---            │
│ i64  ┆ i64           ┆ f64            │
╞══════╪═══════════════╪════════════════╡
│ 1    ┆ 46983545354   ┆ 0.495998       │
│ 0    ┆ 46983545098   ┆ 0.495829       │
│ 0    ┆ 46983549450   ┆ 0.504092       │
│ 1    ┆ 46983549706   ┆ 0.500767       │
│ …    ┆ …             ┆ …              │
│ 1    ┆ 46983546890   ┆ 0.497791       │
│ 1    ┆ 46983546634   ┆ 0.494184       │
│ 1    ┆ 46983544842   ┆ 0.501515       │
│ 1    ┆ 46983548682   ┆ 0.49342        │
└──────┴───────────────┴────────────────┘
shape: (48, 3)
┌──────┬───────────────┬────────────────┐
│ race ┆ t_since_entry ┆ total_earnings │
│ ---  ┆ ---           ┆ ---            │
│ i64  ┆ u32           ┆ f64            │
╞══════╪═══════════════╪════════════════╡
│ 0    ┆ 1             ┆ 0.498934       │
│ 0    ┆ 2             ┆ 0.504765       │
│ 0    ┆ 3             ┆ 0.502967       │
│ 0    ┆ 4             ┆ 0.495539       │
│ …    ┆ …             ┆ …              │
│ 1    ┆ 21            ┆ 0.498946       │
│ 1    ┆ 22            ┆ 0.498956       │
│ 1    ┆ 23            ┆ 0.49342        │
│ 1    ┆ 24            ┆ 0.501664       │
└──────┴───────────────┴────────────────┘
shape: (48, 3)
┌──────┬────────────────┬────────────────┐
│ race ┆ t_since_entry2 ┆ total_earnings │
│ ---  ┆ ---            ┆ ---            │
│ i64  ┆ u32            ┆ f64            │
╞══════╪════════════════╪════════════════╡
│ 1    ┆ 5              ┆ 0.495998       │
│ 0    ┆ 10             ┆ 0.495829       │
│ 0    ┆ 20             ┆ 0.504092       │
│ 1    ┆ 17             ┆ 0.500767       │
│ …    ┆ …              ┆ …              │
│ 1    ┆ 19             ┆ 0.497791       │
│ 1    ┆ 4              ┆ 0.494184       │
│ 1    ┆ 15             ┆ 0.501515       │
│ 1    ┆ 23             ┆ 0.49342        │
└──────┴────────────────┴────────────────┘
Expected behavior
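The final two output tables (out1 and out2) are correct. out0 should contain the same t_since_entry values, 1 through 24, rather than i64 values such as 46983545354.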
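To make the expected values concrete, here is a plain-Python reference sketch of what the counter and the grouped means should look like. It uses a hypothetical miniature dataset (2 cases with 3 rows each, race omitted for brevity) and does not use Polars. Because caseid has no nulls, the boolean cumsum is just a running row count within each caseid.

```python
from collections import defaultdict

# Hypothetical miniature of the data: 2 cases with 3 rows each.
caseids = [0, 0, 0, 1, 1, 1]
earnings = [1.0, 2.0, 3.0, 3.0, 4.0, 5.0]

# Running row count within each caseid: the values the
# cumsum().over("caseid") column is expected to hold.
seen = defaultdict(int)
t_since_entry = []
for cid in caseids:
    seen[cid] += 1
    t_since_entry.append(seen[cid])

# Mean earnings per counter value, mirroring the groupby/agg step.
groups = defaultdict(list)
for t, e in zip(t_since_entry, earnings):
    groups[t].append(e)
means = {t: sum(v) / len(v) for t, v in groups.items()}

print(t_since_entry)  # [1, 2, 3, 1, 2, 3]
print(means)          # {1: 2.0, 2: 3.0, 3: 4.0}
```

With 24 rows per caseid, as in the reproduction, the counter should therefore only ever take the values 1 through 24.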
Installed versions
--------Version info---------
Polars: 0.17.12
Index type: UInt32
Platform: Windows-10-10.0.19041-SP0
Python: 3.8.6 | packaged by conda-forge | (default, Dec 26 2020, 04:30:06) [MSC v.1916 64 bit (AMD64)]
----Optional dependencies----
numpy: 1.20.3
pandas: 1.2.0
pyarrow: 8.0.0
connectorx: <not installed>
deltalake: <not installed>
fsspec: 2023.1.0
matplotlib: 3.3.3
xlsx2csv: <not installed>
xlsxwriter: <not installed>