Unexpected integer casts in LazyFrame leading to overflows

Open mishpat opened this issue 1 year ago • 0 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

When taking a cumulative sum of a boolean series over groups and then re-aggregating by the resulting counter, integer casting behaves inconsistently and produces what look like overflows. In particular, even when the expression works correctly on its own, subtracting 0 from it produces overflowed group keys in the later groupby, yet setting "maintain_order" on the groupby avoids the overflow even in that case.

I've seen a number of issues related to integer casting, but none that appear specific to this problem. Older versions of Polars would raise an error on my boolean cumsum, and it is a common enough operation in NumPy that I expect many users will try it.

Reproducible example

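import polars as pl
import numpy as np

np.random.seed(1)
# 12686 cases with 24 rows each; the first half of the rows has race 0, the second half race 1
df = pl.DataFrame({"caseid": np.repeat(np.arange(12686), 24),
                   "race": [0] * 12686 * 12 + [1] * 12686 * 12,
                   "total_earnings": np.random.random(12686 * 24)}).lazy()
# Both columns count rows within each caseid; they differ only by the trailing "- 0"
df = df.with_columns(
    t_since_entry=(~pl.col("caseid").is_null()).cumsum().over("caseid") - 0,
    t_since_entry2=(~pl.col("caseid").is_null()).cumsum().over("caseid"),
)
# out0: group by the "- 0" column (shows the overflow)
out0 = df.groupby(["race", "t_since_entry"]).agg([pl.mean("total_earnings")])
# out1: same column, but with maintain_order=True (no overflow)
out1 = df.groupby(["race", "t_since_entry"], maintain_order=True).agg([pl.mean("total_earnings")])
# out2: group by the column without "- 0" (no overflow)
out2 = df.groupby(["race", "t_since_entry2"]).agg([pl.mean("total_earnings")])

print(out0.collect())
print(out1.collect())
print(out2.collect())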

shape: (1, 1)
┌───────┐
│ count │
│ ---   │
│ u32   │
╞═══════╡
│ 24    │
└───────┘
shape: (48, 3)
┌──────┬───────────────┬────────────────┐
│ race ┆ t_since_entry ┆ total_earnings │
│ ---  ┆ ---           ┆ ---            │
│ i64  ┆ i64           ┆ f64            │
╞══════╪═══════════════╪════════════════╡
│ 1    ┆ 46983545354   ┆ 0.495998       │
│ 0    ┆ 46983545098   ┆ 0.495829       │
│ 0    ┆ 46983549450   ┆ 0.504092       │
│ 1    ┆ 46983549706   ┆ 0.500767       │
│ …    ┆ …             ┆ …              │
│ 1    ┆ 46983546890   ┆ 0.497791       │
│ 1    ┆ 46983546634   ┆ 0.494184       │
│ 1    ┆ 46983544842   ┆ 0.501515       │
│ 1    ┆ 46983548682   ┆ 0.49342        │
└──────┴───────────────┴────────────────┘
shape: (48, 3)
┌──────┬───────────────┬────────────────┐
│ race ┆ t_since_entry ┆ total_earnings │
│ ---  ┆ ---           ┆ ---            │
│ i64  ┆ u32           ┆ f64            │
╞══════╪═══════════════╪════════════════╡
│ 0    ┆ 1             ┆ 0.498934       │
│ 0    ┆ 2             ┆ 0.504765       │
│ 0    ┆ 3             ┆ 0.502967       │
│ 0    ┆ 4             ┆ 0.495539       │
│ …    ┆ …             ┆ …              │
│ 1    ┆ 21            ┆ 0.498946       │
│ 1    ┆ 22            ┆ 0.498956       │
│ 1    ┆ 23            ┆ 0.49342        │
│ 1    ┆ 24            ┆ 0.501664       │
└──────┴───────────────┴────────────────┘
shape: (48, 3)
┌──────┬────────────────┬────────────────┐
│ race ┆ t_since_entry2 ┆ total_earnings │
│ ---  ┆ ---            ┆ ---            │
│ i64  ┆ u32            ┆ f64            │
╞══════╪════════════════╪════════════════╡
│ 1    ┆ 5              ┆ 0.495998       │
│ 0    ┆ 10             ┆ 0.495829       │
│ 0    ┆ 20             ┆ 0.504092       │
│ 1    ┆ 17             ┆ 0.500767       │
│ …    ┆ …              ┆ …              │
│ 1    ┆ 19             ┆ 0.497791       │
│ 1    ┆ 4              ┆ 0.494184       │
│ 1    ┆ 15             ┆ 0.501515       │
│ 1    ┆ 23             ┆ 0.49342        │
└──────┴────────────────┴────────────────┘

Expected behavior

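The final two output tables (out1 and out2) are correct. The out0 table should contain the same small counter values in t_since_entry instead of the overflowed ones.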
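For reference, the intended computation can be sketched in plain Python without Polars. This is only an illustration of the expected result, not how Polars evaluates the query: the function name expected_group_means and the small stand-in data are hypothetical, chosen to mirror the layout of the reproducible example (equal-length cases, first half of the rows with race 0, second half with race 1).

```python
from collections import defaultdict
from statistics import mean


def expected_group_means(caseids, races, earnings):
    """Running row counter per caseid, then mean earnings per (race, counter)."""
    seen = defaultdict(int)      # rows seen so far for each caseid
    buckets = defaultdict(list)  # (race, counter) -> list of earnings
    for caseid, race, value in zip(caseids, races, earnings):
        seen[caseid] += 1        # equivalent to a cumulative sum of True over caseid
        buckets[(race, seen[caseid])].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Hypothetical stand-in data: 4 cases with 3 rows each instead of 12686 x 24
caseids = [c for c in range(4) for _ in range(3)]
races = [0] * 6 + [1] * 6
earnings = [float(i) for i in range(12)]

result = expected_group_means(caseids, races, earnings)
for key in sorted(result):
    print(key, result[key])
```

With the data from the report, the counter would run from 1 to 24 within each case, which gives 2 x 24 = 48 groups and matches the shape of the two correct tables.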

Installed versions

--------Version info---------
Polars:      0.17.12
Index type:  UInt32
Platform:    Windows-10-10.0.19041-SP0
Python:      3.8.6 | packaged by conda-forge | (default, Dec 26 2020, 04:30:06) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
numpy:       1.20.3
pandas:      1.2.0
pyarrow:     8.0.0
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      2023.1.0
matplotlib:  3.3.3
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>

mishpat May 07 '23 05:05