polars
Potentially a bug with combination of concat_list, fill_null and groupby in LazyFrame
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
Running the query below with scan_parquet and with read_parquet gives different results. The LazyFrame ends up with sums of 0 after the groupby, while the DataFrame produces the expected sums. The cause seems to be the "fill_null(0)" applied after pl.concat_list.
Reproducible example
import polars as pl

(
    pl
    .scan_parquet(source="data.parquet")
    .select(
        [
            pl.col("Column_A"),
            pl.col("Column_B"),
            pl.concat_list(
                [
                    "Variable_1",
                    "Variable_2",
                ]
            )
            .arr.sum().fill_null(0)
            .alias("Variable_SUM"),
        ]
    )
    .groupby(["Column_A", "Column_B"]).agg([
        pl.col("Variable_SUM").sum().alias("New Sum")
    ])
    .collect()
)
Expected behavior
The lazy query should return the same correct results as eager mode.
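To make the expected result concrete, here is a minimal plain-Python sketch of what the query is meant to compute. It is not Polars code. The rows are made up, since data.parquet is not attached, and the helper names `row_sum` and `grouped_sum` are hypothetical. It assumes nulls inside a row are skipped when summing and that a row whose sum is null counts as 0, which is the intent of fill_null(0).

```python
from collections import defaultdict

# Hypothetical rows standing in for data.parquet:
# (Column_A, Column_B, Variable_1, Variable_2)
ROWS = [
    ("Yes", "No", 1, 2),
    ("Yes", "No", None, 4),
    ("No", None, None, None),
    (None, "Yes", 5, None),
]


def row_sum(values):
    """Sum the non-null values of one row; an all-null row counts as 0 (assumed fill_null(0) intent)."""
    present = [v for v in values if v is not None]
    return sum(present) if present else 0


def grouped_sum(rows):
    """Group by (Column_A, Column_B), with None as an ordinary key, and add up the row sums."""
    totals = defaultdict(int)
    for col_a, col_b, *variables in rows:
        totals[(col_a, col_b)] += row_sum(variables)
    return dict(totals)


expected = grouped_sum(ROWS)
```

Under these assumptions, a group's "New Sum" is 0 only when every variable value in the group is null, so lazy and eager runs should both match this reference.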
Installed versions
---Version info---
Polars: 0.17.8
Index type: UInt32
Platform: Windows-10-10.0.22621-SP0
Python: 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]
---Optional dependencies---
numpy: 1.23.3
pandas: 2.0.1
pyarrow: 10.0.1
connectorx: 0.3.1
deltalake: <not installed>
fsspec: <not installed>
matplotlib: <not installed>
xlsx2csv: 0.8.1
xlsxwriter: 3.0.9
Can you remove all columns and code that don't influence the bug?
I cleaned it up a bit and kept only two variable columns, since the issue still persists with them. My findings:
- If I remove the "fill_null(0)" after pl.concat_list, it works in the LazyFrame as well
- If I group by only one column, say "Column_A", it works as well
- In the LazyFrame, the "Variable_SUM" column is Int64 prior to the groupby, while in the DataFrame it is UInt32
- If I change the groupby sum to a count, it works
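The dtype finding above can be checked mechanically by diffing the two schemas. The sketch below is a small, hypothetical helper (`diff_schemas` is not a Polars function) that works on any name-to-dtype mapping. The dtypes are written as plain strings here. Only the "Variable_SUM" entries come from the finding above; the other entries are assumed. In practice the mappings might come from the `schema` attribute of the LazyFrame and the DataFrame.

```python
def diff_schemas(lazy_schema, eager_schema):
    """Return {column: (lazy_dtype, eager_dtype)} for columns whose dtypes differ."""
    # Keep lazy column order, then append columns that only exist in the eager schema.
    columns = list(lazy_schema) + [c for c in eager_schema if c not in lazy_schema]
    return {
        name: (lazy_schema.get(name), eager_schema.get(name))
        for name in columns
        if lazy_schema.get(name) != eager_schema.get(name)
    }


# "Variable_SUM" dtypes are from the report; the key column dtypes are assumed.
lazy_schema = {"Column_A": "Utf8", "Column_B": "Utf8", "Variable_SUM": "Int64"}
eager_schema = {"Column_A": "Utf8", "Column_B": "Utf8", "Variable_SUM": "UInt32"}

mismatches = diff_schemas(lazy_schema, eager_schema)
```

An empty result would mean the lazy plan and the eager frame agree on every column type before the groupby.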
The difference between lazy and eager seems to be fixed on main. Now both eager and lazy have 0 sums.
Lazy
shape: (9, 3)
┌──────────┬──────────┬─────────┐
│ Column_A ┆ Column_B ┆ New Sum │
│ ---      ┆ ---      ┆ ---     │
│ str      ┆ str      ┆ u32     │
╞══════════╪══════════╪═════════╡
│ null     ┆ null     ┆ 8       │
│ null     ┆ No       ┆ 18      │
│ null     ┆ Yes      ┆ 0       │
│ No       ┆ null     ┆ 0       │
│ No       ┆ No       ┆ 2624    │
│ No       ┆ Yes      ┆ 1113    │
│ Yes      ┆ null     ┆ 0       │
│ Yes      ┆ No       ┆ 1004    │
│ Yes      ┆ Yes      ┆ 5215    │
└──────────┴──────────┴─────────┘
Eager
shape: (9, 3)
┌──────────┬──────────┬─────────┐
│ Column_A ┆ Column_B ┆ New Sum │
│ ---      ┆ ---      ┆ ---     │
│ str      ┆ str      ┆ u32     │
╞══════════╪══════════╪═════════╡
│ null     ┆ null     ┆ 8       │
│ null     ┆ No       ┆ 18      │
│ null     ┆ Yes      ┆ 0       │
│ No       ┆ null     ┆ 0       │
│ No       ┆ No       ┆ 2624    │
│ No       ┆ Yes      ┆ 1113    │
│ Yes      ┆ null     ┆ 0       │
│ Yes      ┆ No       ┆ 1004    │
│ Yes      ┆ Yes      ┆ 5215    │
└──────────┴──────────┴─────────┘