Potentially a bug with combination of concat_list, fill_null and groupby in LazyFrame

Open miroslaavi opened this issue 2 years ago • 2 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

Running the query below on the attached file gives different results with scan_parquet and read_parquet. With the LazyFrame, the sums come out as 0 after the groupby, while the DataFrame gives the expected results. The issue seems to be the "fill_null(0)" in the pl.concat_list part.

data.zip

Reproducible example

import polars as pl

(
    pl
    .scan_parquet(source="data.parquet")
    .select(
        [
            pl.col("Column_A"),
            pl.col("Column_B"),
            pl.concat_list(
                [
                    "Variable_1",
                    "Variable_2",
                ]
            )
            .arr.sum().fill_null(0)
            .alias("Variable_SUM"),
        ]
    )
    .groupby(["Column_A", "Column_B"]).agg([
        pl.col("Variable_SUM").sum().alias("New Sum")
    ])
    .collect()
)

Expected behavior

The lazy query should return the same correct results as eager mode.
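
For reference, the aggregation the query is meant to compute can be sketched in plain Python, independent of Polars. This is only an illustration: the rows are made up (they are not taken from data.zip), the helper names `row_sum` and `grouped_sum` are hypothetical, and it assumes that nulls are skipped in the row-wise sum and that an all-null row counts as 0 after fill_null(0).

```python
from collections import defaultdict

# Hypothetical rows standing in for data.parquet; None plays the role of null.
rows = [
    {"Column_A": "Yes", "Column_B": "No", "Variable_1": 3, "Variable_2": 4},
    {"Column_A": "Yes", "Column_B": "No", "Variable_1": None, "Variable_2": 5},
    {"Column_A": "No", "Column_B": None, "Variable_1": None, "Variable_2": None},
    {"Column_A": None, "Column_B": "Yes", "Variable_1": 1, "Variable_2": None},
]


def row_sum(row, columns):
    """Row-wise sum over `columns`, skipping nulls (assumed); all-null rows give 0."""
    return sum(row[c] for c in columns if row[c] is not None)


def grouped_sum(rows, keys, columns):
    """Add up the row-wise sums per distinct combination of `keys` (null keys included)."""
    totals = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += row_sum(row, columns)
    return dict(totals)


expected = grouped_sum(rows, ["Column_A", "Column_B"], ["Variable_1", "Variable_2"])
```

A group should only end up with a sum of 0 here when every value in it is null, so under these assumptions the lazy and eager results would both be expected to match this kind of reference.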

Installed versions

---Version info---
Polars: 0.17.8
Index type: UInt32
Platform: Windows-10-10.0.22621-SP0
Python: 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]
---Optional dependencies---
numpy: 1.23.3
pandas: 2.0.1
pyarrow: 10.0.1
connectorx: 0.3.1
deltalake: <not installed>
fsspec: <not installed>
matplotlib: <not installed>
xlsx2csv: 0.8.1
xlsxwriter: 3.0.9

miroslaavi avatar Apr 25 '23 12:04 miroslaavi

Can you remove all columns and code that doesn't influence the bug?

ritchie46 avatar Apr 25 '23 12:04 ritchie46

I cleaned it up a bit and kept only two variable columns, as the issue still persists with them. My findings:

  • If I remove the "fill_null(0)" after the pl.concat_list, it works in LazyFrame as well
  • If I group by only one column, say "Column_A", it works as well
  • In LazyFrame the "Variable_SUM" column is Int64 prior to the groupby, while in DataFrame it is UInt32
  • If I change the groupby sum to count, it works

miroslaavi avatar Apr 25 '23 13:04 miroslaavi

The difference between lazy and eager seems to be fixed on main. Now both eager and lazy return the same output, including the 0 sums.

Lazy

shape: (9, 3)
┌──────────┬──────────┬─────────┐
│ Column_A ┆ Column_B ┆ New Sum │
│ ---      ┆ ---      ┆ ---     │
│ str      ┆ str      ┆ u32     │
╞══════════╪══════════╪═════════╡
│ null     ┆ null     ┆ 8       │
│ null     ┆ No       ┆ 18      │
│ null     ┆ Yes      ┆ 0       │
│ No       ┆ null     ┆ 0       │
│ No       ┆ No       ┆ 2624    │
│ No       ┆ Yes      ┆ 1113    │
│ Yes      ┆ null     ┆ 0       │
│ Yes      ┆ No       ┆ 1004    │
│ Yes      ┆ Yes      ┆ 5215    │
└──────────┴──────────┴─────────┘

Eager

shape: (9, 3)
┌──────────┬──────────┬─────────┐
│ Column_A ┆ Column_B ┆ New Sum │
│ ---      ┆ ---      ┆ ---     │
│ str      ┆ str      ┆ u32     │
╞══════════╪══════════╪═════════╡
│ null     ┆ null     ┆ 8       │
│ null     ┆ No       ┆ 18      │
│ null     ┆ Yes      ┆ 0       │
│ No       ┆ null     ┆ 0       │
│ No       ┆ No       ┆ 2624    │
│ No       ┆ Yes      ┆ 1113    │
│ Yes      ┆ null     ┆ 0       │
│ Yes      ┆ No       ┆ 1004    │
│ Yes      ┆ Yes      ┆ 5215    │
└──────────┴──────────┴─────────┘

Object905 avatar Jul 19 '24 17:07 Object905