`.concat_list` with `.list` inside `.groupby` leads to results assigned to wrong groups
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
I'm not sure this type of usage makes much sense, but I was experimenting with various combinations of `.list()` and `pl.concat_list()` and observed the following:
Passing `pl.col().list()` to `pl.concat_list()` inside `.groupby().agg()` causes results to be assigned to the wrong groups.
Additionally, adding `.unique()` appears to cause a 100% CPU and memory spike in some cases, and the call hangs.
Reproducible example
import polars as pl
df = pl.DataFrame({"group": [1, 2, 2, 3], "value": ["a", "b", "c", "d"]})
# Okay
df.groupby("group").agg(pl.concat_list(pl.col("value")))
"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | value           │
│ ---   | ---             │
│ i64   | list[list[str]] │
╞═══════╪═════════════════╡
│ 1     | [["a"]]         │
│ 2     | [["b"], ["c"]]  │
│ 3     | [["d"]]         │
└───────┴─────────────────┘
"""
# Adding `.list()`
# Not Okay - "a" and "d" are in the wrong groups
df.groupby("group").agg(pl.concat_list(pl.col("value").list()))
"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | value           │
│ ---   | ---             │
│ i64   | list[list[str]] │
╞═══════╪═════════════════╡
│ 3     | [["a"]]         │
│ 2     | [["b"], ["c"]]  │
│ 1     | [["d"]]         │
└───────┴─────────────────┘
"""
# Not Okay - same expression run again, different wrong result: "c" and "d" are in the wrong groups
df.groupby("group").agg(pl.concat_list(pl.col("value").list()))
"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | value           │
│ ---   | ---             │
│ i64   | list[list[str]] │
╞═══════╪═════════════════╡
│ 1     | [["a"]]         │
│ 3     | [["c"]]         │
│ 2     | [["d"], ["b"]]  │
└───────┴─────────────────┘
"""
# Possibly related issue - I added in a .unique() call
df.groupby("group").agg(pl.concat_list(pl.col("value").unique().list().suffix("_agg")))
"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | value_agg       │
│ ---   | ---             │
│ i64   | list[list[str]] │
╞═══════╪═════════════════╡
│ 1     | [["a"]]         │
│ 2     | [["c"], ["b"]]  │
│ 3     | [["d"]]         │
└───────┴─────────────────┘
"""
# Using the same .agg on "group" instead of "value"
# causes 100% CPU / large MEM% spike and does not return
df.groupby("group").agg(pl.concat_list(pl.col("group").unique().list().suffix("_agg")))
# If every group has length 1, the CPU/MEM spike
# does not occur and the result is returned immediately
df = pl.DataFrame({"group": [1, 2, 3], "value": ["a", "b", "c"]})
df.groupby("group").agg(pl.concat_list(pl.col("group").unique().list().suffix("_agg")))
"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | group_agg       │
│ ---   | ---             │
│ i64   | list[list[i64]] │
╞═══════╪═════════════════╡
│ 2     | [[3]]           │
│ 3     | [[1]]           │
│ 1     | [[2]]           │
└───────┴─────────────────┘
"""
# The wrong-group assignment still occurs with this frame
df.groupby("group").agg(pl.concat_list(pl.col("value").list()))
"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | value           │
│ ---   | ---             │
│ i64   | list[list[str]] │
╞═══════╪═════════════════╡
│ 3     | [["b"]]         │
│ 1     | [["c"]]         │
│ 2     | [["a"]]         │
└───────┴─────────────────┘
"""
Expected behavior
Assign values to correct groups.
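To make "correct groups" concrete, here is a minimal pure-Python sketch (no Polars involved) of the mapping I would expect for the first frame, plus a check against one of the wrong results. The helper names `expected_groups` and `misassigned` are made up for illustration, and the `observed` rows are copied by hand from the first "Not Okay" output above.

```python
from collections import defaultdict


def expected_groups(groups, values):
    """Map each group key to the list of single-item lists it should hold."""
    out = defaultdict(list)
    for key, value in zip(groups, values):
        out[key].append([value])
    return dict(out)


def misassigned(expected, observed_rows):
    """Return the group keys whose observed values differ from the expected ones.

    Order inside a group is ignored, so only cross-group mix-ups are flagged.
    """
    return sorted(
        key
        for key, lists in observed_rows
        if sorted(lists) != sorted(expected.get(key, []))
    )


# Same data as the first DataFrame in the reproducible example.
expected = expected_groups([1, 2, 2, 3], ["a", "b", "c", "d"])

# Rows transcribed from the first "Not Okay" output (hypothetical variable name).
observed = [(3, [["a"]]), (2, [["b"], ["c"]]), (1, [["d"]])]

print(expected)
print(misassigned(expected, observed))
```

Because the comparison is keyed on the group value, the row order of the groupby result does not matter here; only the pairing of group to values does.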
Installed versions
---Version info---
Polars: 0.15.16
Index type: UInt32
Python: 3.10.9 (main, Jan 14 2023, 16:56:26)
---Optional dependencies---
pyarrow: 9.0.0
pandas: 1.5.3
numpy: 1.23.5
fsspec: 2022.8.2
connectorx: <not installed>
xlsx2csv: 0.8
deltalake: <not installed>
matplotlib: 3.5.1