`.concat_list` with `.list` inside `.groupby` leads to results assigned to wrong groups

Open cmdlineluser opened this issue 2 years ago • 0 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

I'm not sure this kind of usage makes much sense, but while experimenting with various combinations of `.list()` and `pl.concat_list()` I observed the following:

Passing `pl.col().list()` to `pl.concat_list()` inside `.groupby().agg()` causes results to be assigned to the wrong groups.

Additionally, adding `.unique()` appears, in some cases, to cause 100% CPU usage and a large memory spike, and the call hangs.

Reproducible example

import polars as pl

df = pl.DataFrame({"group": [1, 2, 2, 3], "value": ["a", "b", "c", "d"]})

# Okay
df.groupby("group").agg(pl.concat_list(pl.col("value")))

"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | value           │
│ ---   | ---             │
│ i64   | list[list[str]] │
╞═══════╪═════════════════╡
│ 1     | [["a"]]         │
│ 2     | [["b"], ["c"]]  │
│ 3     | [["d"]]         │
└───────┴─────────────────┘
"""

# Adding `.list()`
# Not Okay - "a" and "d" are in the wrong groups
df.groupby("group").agg(pl.concat_list(pl.col("value").list()))

"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | value           │
│ ---   | ---             │
│ i64   | list[list[str]] │
╞═══════╪═════════════════╡
│ 3     | [["a"]]         │
│ 2     | [["b"], ["c"]]  │
│ 1     | [["d"]]         │
└───────┴─────────────────┘
"""

# Same expression, run again
# Not Okay - this time "c" and "d" are in the wrong groups
df.groupby("group").agg(pl.concat_list(pl.col("value").list()))

"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | value           │
│ ---   | ---             │
│ i64   | list[list[str]] │
╞═══════╪═════════════════╡
│ 1     | [["a"]]         │
│ 3     | [["c"]]         │
│ 2     | [["d"], ["b"]]  │
└───────┴─────────────────┘
"""

# Possibly related issue - I added in a .unique() call
df.groupby("group").agg(pl.concat_list(pl.col("value").unique().list().suffix("_agg")))

"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | value_agg       │
│ ---   | ---             │
│ i64   | list[list[str]] │
╞═══════╪═════════════════╡
│ 1     | [["a"]]         │
│ 2     | [["c"], ["b"]]  │
│ 3     | [["d"]]         │
└───────┴─────────────────┘
"""

# Using the same .agg on "group" instead of "value" 
# causes 100% CPU / large MEM% spike and does not return
df.groupby("group").agg(pl.concat_list(pl.col("group").unique().list().suffix("_agg")))


# If the groups are of length 1
# the CPU/MEM spike does not occur - returns result immediately
df = pl.DataFrame({"group": [1, 2, 3], "value": ["a", "b", "c"]})
df.groupby("group").agg(pl.concat_list(pl.col("group").unique().list().suffix("_agg")))

"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | group_agg       │
│ ---   | ---             │
│ i64   | list[list[i64]] │
╞═══════╪═════════════════╡
│ 2     | [[3]]           │
│ 3     | [[1]]           │
│ 1     | [[2]]           │
└───────┴─────────────────┘
"""

# The issue of values being assigned to the wrong groups remains
df.groupby("group").agg(pl.concat_list(pl.col("value").list()))

"""
shape: (3, 2)
┌───────┬─────────────────┐
│ group | value           │
│ ---   | ---             │
│ i64   | list[list[str]] │
╞═══════╪═════════════════╡
│ 3     | [["b"]]         │
│ 1     | [["c"]]         │
│ 2     | [["a"]]         │
└───────┴─────────────────┘
"""

Expected behavior

Values should be assigned to the correct groups.
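To make "correct groups" concrete, here is a minimal plain-Python sketch of the grouping I would expect for the example data, with a small check against one of the outputs above. The helper names `expected_groups` and `misassigned` are hypothetical and not part of Polars. The observed rows are copied by hand from the first "Not Okay" output.

```python
from collections import defaultdict

# Same data as the reproducible example above.
groups = [1, 2, 2, 3]
values = ["a", "b", "c", "d"]


def expected_groups(keys, vals):
    """Reference grouping: map each key to its values, each wrapped in a one-item list."""
    out = defaultdict(list)
    for k, v in zip(keys, vals):
        out[k].append([v])
    return dict(out)


def misassigned(rows, expected):
    """Return the group keys whose values differ from the reference (order within a group is ignored)."""
    return sorted(k for k, v in rows if sorted(v) != sorted(expected.get(k, [])))


expected = expected_groups(groups, values)

# (group, value) rows copied from the first "Not Okay" output above.
observed = [(3, [["a"]]), (2, [["b"], ["c"]]), (1, [["d"]])]
print(misassigned(observed, expected))  # [1, 3]
```

Order within a group is ignored here on purpose, so the check flags only values that end up under the wrong key, not differences in row order.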

Installed versions

---Version info---
Polars: 0.15.16
Index type: UInt32
Python: 3.10.9 (main, Jan 14 2023, 16:56:26)
---Optional dependencies---
pyarrow: 9.0.0
pandas: 1.5.3
numpy: 1.23.5
fsspec: 2022.8.2
connectorx: <not installed>
xlsx2csv: 0.8
deltalake: <not installed>
matplotlib: 3.5.1

cmdlineluser, Jan 24 '23 02:01