groupby for list and struct type columns

Open peterlietz opened this issue 2 years ago • 5 comments

Thank you for this absolutely wonderful library!

I'm afraid I hit a snag. What I tried to do was to group by a nested data type, as in:

import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [[1, 3, 4], [2, 4, 6], [17]]})
df.groupby("b").agg(pl.sum("a"))  # "b" is a list column; this call panics

This results in a "not implemented" panic.

I'm curious as to whether this is simply not implemented yet or whether this would contradict the underlying philosophy of polars.

Best regards,
Peter

peterlietz, Jul 29 '22 09:07

We do not support grouping by a column of type list. I think we should improve the error message on that.

ritchie46, Jul 29 '22 09:07

Thank you very much for the quick answer!

peterlietz, Jul 29 '22 09:07

> We do not support grouping by a column of type list. I think we should improve the error message on that.

@ritchie46 which error do you think this should raise? DataTypeMisMatch?

pepelovesvim, Jul 29 '22 20:07

I think a ComputeError would be most consistent.

For structs we could temporarily unnest -> do the groupby -> and nest again.

ritchie46, Jul 29 '22 21:07

Just in case anybody else stumbles upon this, the workaround I am now using is to convert the list column to "str". Not ideal, but it does the trick.

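import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [[1, 3, 4], [2, 4, 6], [17]]})
# Cast each list element to a string and join with "|", so "b" becomes a plain string key.
df = df.with_column(pl.col("b").arr.eval(pl.element().cast(pl.Utf8)).arr.join("|"))
df.groupby("b").agg(pl.sum("a"))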

peterlietz, Jul 30 '22 16:07