polars icon indicating copy to clipboard operation
polars copied to clipboard

map_groups has inconsistent behavior between lazy and eager. It should also accept expressions.

Open deanm0000 opened this issue 6 months ago • 1 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df=pl.DataFrame({
    'a':[1,1,2,2,3,3],
    'b':range(6)
})
df.group_by(pl.col('a')).map_groups(lambda df: df)
# TypeError: cannot call `map_groups` when grouping by an expression
df.lazy().group_by(pl.col('a')).map_groups(lambda df: df, schema=df.schema).collect()
# seems to work
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 4   │
│ 3   ┆ 5   │
│ 1   ┆ 0   │
│ 1   ┆ 1   │
│ 2   ┆ 2   │
│ 2   ┆ 3   │
└─────┴─────┘

But the gotcha is if the expression in the group_by makes a change then those changes aren't visible to the function such as

print(df.lazy().group_by(pl.col('a')*2).map_groups(lambda df: df, schema=df.schema).collect())
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 2   │
│ 2   ┆ 3   │
│ 1   ┆ 0   │
│ 1   ┆ 1   │
│ 3   ┆ 4   │
│ 3   ┆ 5   │
└─────┴─────┘

Log output

No response

Issue description

group_by with agg just works, of course

Expected behavior

We can make map_groups work by having it do a with_columns before dispatching self.lgb.map_groups so essentially have it implicitly convert the above into:

(
    df.lazy()
    .with_columns(pl.col('a')*2)
    .group_by('a')
    .map_groups(lambda df: df, schema=df.schema)
    .collect()
    )
shape: (6, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 4   ┆ 2   │
│ 4   ┆ 3   │
│ 6   ┆ 4   │
│ 6   ┆ 5   │
│ 2   ┆ 0   │
│ 2   ┆ 1   │
└─────┴─────┘

Installed versions

--------Version info---------
Polars:               0.20.3
Index type:           UInt32
Platform:             Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python:               3.11.7 (main, Dec 19 2023, 20:42:30) [GCC 10.2.1 20210110]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.1
numpy:                1.26.1
openpyxl:             3.1.2
pandas:               2.1.2
pyarrow:              14.0.1
pydantic:             2.5.2
pyiceberg:            <not installed>
pyxlsb:               1.0.10
sqlalchemy:           2.0.23
xlsx2csv:             0.8.1
xlsxwriter:           3.1.9

deanm0000 avatar Jan 04 '24 17:01 deanm0000

It seems like the eager version is implemented differently, and limited to string tye only in the grouping:

pub fn group_by_map_groups(
        &self,
        by: Vec<&str>,
        lambda: PyObject,
        maintain_order: bool,
    ) -> PyResult<Self>

barak1412 avatar Jan 05 '24 09:01 barak1412