polars
polars copied to clipboard
map_groups has inconsistent behavior between lazy and eager. It should also accept expressions.
Checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
df=pl.DataFrame({
'a':[1,1,2,2,3,3],
'b':range(6)
})
df.group_by(pl.col('a')).map_groups(lambda df: df)
# TypeError: cannot call `map_groups` when grouping by an expression
df.lazy().group_by(pl.col('a')).map_groups(lambda df: df, schema=df.schema).collect()
# seems to work
shape: (6, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3 ┆ 4 │
│ 3 ┆ 5 │
│ 1 ┆ 0 │
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 2 ┆ 3 │
└─────┴─────┘
But the gotcha is if the expression in the group_by makes a change then those changes aren't visible to the function such as
print(df.lazy().group_by(pl.col('a')*2).map_groups(lambda df: df, schema=df.schema).collect())
shape: (6, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2 ┆ 2 │
│ 2 ┆ 3 │
│ 1 ┆ 0 │
│ 1 ┆ 1 │
│ 3 ┆ 4 │
│ 3 ┆ 5 │
└─────┴─────┘
Log output
No response
Issue description
group_by with agg just works, of course
Expected behavior
We can make map_groups
work by having it do a with_columns
before dispatching self.lgb.map_groups
so essentially have it implicitly convert the above into:
(
df.lazy()
.with_columns(pl.col('a')*2)
.group_by('a')
.map_groups(lambda df: df, schema=df.schema)
.collect()
)
shape: (6, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 4 ┆ 2 │
│ 4 ┆ 3 │
│ 6 ┆ 4 │
│ 6 ┆ 5 │
│ 2 ┆ 0 │
│ 2 ┆ 1 │
└─────┴─────┘
Installed versions
--------Version info---------
Polars: 0.20.3
Index type: UInt32
Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python: 3.11.7 (main, Dec 19 2023, 20:42:30) [GCC 10.2.1 20210110]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: 3.0.0
connectorx: 0.3.2
deltalake: <not installed>
fsspec: 2023.10.0
gevent: <not installed>
hvplot: <not installed>
matplotlib: 3.8.1
numpy: 1.26.1
openpyxl: 3.1.2
pandas: 2.1.2
pyarrow: 14.0.1
pydantic: 2.5.2
pyiceberg: <not installed>
pyxlsb: 1.0.10
sqlalchemy: 2.0.23
xlsx2csv: 0.8.1
xlsxwriter: 3.1.9
It seems like the eager version is implemented differently, and limited to string tye only in the grouping:
pub fn group_by_map_groups(
&self,
by: Vec<&str>,
lambda: PyObject,
maintain_order: bool,
) -> PyResult<Self>