shift inside group_by().agg() panics with SchemaMismatch when the LazyFrame is empty
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl

df = pl.LazyFrame(
    {
        'a': [1, 1, 1, 2, 3],
        'b': [True, True, True, True, True],
        'c': [None, None, None, None, None],
    }
)
# Works: the frame still has 5 rows.
df1 = df.group_by(by=['a'], maintain_order=True).agg(pl.col('b').where(pl.col('b').shift(1))).collect()
# Every row has a null in 'c', so drop_nulls() leaves an empty frame.
df = df.drop_nulls()
# Panics with SchemaMismatch on the empty frame.
df2 = df.group_by(by=['a'], maintain_order=True).agg(pl.col('b').where(pl.col('b').shift(1))).collect()
Log output
thread '<unnamed>' panicked at crates\polars-lazy\src\physical_plan\expressions\group_iter.rs:46:37:
called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("invalid series dtype: expected `List`, got `bool`"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "D:/DEV/DEV-AMIR/I-MAD/test.py", line 43, in <module>
df = df.group_by(by=['a'], maintain_order=True).agg(pl.col('b').where(pl.col('b').shift(1))).collect()
File "D:\DEV\DEV-AMIR\venvs\venv38\lib\site-packages\polars\lazyframe\frame.py", line 1706, in collect
return wrap_df(ldf.collect())
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("invalid series dtype: expected `List`, got `bool`"))
Issue description
- Before df.drop_nulls(), the group_by works correctly:
df1 = df.group_by(by=['a'], maintain_order=True).agg(pl.col('b').where(pl.col('b').shift(1))).collect()
shape: (3, 2)
┌─────┬──────────────┐
│ a   ┆ b            │
│ --- ┆ ---          │
│ i64 ┆ list[bool]   │
╞═════╪══════════════╡
│ 1   ┆ [true, true] │
│ 2   ┆ []           │
│ 3   ┆ []           │
└─────┴──────────────┘
- After df.drop_nulls() the frame is empty, and the same group_by with shift fails:
df2 = df.group_by(by=['a'], maintain_order=True).agg(pl.col('b').where(pl.col('b').shift(1))).collect()
==> SchemaMismatch(ErrString("invalid series dtype: expected `List`, got `bool`"))
- Applying shift(1) to an empty LazyFrame (that is, to an empty column) through select instead of group_by works correctly:
df = pl.LazyFrame({"foo": [None]})
df = df.drop_nulls()
df = df.select(pl.col("foo").where(pl.col("foo").is_not_null().shift(1)))
print(df.collect())
┌──────┐
│ foo  │
│ ---  │
│ null │
╞══════╡
└──────┘
Expected behavior
Just as shift on an empty column works in select or with_columns, shift inside group_by().agg() should detect that the column is empty and return it unchanged, so the aggregation yields an empty result instead of panicking.
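To make the expectation concrete, here is a plain-Python model of the per-group "shift, then filter" semantics. It is only a sketch of the behaviour I would expect, not how Polars implements it; the helper names (shift, filter_by_shifted, group_agg) are made up for illustration.

```python
from typing import Dict, List, Optional


def shift(values: List[Optional[bool]], n: int = 1) -> List[Optional[bool]]:
    """Move values forward by n slots (n >= 0), padding the front with None.

    The length is preserved, so an empty list stays empty.
    """
    n = min(max(n, 0), len(values))
    return [None] * n + values[: len(values) - n]


def filter_by_shifted(values: List[Optional[bool]]) -> List[Optional[bool]]:
    """Keep each value whose shifted counterpart is True (None drops the row)."""
    mask = shift(values, 1)
    return [v for v, keep in zip(values, mask) if keep is True]


def group_agg(a: List[int], b: List[Optional[bool]]) -> Dict[int, List[Optional[bool]]]:
    """Group b by a (first-seen order), then apply shift + filter per group."""
    groups: Dict[int, List[Optional[bool]]] = {}
    for key, value in zip(a, b):
        groups.setdefault(key, []).append(value)
    return {key: filter_by_shifted(vals) for key, vals in groups.items()}


# Non-empty input, same data as the reproducible example.
result = group_agg([1, 1, 1, 2, 3], [True, True, True, True, True])
print(result)  # {1: [True, True], 2: [], 3: []}

# Empty input: there are no groups, so the result is simply empty.
empty = group_agg([], [])
print(empty)  # {}
```

In this model the empty case needs no special handling: with no rows there are no groups, so the aggregation returns an empty result rather than raising.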
Installed versions
Name: polars-lts-cpu
Version: 0.20.2
Summary: Blazingly fast DataFrame library
Home-page:
Author:
Author-email: Ritchie Vink <[email protected]>
License:
Location: d:\dev\dev-amir\venvs\venv38\lib\site-packages
Requires:
Required-by:
Looks like it also happens with a DataFrame:
df = pl.DataFrame(schema=dict(a=int, b=bool))
df.group_by("a").agg(pl.col("b").filter(pl.col("b").shift()))
# PanicException: called `Result::unwrap()` on an `Err` value:
# SchemaMismatch(ErrString("invalid series dtype: expected `List`, got `bool`"))