Shift fails only on group_by when lazyframe is empty with SchemaMismatch Exception

Open amirajina opened this issue 2 years ago • 1 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
df = pl.LazyFrame(
    {
        'a': [1, 1, 1, 2, 3],
        'b': [True, True, True, True, True],
        'c': [None, None, None, None, None],
    }
)

# Works: the frame still has rows.
df1 = df.group_by(by=['a'], maintain_order=True).agg(pl.col('b').where(pl.col('b').shift(1))).collect()

# Every row has a null in 'c', so this leaves an empty LazyFrame.
df = df.drop_nulls()

# Panics with SchemaMismatch on the empty frame.
df2 = df.group_by(by=['a'], maintain_order=True).agg(pl.col('b').where(pl.col('b').shift(1))).collect()

Log output

thread '<unnamed>' panicked at crates\polars-lazy\src\physical_plan\expressions\group_iter.rs:46:37:
called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("invalid series dtype: expected `List`, got `bool`"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "D:/DEV/DEV-AMIR/I-MAD/test.py", line 43, in <module>
    df = df.group_by(by=['a'], maintain_order=True).agg(pl.col('b').where(pl.col('b').shift(1))).collect()
  File "D:\DEV\DEV-AMIR\venvs\venv38\lib\site-packages\polars\lazyframe\frame.py", line 1706, in collect
    return wrap_df(ldf.collect())
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: SchemaMismatch(ErrString("invalid series dtype: expected `List`, got `bool`"))

Issue description

  1. Before df.drop_nulls(), the group_by works correctly:

df1 = df.group_by(by=['a'], maintain_order=True).agg(pl.col('b').where(pl.col('b').shift(1))).collect()

final df =
shape: (3, 2)
┌─────┬──────────────┐
│ a   ┆ b            │
│ --- ┆ ---          │
│ i64 ┆ list[bool]   │
╞═════╪══════════════╡
│ 1   ┆ [true, true] │
│ 2   ┆ []           │
│ 3   ┆ []           │
└─────┴──────────────┘

  2. After df.drop_nulls() the frame is empty, and the same group_by with shift fails:

df2 = df.group_by(by=['a'], maintain_order=True).agg(pl.col('b').where(pl.col('b').shift(1))).collect()
==> SchemaMismatch(ErrString("invalid series dtype: expected `List`, got `bool`"))

  3. Applying shift(1) to an empty LazyFrame (that is, to an empty column) through select instead of group_by works correctly:

df = pl.LazyFrame({"foo": [None]})
df = df.drop_nulls()
df = df.select(pl.col("foo").where(pl.col("foo").is_not_null().shift(1)))
print(df.collect())

┌──────┐
│ foo  │
│ ---  │
│ null │
╞══════╡
└──────┘

Expected behavior

Shift inside group_by(...).agg(...) should behave as it does with select or with_columns: when the column is empty, it should detect that and return the column unchanged instead of panicking.

Installed versions

Name: polars-lts-cpu
Version: 0.20.2
Summary: Blazingly fast DataFrame library
Home-page:
Author:
Author-email: Ritchie Vink <[email protected]>
License:
Location: d:\dev\dev-amir\venvs\venv38\lib\site-packages
Requires:
Required-by:

amirajina, Jan 10 '24 17:01

Looks like it also happens with a DataFrame:

df = pl.DataFrame(schema=dict(a=int, b=bool))
df.group_by("a").agg(pl.col("b").filter(pl.col("b").shift()))

# PanicException: called `Result::unwrap()` on an `Err` value: 
#    SchemaMismatch(ErrString("invalid series dtype: expected `List`, got `bool`"))

cmdlineluser, Jan 10 '24 18:01