Polars lit, scalar error with over clause

Open jesusestevez opened this issue 1 year ago • 1 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

df.with_columns(pl.concat_list(pl.col("a"))
                .over(pl.lit(""), mapping_strategy="join")
                .list.eval(pl.element().count()).alias("b"))

Log output

No response

Issue description

I would like to count elements over a placeholder grouping column that does not exist in the DataFrame. As of version 1.4.1, the following code did this:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})

df.with_columns(pl.concat_list(pl.col("a"))
                .over(pl.lit(""), mapping_strategy="join")
                .list.eval(pl.element().count()).alias("b"))

In version 1.9.0, the same code raises the following error, even when using pl.lit("").first():

InvalidOperationError: Series b, length 1 doesn't match the DataFrame height of 3

If you want expression: col("a").list.concat().over([String()]).eval() to be broadcasted, ensure it is a scalar (for instance by adding '.first()').

With mapping_strategy="group_to_rows", the expression does not fail, but it produces the wrong output.

Expected behavior

The output should be:

shape: (3, 2)
┌─────┬───────────┐
│ a   ┆ b         │
│ --- ┆ ---       │
│ i64 ┆ list[u32] │
╞═════╪═══════════╡
│ 1   ┆ [3]       │
│ 2   ┆ [3]       │
│ 3   ┆ [3]       │
└─────┴───────────┘

Installed versions

Replace this line with the output of pl.show_versions(). Leave the backticks in place.

jesusestevez, Oct 15 '24 14:10

In general, we currently do not handle broadcasting of scalar lists correctly: we do not properly distinguish between a scalar List expression and a Series expression. A Series expression inside an over context should match the length of the group, whereas a scalar expression should broadcast to each element in the group. For example, this is correct:

>>> df = pl.DataFrame({"x": [1, 2, 3], "g": [1, 1, 2]})
>>> df.select(pl.col.x.reverse().over("g"))
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
│ 1   │
│ 3   │
└─────┘
>>> df.select(pl.col.x.first().over("g"))
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 1   │
│ 3   │
└─────┘

However, this is incorrect:

>>> df.select(pl.col.x.implode().over("g"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/orlp/.localpython/lib/python3.11/site-packages/polars/dataframe/frame.py", line 9010, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/orlp/.localpython/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2050, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: the length of the window expression did not match that of the group
> group: (1)
> group length: 1
> output: 'shape: (1,)
Series: '' [list[i64]]
[
        [1, 2]
]'

Error originated in expression: 'col("x").list().over([col("g")])'

This should just broadcast, as Expr.implode() is a scalar expression returning a list. This should result in:

┌───────────┐
│ x         │
│ --------- │
│ list[i64] │
╞═══════════╡
│ [1, 2]    │
│ [1, 2]    │
│ [3]       │
└───────────┘
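
To make the intended rule concrete, here is a minimal pure-Python sketch of the semantics described above. It is a toy model, not Polars code: the names Scalar and window_apply are hypothetical and exist only for this illustration. A per-group result is either a series, which must match the group length, or a value explicitly marked as scalar, which is broadcast to every row of the group even when that value is itself a list.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Scalar:
    """Hypothetical marker: the per-group result is ONE value (possibly a list)."""
    value: Any


def window_apply(values, keys, fn):
    """Toy model of `expr.over(keys)`: apply `fn` per group, map results back to rows."""
    # Collect the row positions of each group, in first-seen order.
    groups = {}
    for i, key in enumerate(keys):
        groups.setdefault(key, []).append(i)

    out = [None] * len(values)
    for idx in groups.values():
        result = fn([values[i] for i in idx])
        if isinstance(result, Scalar):
            # Scalar result: broadcast the same value to every row of the group.
            for i in idx:
                out[i] = result.value
        else:
            # Series result: its length must match the group length.
            if len(result) != len(idx):
                raise ValueError("window result length does not match group length")
            for i, v in zip(idx, result):
                out[i] = v
    return out


x = [1, 2, 3]
g = [1, 1, 2]

reversed_x = window_apply(x, g, lambda grp: grp[::-1])          # series per group
first_x = window_apply(x, g, lambda grp: Scalar(grp[0]))        # scalar per group
imploded_x = window_apply(x, g, lambda grp: Scalar(list(grp)))  # scalar list per group

# The original report: one constant group key, count wrapped in a list.
counted = window_apply(x, [""] * 3, lambda grp: Scalar([len(grp)]))
```

In this model, the imploded case yields [1, 2] for both rows of the first group and [3] for the last row, and the constant-key count yields [3] on every row. Without the explicit scalar marker, a one-element list would be indistinguishable from a length-1 series, which is the ambiguity behind the errors above.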

orlp, Oct 16 '24 09:10