LazyFrame.with_context.filter fails to evaluate

Open fsimkovic opened this issue 2 years ago • 3 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Issue description

It seems that LazyFrame.with_context().filter() applies the filter only to the primary frame and not to the context frame (see the code below), and I could not find an alternative that avoids collecting the result first.

This is mostly an issue because horizontal stacking is not supported for lazy frames (see https://github.com/pola-rs/polars/issues/2856).

Reproducible example

>>> import polars as pl
>>> a = pl.DataFrame({"a": [1, 2, 3, 4], "f": ["x", "x", None, "z"]}).lazy()
>>> b = pl.DataFrame({"b": [5, 6, 7, 8]}).lazy()
>>> query = a.with_context(b).select(pl.all()).filter(pl.col("f").is_not_null())
>>> query.collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/felix/Projects/polars/py-polars/polars/utils.py", line 310, in wrapper
    return fn(*args, **kwargs)
  File "/Users/felix/Projects/polars/py-polars/polars/internals/lazyframe/frame.py", line 1164, in collect
    return pli.wrap_df(ldf.collect())
exceptions.ComputeError: Series shape: (3,)
Series: 'a' [i64]
[
        1
        2
        4
] does not match the DataFrame height of 4
>>> print(query.describe_optimized_plan())
   SELECT [col("a"), col("f"), col("b")] FROM
    EXTERNAL_CONTEXT
      DF ["a", "f"]; PROJECT 2/2 COLUMNS; SELECTION: "col(\"f\").is_not_null()"

Expected behavior

>>> import polars as pl
>>> a = pl.DataFrame({"a": [1, 2, 3, 4], "f": ["x", "x", None, "z"]})
>>> b = pl.DataFrame({"b": [5, 6, 7, 8]})
>>> pl.concat([a, b], how="horizontal").filter(pl.col("f").is_not_null())
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ f   ┆ b   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ x   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ x   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ z   ┆ 8   │
└─────┴─────┴─────┘

Installed versions

>>> pl.show_versions()
---Version info---
Polars: 0.15.2
Index type: UInt32
Platform: macOS-11.7.1-x86_64-i386-64bit
Python: 3.10.8 (main, Oct 13 2022, 10:18:28) [Clang 13.0.0 (clang-1300.0.29.30)]
---Optional dependencies---
pyarrow: 11.0.0.dev203
pandas: 1.5.2
numpy: 1.24.0rc1
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: 0.8
matplotlib: <not installed>

fsimkovic, Dec 05 '22 21:12

This is not how with_context is meant to work. You need to filter both LazyFrames.

ritchie46, Dec 06 '22 07:12

Do you have an example of how I could achieve the same result as the non-lazy code without using with_context but still using LazyFrames?

fsimkovic, Dec 06 '22 09:12

If you don't have a real key between the two frames you could establish a transient common column...

# Add a temporary row-number column to each LazyFrame.
a = a.with_row_count(name="idx")
b = b.with_row_count(name="idx")

...join on that, and then discard:

# Join on the temporary column, then drop it before filtering.
c = a.join(b, on="idx").drop("idx")
c.filter(pl.col("f").is_not_null()).collect()

# ┌─────┬─────┬─────┐
# │ a   ┆ f   ┆ b   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ x   ┆ 5   │
# │ 2   ┆ x   ┆ 6   │
# │ 4   ┆ z   ┆ 8   │
# └─────┴─────┴─────┘

(If any earlier operation has potentially non-deterministic ordering, you will need a 'real' key, though.)

alexander-beedie, Dec 06 '22 09:12