polars
LazyFrame.with_context.filter fails to evaluate
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
It seems LazyFrame.with_context().filter() does not correctly filter both frames (see code below), and there is no alternative that avoids collecting the result.
This is mostly an issue because horizontal stacking is not supported for lazy frames (see https://github.com/pola-rs/polars/issues/2856).
Reproducible example
>>> a = pl.DataFrame({"a": [1, 2, 3, 4], "f": ["x", "x", None, "z"]}).lazy()
>>> b = pl.DataFrame({"b": [5, 6, 7, 8]}).lazy()
>>> query = a.with_context(b).select(pl.all()).filter(pl.col("f").is_not_null())
>>> query.collect()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/felix/Projects/polars/py-polars/polars/utils.py", line 310, in wrapper
return fn(*args, **kwargs)
File "/Users/felix/Projects/polars/py-polars/polars/internals/lazyframe/frame.py", line 1164, in collect
return pli.wrap_df(ldf.collect())
exceptions.ComputeError: Series shape: (3,)
Series: 'a' [i64]
[
1
2
4
] does not match the DataFrame height of 4
>>> print(query.describe_optimized_plan())
SELECT [col("a"), col("f"), col("b")] FROM
EXTERNAL_CONTEXT
DF ["a", "f"]; PROJECT 2/2 COLUMNS; SELECTION: "col(\"f\").is_not_null()"
Expected behavior
>>> a = pl.DataFrame({"a": [1, 2, 3, 4], "f": ["x", "x", None, "z"]})
>>> b = pl.DataFrame({"b": [5, 6, 7, 8]})
>>> pl.concat([a, b], how="horizontal").filter(pl.col("f").is_not_null())
shape: (3, 3)
┌─────┬─────┬─────┐
│ a ┆ f ┆ b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ x ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ x ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ z ┆ 8 │
└─────┴─────┴─────┘
Installed versions
>>> pl.show_versions()
---Version info---
Polars: 0.15.2
Index type: UInt32
Platform: macOS-11.7.1-x86_64-i386-64bit
Python: 3.10.8 (main, Oct 13 2022, 10:18:28) [Clang 13.0.0 (clang-1300.0.29.30)]
---Optional dependencies---
pyarrow: 11.0.0.dev203
pandas: 1.5.2
numpy: 1.24.0rc1
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: 0.8
matplotlib: <not installed>
This is not how with_context is meant to work. You need to filter both LazyFrames.
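To illustrate the point about filtering both frames, here is a toy model in plain Python with no Polars involved. The dict-of-lists "frames" and the `filter_frame` helper are hypothetical, and the sketch assumes, as the optimized plan and the error message in the report suggest, that the predicate reached only the frame holding `f`:

```python
# Toy model: each "frame" is a dict of equal-length column lists.
a = {"a": [1, 2, 3, 4], "f": ["x", "x", None, "z"]}
b = {"b": [5, 6, 7, 8]}


def filter_frame(frame, mask):
    """Keep only the rows where mask is True (hypothetical helper)."""
    return {
        name: [value for value, keep in zip(column, mask) if keep]
        for name, column in frame.items()
    }


# Row mask equivalent to pl.col("f").is_not_null()
mask = [value is not None for value in a["f"]]

# Predicate applied to `a` only: the column heights no longer agree.
a_only = filter_frame(a, mask)
heights_mismatch = {name: len(col) for name, col in {**a_only, **b}.items()}

# Same mask applied to both frames: the heights agree again.
both = {**filter_frame(a, mask), **filter_frame(b, mask)}

print(heights_mismatch)
print(both)
```

In this model the first result has `a` and `f` at three rows and `b` at four, which matches the shape complaint in the traceback; applying the mask to both frames restores a consistent height.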
Do you have an example of how I could achieve the same result as the non-lazy code without using with_context but still using LazyFrames?
If you don't have a real key between the two frames you could establish a transient common column...
a = a.with_row_count(name="idx")
b = b.with_row_count(name="idx")
...join on that, and then discard:
c = a.join(b, on="idx").drop(columns=["idx"])
c.filter(pl.col("f").is_not_null()).collect()
# ┌─────┬─────┬─────┐
# │ a ┆ f ┆ b │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1 ┆ x ┆ 5 │
# │ 2 ┆ x ┆ 6 │
# │ 4 ┆ z ┆ 8 │
# └─────┴─────┴─────┘
(If any earlier operation has potentially non-deterministic ordering, you will need a 'real' key, though, because a row-count index only pairs rows by position.)
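The ordering caveat can be sketched in plain Python. This is a toy model, not Polars code: the keys `k1`..`k3` and the shuffled order are made up for illustration. It compares pairing rows by position (which is what a row-count index amounts to) with pairing by a real key:

```python
# Each row is (key, value); the keys are hypothetical and exist only in this sketch.
rows_a = [("k1", 1), ("k2", 2), ("k3", 3)]
rows_b = [("k1", 5), ("k2", 6), ("k3", 7)]

# Simulate an earlier step that hands back `b` in a different order.
shuffled_b = [rows_b[2], rows_b[0], rows_b[1]]

# Positional pairing: rows are matched by where they happen to sit.
by_position = [(a_val, b_val) for (_, a_val), (_, b_val) in zip(rows_a, shuffled_b)]

# Key-based pairing: rows are matched by key, so the order of `b` is irrelevant.
b_by_key = {key: b_val for key, b_val in shuffled_b}
by_key = [(a_val, b_by_key[key]) for key, a_val in rows_a]

print(by_position)
print(by_key)
```

Positional pairing silently combines unrelated rows once the order changes, while the key-based pairing still returns the intended pairs.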