polars
Non-deterministic behaviour when using `is_null()` in LazyFrame
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Unfortunately I couldn't get a reproducible example without my data (though I tried quite hard!), but I am willing to spend some time on this if someone has an idea of how to get one.
`df` is a LazyFrame read from a Delta table.
import polars as pl

ids_set = {
    '02d0927b-77ea-400f-8adc-22474e45d6d5',
    '03094785-91c7-4e98-9072-3336ff67c222',
    '031d9dfb-38c2-4229-92b3-d397b6d0313b',
    '033d8ebd-07ca-467a-8833-f5c23138746b',
    '0347c17f-dd73-43fc-969d-2e46b6406dea'
}
df_filtered = df.filter(pl.col("label_id").is_in(ids_set))
tmp = df_filtered.select('label_id', 'code', pl.col("code").is_not_null().alias("code_not_null"))
print('"original" dataframe')
print(tmp.collect())
print()
print("shapes")
for _ in range(10):
    # 1: collect first, then filter eagerly
    tmp1 = tmp.collect().filter(pl.col("code").is_not_null())
    # 2: filter lazily on the column itself, then collect
    tmp2 = tmp.filter(pl.col("code").is_not_null()).collect()
    # 3: filter lazily on the precomputed boolean column, then collect
    tmp3 = tmp.filter(pl.col("code_not_null")).collect()
    print(tmp1.shape, tmp2.shape, tmp3.shape)
Outputs:
"original" dataframe
shape: (5, 3)
┌───────────────────────────────────┬───────┬───────────────┐
│ label_id                          ┆ code  ┆ code_not_null │
│ ---                               ┆ ---   ┆ ---           │
│ str                               ┆ str   ┆ bool          │
╞═══════════════════════════════════╪═══════╪═══════════════╡
│ 033d8ebd-07ca-467a-8833-f5c23138… ┆ A-P-2 ┆ true          │
│ 0347c17f-dd73-43fc-969d-2e46b640… ┆ null  ┆ false         │
│ 02d0927b-77ea-400f-8adc-22474e45… ┆ null  ┆ false         │
│ 031d9dfb-38c2-4229-92b3-d397b6d0… ┆ null  ┆ false         │
│ 03094785-91c7-4e98-9072-3336ff67… ┆ null  ┆ false         │
└───────────────────────────────────┴───────┴───────────────┘
shapes
(1, 3) (1, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)
Log output
No response
Issue description
Filtering by `pl.col().is_null()` or `pl.col().is_not_null()` before collecting gives me a non-deterministic wrong result.
I really tried to get a completely reproducible example, but did not succeed. Here's what I tried:
- make a small dataframe with some nulls -> convert to lazy -> run the code above 1000s of times
- save the dataframe to delta -> load with `pl.scan_delta` -> run the code above also 1000s of times
In both these cases, the results were consistently correct.
However, for the example in my data, something is clearly wrong.
Expected behavior
In the code snippet above, I'm filtering a lazyframe by null values in a column, and printing out the shape of the output. I'm doing that in 3 different ways:
- collect and then filter
- filter and then collect
- filter on a boolean column that represents the same filter, and then collect.
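I expected all 3 of these to give the exact same result. However, the second one (filter, then collect) only works sometimes: in the runs above it intermittently returns all 5 rows, (5, 3), instead of the single non-null row, (1, 3).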
Installed versions
--------Version info---------
Polars: 0.20.10
Index type: UInt32
Platform: Linux-5.15.0-1043-aws-x86_64-with-glibc2.31
Python: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: 3.0.0
connectorx: <not installed>
deltalake: 0.15.3
fsspec: 2024.2.0
gevent: 24.2.1
hvplot: <not installed>
matplotlib: 3.8.3
numpy: 1.26.4
openpyxl: <not installed>
pandas: 2.0.1
pyarrow: 15.0.0
pydantic: 1.10.14
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 1.4.51
xlsx2csv: <not installed>
xlsxwriter: <not installed>