Non-deterministic behaviour when using `is_null()` in LazyFrame

Open dcferreira opened this issue 4 months ago • 5 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Unfortunately I couldn't get a reproducible example without my data (though I tried quite hard!), but I am willing to spend some time on this if someone has an idea of how to get one.

df is a LazyFrame read from a delta table.

import polars as pl

# df is a LazyFrame read from a delta table (loading code not shown)
ids_set = {
    '02d0927b-77ea-400f-8adc-22474e45d6d5',
    '03094785-91c7-4e98-9072-3336ff67c222',
    '031d9dfb-38c2-4229-92b3-d397b6d0313b',
    '033d8ebd-07ca-467a-8833-f5c23138746b',
    '0347c17f-dd73-43fc-969d-2e46b6406dea'
}
df_filtered = df.filter(pl.col("label_id").is_in(ids_set))
tmp = df_filtered.select('label_id', 'code', pl.col("code").is_not_null().alias("code_not_null"))

print('"original" dataframe')
print(tmp.collect())
print()

print("shapes")
for _ in range(10):
    tmp1 = tmp.collect().filter(pl.col("code").is_not_null())  # 1. collect, then filter
    tmp2 = tmp.filter(pl.col("code").is_not_null()).collect()  # 2. filter, then collect
    tmp3 = tmp.filter(pl.col("code_not_null")).collect()       # 3. filter on the boolean column, then collect

    print(tmp1.shape, tmp2.shape, tmp3.shape)

Outputs:

"original" dataframe
shape: (5, 3)
┌───────────────────────────────────┬───────┬───────────────┐
│ label_id                          ┆ code  ┆ code_not_null │
│ ---                               ┆ ---   ┆ ---           │
│ str                               ┆ str   ┆ bool          │
╞═══════════════════════════════════╪═══════╪═══════════════╡
│ 033d8ebd-07ca-467a-8833-f5c23138… ┆ A-P-2 ┆ true          │
│ 0347c17f-dd73-43fc-969d-2e46b640… ┆ null  ┆ false         │
│ 02d0927b-77ea-400f-8adc-22474e45… ┆ null  ┆ false         │
│ 031d9dfb-38c2-4229-92b3-d397b6d0… ┆ null  ┆ false         │
│ 03094785-91c7-4e98-9072-3336ff67… ┆ null  ┆ false         │
└───────────────────────────────────┴───────┴───────────────┘

shapes
(1, 3) (1, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)
(1, 3) (1, 3) (1, 3)
(1, 3) (5, 3) (1, 3)

Log output

No response

Issue description

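Filtering a LazyFrame by pl.col().is_null() or pl.col().is_not_null() before collecting gives me a wrong result non-deterministically: on some runs the filter appears not to be applied at all, and all rows are returned.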

I really tried to get a completely reproducible example, but did not succeed. Here's what I tried:

  • make a small dataframe with some nulls -> convert to lazy -> run the code above thousands of times
  • save the dataframe to delta -> load with pl.scan_delta -> run the code above, also thousands of times

In both these cases, the results were consistently correct.

However, for the example in my data, something is clearly wrong.

Expected behavior

In the code snippet above, I filter a LazyFrame on the null values in a column and print the shape of the output. I do that in 3 different ways:

  1. collect and then filter
  2. filter and then collect
  3. filter on a boolean column that represents the same filter, and then collect.

I expected all 3 of these to give exactly the same result. However, the filtering in nr 2 only works sometimes: in the output above it intermittently returns all 5 rows instead of 1.

Installed versions

--------Version info---------
Polars:               0.20.10
Index type:           UInt32
Platform:             Linux-5.15.0-1043-aws-x86_64-with-glibc2.31
Python:               3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            0.15.3
fsspec:               2024.2.0
gevent:               24.2.1
hvplot:               <not installed>
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.0.1
pyarrow:              15.0.0
pydantic:             1.10.14
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.51
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

dcferreira, Feb 19 '24 21:02