table.scan(row_filter="x IN (0, 1)") does not include the values for which x=0 when x is a DoubleType and a partition column
Apache Iceberg version
0.9.0 (latest release)
Please describe the bug 🐞
Hi, thanks for writing pyiceberg.
The bug is pretty much described in the title: table.scan(row_filter="x IN (0, 1)") does not include the values for which x=0 when x is a DoubleType and a partition column.
Here is a reproducer:
pip install pyiceberg[sql-sqlite,pyarrow]
from pathlib import Path
from tempfile import TemporaryDirectory
import pyarrow
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.transforms import IdentityTransform
from pyiceberg.types import DoubleType, NestedField
from pyiceberg.partitioning import PartitionSpec, PartitionField
schema = Schema(
    NestedField(field_id=1, name="x", field_type=DoubleType()),
    NestedField(field_id=2, name="y", field_type=DoubleType()),
)
partition_spec = PartitionSpec(PartitionField(source_id=1, field_id=1001, transform=IdentityTransform(), name="x"))

with TemporaryDirectory() as tmpdir:
    catalog = SqlCatalog(
        "local",
        uri=f"sqlite:///{tmpdir}/catalog.db",
        warehouse=f"file://{tmpdir}/warehouse",
    )
    catalog.create_namespace("test")
    table = catalog.create_table(
        "test.test", schema=schema, partition_spec=partition_spec
    )

    data = pyarrow.table(
        {
            "x": [0.0, 1.0, 2.0],
            "y": [0.0, 0.0, 0.0],
        }
    )
    table.overwrite(data)

    print("=== no filter ===")
    print(table.scan().to_arrow())
    print("=== x IN (0) ===")
    print(table.scan(row_filter="x IN (0)").to_arrow())
    print("=== x IN (0, 1, 2) ===")
    print(table.scan(row_filter="x IN (0, 1, 2)").to_arrow())
Output:
/tmp/tmp.l2MLQFjC7C-05duO9h5/lib/python3.13/site-packages/pyiceberg/table/__init__.py:686: UserWarning: Delete operation did not match any records
warnings.warn("Delete operation did not match any records")
=== no filter ===
pyarrow.Table
x: double
y: double
----
x: [[0],[1],[2]]
y: [[0],[0],[0]]
=== x IN (0) ===
pyarrow.Table
x: double
y: double
----
x: [[0]]
y: [[0]]
=== x IN (0, 1, 2) ===
pyarrow.Table
x: double
y: double
----
x: [[1],[2]]
y: [[0],[0]]
I expect the output for x IN (0, 1, 2) to match that of the no-filter scan.
Note that I could not reproduce this when x is a LongType instead of a DoubleType.
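For reference, the LongType variant I checked looks roughly like this (only the relevant lines of the script above change; the exact integer data is an assumption on my part):

from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, LongType, NestedField

# Same reproducer, but with x declared as a LongType and integer data;
# with this schema the IN (0, 1, 2) scan returned all three rows for me.
schema = Schema(
    NestedField(field_id=1, name="x", field_type=LongType()),
    NestedField(field_id=2, name="y", field_type=DoubleType()),
)
data = pyarrow.table({"x": [0, 1, 2], "y": [0.0, 0.0, 0.0]})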
Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
I did a little digging and, just to be safe, also tested table.scan(row_filter=In("x", [0.0, 1.0, 2.0])), which hits the same issue. I do, however, believe this is happening when the filter is pushed down to the Parquet read, since as far as I can tell pyiceberg builds the correct pyarrow schema inside _task_to_record_batches.
I believe this is the case because, when I print each batch from batches = fragment_scanner.to_batches(), I see the following output:
Empty DataFrame
Columns: [x, y]
Index: []
x y
0 1.0 0.0
x y
0 2.0 0.0
Note that we never see a value of 0.0 for x, which suggests that the fragment_scanner = ds.Scanner.from_fragment call, which pushes the filter down to arrow, is the likely culprit.
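To isolate that path outside of pyiceberg, something along these lines should exercise the same pushdown (a rough sketch of my own; the guarantee-based setup, file layout, and schema are my assumptions about how the identity partition reaches arrow, not code taken from pyiceberg):

from tempfile import TemporaryDirectory

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pyarrow import fs

with TemporaryDirectory() as tmpdir:
    # One data file whose partition value is x = 0.0, standing in for the
    # identity-partitioned file pyiceberg writes for the reproducer.
    path = f"{tmpdir}/x=0.parquet"
    pq.write_table(pa.table({"x": [0.0], "y": [0.0]}), path)

    # Attach the partition value as a fragment guarantee, then scan through
    # the same Scanner.from_fragment entry point with the IN filter.
    fragment = ds.ParquetFileFormat().make_fragment(
        path,
        filesystem=fs.LocalFileSystem(),
        partition_expression=(pc.field("x") == 0.0),
    )
    scanner = ds.Scanner.from_fragment(
        fragment,
        schema=pa.schema([("x", pa.float64()), ("y", pa.float64())]),
        filter=pc.field("x").isin([0.0, 1.0, 2.0]),
    )
    # One row is expected; an empty table here would point at the arrow-side
    # simplification of the filter against the double-typed guarantee.
    print(scanner.to_table())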
You are correct. Upon closer inspection, the root cause appears to be in pyarrow: https://github.com/apache/arrow/issues/46183.
Thanks