iceberg-python

table.scan(row_filter="x IN (0, 1)") does not include the values for which x=0 when x is a DoubleType and a partition column

Open ypsah opened this issue 8 months ago • 3 comments

Apache Iceberg version

0.9.0 (latest release)

Please describe the bug 🐞

Hi, thanks for writing pyiceberg.

The bug is pretty much described in the title: table.scan(row_filter="x IN (0, 1)") does not include the values for which x=0 when x is a DoubleType and a partition column.

Here is a reproducer:

pip install "pyiceberg[sql-sqlite,pyarrow]"

from pathlib import Path
from tempfile import TemporaryDirectory

import pyarrow
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.transforms import IdentityTransform
from pyiceberg.types import DoubleType, NestedField
from pyiceberg.partitioning import PartitionSpec, PartitionField

schema = Schema(
    NestedField(field_id=1, name="x", field_type=DoubleType()),
    NestedField(field_id=2, name="y", field_type=DoubleType()),
)
partition_spec = PartitionSpec(PartitionField(source_id=1, field_id=1001, transform=IdentityTransform(), name="x"))

with TemporaryDirectory() as tmpdir:
    catalog = SqlCatalog(
        "local",
        uri=f"sqlite:///{tmpdir}/catalog.db",
        warehouse=f"file://{tmpdir}/warehouse",
    )
    catalog.create_namespace("test")
    table = catalog.create_table(
        "test.test", schema=schema, partition_spec=partition_spec
    )

    data = pyarrow.table(
        {
            "x": [0.0, 1.0, 2.0],
            "y": [0.0, 0.0, 0.0],
        }
    )
    table.overwrite(data)

    print("=== no filter ===")
    print(table.scan().to_arrow())
    print("=== x IN (0) ===")
    print(table.scan(row_filter="x IN (0)").to_arrow())
    print("=== x IN (0, 1, 2) ===")
    print(table.scan(row_filter="x IN (0, 1, 2)").to_arrow())

Output:

/tmp/tmp.l2MLQFjC7C-05duO9h5/lib/python3.13/site-packages/pyiceberg/table/__init__.py:686: UserWarning: Delete operation did not match any records
  warnings.warn("Delete operation did not match any records")
=== no filter ===
pyarrow.Table
x: double
y: double
----
x: [[0],[1],[2]]
y: [[0],[0],[0]]
=== x IN (0) ===
pyarrow.Table
x: double
y: double
----
x: [[0]]
y: [[0]]
=== x IN (0, 1, 2) ===
pyarrow.Table
x: double
y: double
----
x: [[1],[2]]
y: [[0],[0]]

I expect the output for x IN (0, 1, 2) to match that of the unfiltered scan.

Note that I could not reproduce when x is a LongType instead of a DoubleType.
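
The expectation above can be written as a small check. This is a minimal sketch in plain Python, not part of pyiceberg: the helper name dropped_values is hypothetical, and the lists are copied by hand from the output above rather than read from a table.

```python
def dropped_values(unfiltered, wanted, filtered):
    """Return values that are in the data and in the IN list,
    but are absent from the filtered scan result."""
    expected = [value for value in unfiltered if value in wanted]
    return [value for value in expected if value not in filtered]


# Column x as printed by the unfiltered scan and by the "x IN (0, 1, 2)" scan.
unfiltered_x = [0.0, 1.0, 2.0]
filtered_x = [1.0, 2.0]

missing = dropped_values(unfiltered_x, (0, 1, 2), filtered_x)
print(missing)  # [0.0]
```

Against a real table, the two lists could be obtained with table.scan().to_arrow()["x"].to_pylist() and the same call on the filtered scan.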

Willingness to contribute

  • [ ] I can contribute a fix for this bug independently
  • [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • [ ] I cannot contribute a fix for this bug at this time

ypsah avatar Apr 19 '25 16:04 ypsah

I did a little digging and, just to be safe, also tested table.scan(row_filter=In("x", [0.0, 1.0, 2.0])), which shows the same issue. I do, however, believe that this happens when the filter is pushed down to the Parquet reader, because as far as I can tell iceberg builds the correct pyarrow schema inside of _task_to_record_batches.

I believe this is the case because when I print each batch in batches = fragment_scanner.to_batches() I see the following output:

Empty DataFrame
Columns: [x, y]
Index: []
     x    y
0  1.0  0.0
     x    y
0  2.0  0.0

Note that we never see a batch with x = 0.0, which suggests that the fragment_scanner = ds.Scanner.from_fragment call, which pushes the filter down to Arrow, is the likely culprit.

jayceslesar avatar Apr 20 '25 02:04 jayceslesar

You are correct. Upon closer inspection, the root cause appears to be in pyarrow: https://github.com/apache/arrow/issues/46183.
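
Until that is resolved upstream, one possible stopgap is to avoid the IN predicate and spell the filter out as equalities joined by OR. This is only a sketch under an assumption: the single-value filter in the reproducer did return the x=0 row, which suggests equality comparisons may be unaffected, but the OR form has not been verified against the affected pyarrow versions. The helper name in_as_or_filter is hypothetical and handles numeric values only.

```python
def in_as_or_filter(column, values):
    """Build "col = v1 OR col = v2 ..." as a row_filter string.

    Numeric values only; string literals would need quoting.
    """
    if not values:
        raise ValueError("need at least one value")
    return " OR ".join(f"{column} = {value!r}" for value in values)


row_filter = in_as_or_filter("x", [0.0, 1.0, 2.0])
print(row_filter)  # x = 0.0 OR x = 1.0 OR x = 2.0
```

The resulting string would be passed as table.scan(row_filter=row_filter). If that also drops rows, scanning without a filter and filtering in memory remains the fallback.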

Thanks

ypsah avatar Apr 20 '25 16:04 ypsah

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Nov 12 '25 00:11 github-actions[bot]