iceberg-python

feat(expressions/visitor): add support for bool

Open HynekBlaha opened this issue 2 months ago • 2 comments

Feature Request / Improvement

I came across this corner case when trying to construct filters additively. pl.lit(True) was used as a base expression to which other filters were ANDed.


(
    pl.scan_iceberg(table, storage_options=storage_options)
    .filter(pl.lit(False))
    .explain()
)
>
python dataset: convert from arrow schema
expand_datasets(): python[IcebergDataset]: limit: None, project: all
IcebergDataset: to_dataset_scan(): snapshot ID: None, limit: None, projection: None, filter_columns: [], pyarrow_predicate: Some(<redacted>), iceberg_table_filter: Some(<redacted>), self._use_metadata_statistics: True
IcebergDataset: to_dataset_scan(): tbl.metadata.current_snapshot_id: 6267713928597070305
IcebergDataset: to_dataset_scan(): begin path expansion

File "python3.12/site-packages/pyiceberg/expressions/visitors.py", line 153, in visit
  raise NotImplementedError(f"Cannot visit unsupported expression: {obj}")
NotImplementedError: Cannot visit unsupported expression: False
(
    pl.scan_iceberg(table, storage_options=storage_options)
    .filter(pl.lit(True))
    .explain()
)
>
python dataset: convert from arrow schema
expand_datasets(): python[IcebergDataset]: limit: None, project: all
IcebergDataset: to_dataset_scan(): snapshot ID: None, limit: None, projection: None, filter_columns: [], pyarrow_predicate: Some(<redacted>), iceberg_table_filter: Some(<redacted>), self._use_metadata_statistics: True
IcebergDataset: to_dataset_scan(): tbl.metadata.current_snapshot_id: 6267713928597070305
IcebergDataset: to_dataset_scan(): begin path expansion

File "python3.12/site-packages/pyiceberg/expressions/visitors.py", line 153, in visit
  raise NotImplementedError(f"Cannot visit unsupported expression: {obj}")
NotImplementedError: Cannot visit unsupported expression: True
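
Until plain boolean literals are handled, one possible way to keep building filters additively is to skip the literal base expression altogether and fold the real predicates together, applying no filter when the list is empty. The sketch below is an illustration only: `combine_filters` is a hypothetical helper, not part of Polars or PyIceberg, and it works on any objects that support `&`.

```python
from functools import reduce
import operator
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def combine_filters(filters: Iterable[T]) -> Optional[T]:
    """AND together objects that support `&`; return None if there are none.

    Hypothetical helper: it replaces the `pl.lit(True) & f1 & f2 ...` pattern
    so that no bare boolean literal ends up in the final predicate.
    """
    items = list(filters)
    if not items:
        # Nothing to filter on: the caller should skip .filter() entirely.
        return None
    return reduce(operator.and_, items)


# Assumed usage with Polars (not executed here):
#   predicate = combine_filters([pl.col("id") == "id_001", pl.col("ts") >= start])
#   lf = pl.scan_iceberg(table)
#   if predicate is not None:
#       lf = lf.filter(predicate)
```

This does not address the underlying error; it only avoids producing `pl.lit(True)` or `pl.lit(False)` as a filter in the first place.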

Version

pl.show_versions()
--------Version info---------
Polars:              1.35.1
Index type:          UInt64
Platform:            macOS-15.7.1-arm64-arm-64bit
Python:              3.12.9 (main, Feb  4 2025, 14:38:38) [Clang 16.0.0 (clang-1600.0.26.6)]
Runtime:             rt64
----Optional dependencies----
Azure CLI            2.70.0
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       1.15.0
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.3.1
gevent               <not installed>
google.auth          2.23.4
great_tables         <not installed>
matplotlib           3.10.5
numpy                2.2.0
openpyxl             3.1.2
pandas               2.3.1
polars_cloud         <not installed>
pyarrow              21.0.0
pydantic             2.4.2
pyiceberg            0.10.0
sqlalchemy           2.0.23
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

HynekBlaha, Nov 03 '25 07:11

I see there are AlwaysTrue and AlwaysFalse visitors: https://github.com/apache/iceberg-python/blob/8878b2c260bf62ad16ef6f8e8d3a6ae93103eff7/pyiceberg/expressions/visitors.py#L156-L165

which I think should match the true/false literals.

Do you know how Polars converts the .filter expression internally?

kevinjqliu, Nov 03 '25 16:11

Hello @kevinjqliu, this might be a better reproducible example.

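from datetime import datetime, timezone

import polars as pl
import pyarrow as pa
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import HourTransform
from pyiceberg.types import NestedField, StringType, TimestamptzType

# `catalog` is assumed to be an already-loaded pyiceberg catalog.
catalog.create_namespace_if_not_exists("tmp")
table = catalog.create_table_if_not_exists(
    "tmp.dummy3",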
    schema=Schema(
        NestedField(field_id=1, name="ts", field_type=TimestamptzType(), required=False),
        NestedField(field_id=2, name="id", field_type=StringType(), required=False),
    ),
    partition_spec=PartitionSpec(
        PartitionField(source_id=1, field_id=1000, transform=HourTransform(), name="ts_hour"),
    ),
)

# Create sample data with different hours to demonstrate partitioning
sample_data = pa.Table.from_pydict({
    "ts": [
        datetime(2025, 1, 15, 10, 30, 0, tzinfo=timezone.utc),
        datetime(2025, 1, 15, 10, 45, 0, tzinfo=timezone.utc),
        datetime(2025, 1, 15, 11, 15, 0, tzinfo=timezone.utc),
        datetime(2025, 1, 15, 12, 20, 0, tzinfo=timezone.utc),
        datetime(2025, 1, 15, 12, 50, 0, tzinfo=timezone.utc),
    ],
    "id": ["id_001", "id_002", "id_003", "id_004", "id_005"],
})

table.append(sample_data)

pl.scan_iceberg(table).filter(pl.lit(False)).explain()
python dataset: convert from arrow schema
expand_datasets(): python[IcebergDataset]: limit: None, project: all
IcebergDataset: to_dataset_scan(): snapshot ID: None, limit: None, projection: None, filter_columns: [], pyarrow_predicate: Some(<redacted>), iceberg_table_filter: Some(<redacted>), self._use_metadata_statistics: True
IcebergDataset: to_dataset_scan(): tbl.metadata.current_snapshot_id: 784574589744330117
IcebergDataset: to_dataset_scan(): begin path expansion
Traceback (most recent call last):
  File "<redacted>/python3.12/site-packages/IPython/core/interactiveshell.py", line 3550, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-07bec56b4d3f>", line 1, in <module>
    pl.scan_iceberg(table).filter(pl.lit(False)).explain()
  File "<redacted>/python3.12/site-packages/polars/_utils/deprecation.py", line 97, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<redacted>/python3.12/site-packages/polars/lazyframe/opt_flags.py", line 328, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<redacted>/python3.12/site-packages/polars/lazyframe/frame.py", line 1384, in explain
    return ldf.describe_optimized_plan()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<redacted>/python3.12/site-packages/polars/io/iceberg/dataset.py", line 85, in to_dataset_scan
    scan_data := self._to_dataset_scan_impl(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<redacted>/python3.12/site-packages/polars/io/iceberg/dataset.py", line 254, in _to_dataset_scan_impl
    for i, file_info in enumerate(scan.plan_files()):
                                  ^^^^^^^^^^^^^^^^^
  File "<redacted>/python3.12/site-packages/pyiceberg/table/__init__.py", line 1924, in plan_files
    if manifest_evaluators[manifest_file.partition_spec_id](manifest_file)
       ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<redacted>/python3.12/site-packages/pyiceberg/typedef.py", line 72, in __missing__
    val = self.default_factory(key)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<redacted>/python3.12/site-packages/pyiceberg/table/__init__.py", line 1843, in _build_manifest_evaluator
    return manifest_evaluator(spec, self.table_metadata.schema(), self.partition_filters[spec_id], self.case_sensitive)
                                                                  ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "<redacted>/python3.12/site-packages/pyiceberg/typedef.py", line 72, in __missing__
    val = self.default_factory(key)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<redacted>/python3.12/site-packages/pyiceberg/table/__init__.py", line 1835, in _build_partition_projection
    return project(self.row_filter)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "<redacted>/python3.12/site-packages/pyiceberg/expressions/visitors.py", line 811, in project
    return visit(bind(self.schema, rewrite_not(expr), self.case_sensitive), self)
                                   ^^^^^^^^^^^^^^^^^
  File "<redacted>/python3.12/site-packages/pyiceberg/expressions/visitors.py", line 430, in rewrite_not
    return visit(expr, _RewriteNotVisitor())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.12.9/Frameworks/Python.framework/Versions/3.12/lib/python3.12/functools.py", line 912, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<redacted>/python3.12/site-packages/pyiceberg/expressions/visitors.py", line 153, in visit
    raise NotImplementedError(f"Cannot visit unsupported expression: {obj}")
NotImplementedError: Cannot visit unsupported expression: False
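
The last frames of the traceback show `visit` dispatching on the class of its first argument through `functools` and falling through to the default handler, because a plain Python `bool` has no registered handler. The following is a minimal, self-contained sketch of that mechanism and of what registering `bool` could look like. The classes here are simplified stand-ins, not PyIceberg's actual code, and whether such a registration belongs in PyIceberg or the literal should be converted on the Polars side is an open question in this thread.

```python
from functools import singledispatch


class AlwaysTrue:
    """Stand-in for an always-true expression node."""


class AlwaysFalse:
    """Stand-in for an always-false expression node."""


class DescribeVisitor:
    """Toy visitor that just names the node it was asked to visit."""

    def visit_true(self) -> str:
        return "always-true"

    def visit_false(self) -> str:
        return "always-false"


@singledispatch
def visit(obj, visitor):
    # Default handler: reached for any type without a registration.
    raise NotImplementedError(f"Cannot visit unsupported expression: {obj}")


@visit.register(AlwaysTrue)
def _(obj, visitor):
    return visitor.visit_true()


@visit.register(AlwaysFalse)
def _(obj, visitor):
    return visitor.visit_false()


# Hypothetical addition: route plain bool literals to the same handlers.
@visit.register(bool)
def _(obj, visitor):
    return visitor.visit_true() if obj else visitor.visit_false()


if __name__ == "__main__":
    v = DescribeVisitor()
    print(visit(AlwaysTrue(), v))
    print(visit(False, v))
```

Without the `bool` registration, `visit(False, v)` in this sketch would reach the default handler and raise the same kind of `NotImplementedError` as in the traceback.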

HynekBlaha, Nov 04 '25 20:11