iceberg-python
feat(expressions/visitor): add support for bool
Feature Request / Improvement
I came across this corner case when trying to construct filters additively:
pl.lit(True) was used as the base expression, and other filters were ANDed onto it. Filtering on a bare boolean literal (either True or False) fails:
(
pl.scan_iceberg(table, storage_options=storage_options)
.filter(pl.lit(False))
.explain()
)
python dataset: convert from arrow schema
expand_datasets(): python[IcebergDataset]: limit: None, project: all
IcebergDataset: to_dataset_scan(): snapshot ID: None, limit: None, projection: None, filter_columns: [], pyarrow_predicate: Some(<redacted>), iceberg_table_filter: Some(<redacted>), self._use_metadata_statistics: True
IcebergDataset: to_dataset_scan(): tbl.metadata.current_snapshot_id: 6267713928597070305
IcebergDataset: to_dataset_scan(): begin path expansion
File "python3.12/site-packages/pyiceberg/expressions/visitors.py", line 153, in visit
raise NotImplementedError(f"Cannot visit unsupported expression: {obj}")
NotImplementedError: Cannot visit unsupported expression: False
(
pl.scan_iceberg(table, storage_options=storage_options)
.filter(pl.lit(True))
.explain()
)
python dataset: convert from arrow schema
expand_datasets(): python[IcebergDataset]: limit: None, project: all
IcebergDataset: to_dataset_scan(): snapshot ID: None, limit: None, projection: None, filter_columns: [], pyarrow_predicate: Some(<redacted>), iceberg_table_filter: Some(<redacted>), self._use_metadata_statistics: True
IcebergDataset: to_dataset_scan(): tbl.metadata.current_snapshot_id: 6267713928597070305
IcebergDataset: to_dataset_scan(): begin path expansion
File "python3.12/site-packages/pyiceberg/expressions/visitors.py", line 153, in visit
raise NotImplementedError(f"Cannot visit unsupported expression: {obj}")
NotImplementedError: Cannot visit unsupported expression: True
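Until boolean literals are handled, one possible workaround for the additive pattern described above is to avoid the literal base expression entirely. The helper below is a minimal sketch, not part of Polars or PyIceberg: `combine_filters` is a hypothetical name, and it assumes the predicates support `&` (as Polars expressions do).

```python
from functools import reduce
import operator


def combine_filters(predicates):
    """AND together all predicates; return None when there are none.

    Hypothetical helper: it works with any objects that implement `&`,
    so no True/False literal is needed as a starting value.
    """
    predicates = list(predicates)
    if not predicates:
        return None
    return reduce(operator.and_, predicates)


# Intended usage with Polars (assumes `table` is a pyiceberg table):
#   predicate = combine_filters([pl.col("id") == "id_001"])
#   lf = pl.scan_iceberg(table)
#   if predicate is not None:
#       lf = lf.filter(predicate)
```

With this shape, the filter is only applied when at least one real predicate exists, so the scan never receives a bare boolean literal from the calling code.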
Version
pl.show_versions()
--------Version info---------
Polars: 1.35.1
Index type: UInt64
Platform: macOS-15.7.1-arm64-arm-64bit
Python: 3.12.9 (main, Feb 4 2025, 14:38:38) [Clang 16.0.0 (clang-1600.0.26.6)]
Runtime: rt64
----Optional dependencies----
Azure CLI 2.70.0
adbc_driver_manager <not installed>
altair <not installed>
azure.identity 1.15.0
boto3 <not installed>
cloudpickle <not installed>
connectorx <not installed>
deltalake <not installed>
fastexcel <not installed>
fsspec 2024.3.1
gevent <not installed>
google.auth 2.23.4
great_tables <not installed>
matplotlib 3.10.5
numpy 2.2.0
openpyxl 3.1.2
pandas 2.3.1
polars_cloud <not installed>
pyarrow 21.0.0
pydantic 2.4.2
pyiceberg 0.10.0
sqlalchemy 2.0.23
torch <not installed>
xlsx2csv <not installed>
xlsxwriter <not installed>
I see there are AlwaysTrue and AlwaysFalse visitors
https://github.com/apache/iceberg-python/blob/8878b2c260bf62ad16ef6f8e8d3a6ae93103eff7/pyiceberg/expressions/visitors.py#L156-L165
which I think should match the True/False literals.
Do you know how Polars converts the .filter expression internally?
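To illustrate the idea, here is a self-contained sketch of how a `functools.singledispatch`-based `visit` function (the dispatch style that shows up in the traceback) could route Python `bool` values to the same handlers as the always-true and always-false expressions. Everything in it is a hypothetical stand-in: the `AlwaysTrue`/`AlwaysFalse` classes and `DescribingVisitor` are defined locally and are not the PyIceberg classes, and this is not necessarily how the library would implement it.

```python
from functools import singledispatch


class AlwaysTrue:
    """Stand-in for an always-true expression (not the pyiceberg class)."""


class AlwaysFalse:
    """Stand-in for an always-false expression (not the pyiceberg class)."""


class DescribingVisitor:
    """Toy visitor that returns a label for each constant expression."""

    def visit_true(self):
        return "always-true"

    def visit_false(self):
        return "always-false"


@singledispatch
def visit(obj, visitor):
    # Fallback for any type without a registered handler.
    raise NotImplementedError(f"Cannot visit unsupported expression: {obj}")


@visit.register(AlwaysTrue)
def _visit_always_true(obj, visitor):
    return visitor.visit_true()


@visit.register(AlwaysFalse)
def _visit_always_false(obj, visitor):
    return visitor.visit_false()


@visit.register(bool)
def _visit_bool(obj, visitor):
    # Treat a plain Python bool like the matching constant expression.
    return visitor.visit_true() if obj else visitor.visit_false()
```

In this sketch, `visit(False, DescribingVisitor())` reaches `visit_false` instead of falling through to the `NotImplementedError` fallback, while unregistered types such as `int` still raise.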
Hello @kevinjqliu, this might be a better reproducible example.
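from datetime import datetime, timezone

import polars as pl
import pyarrow as pa
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import HourTransform
from pyiceberg.types import NestedField, StringType, TimestamptzType

# `catalog` is assumed to be an already-loaded pyiceberg catalog.
catalog.create_namespace_if_not_exists("tmp")
table = catalog.create_table_if_not_exists(
    "tmp.dummy3",
    schema=Schema(
        NestedField(field_id=1, name="ts", field_type=TimestamptzType(), required=False),
        NestedField(field_id=2, name="id", field_type=StringType(), required=False),
    ),
    # Partition by hour of the `ts` column
    partition_spec=PartitionSpec(
        PartitionField(source_id=1, field_id=1000, transform=HourTransform(), name="ts_hour"),
    ),
)
# Create sample data with different hours to demonstrate partitioning
sample_data = pa.Table.from_pydict({
    "ts": [
        datetime(2025, 1, 15, 10, 30, 0, tzinfo=timezone.utc),
        datetime(2025, 1, 15, 10, 45, 0, tzinfo=timezone.utc),
        datetime(2025, 1, 15, 11, 15, 0, tzinfo=timezone.utc),
        datetime(2025, 1, 15, 12, 20, 0, tzinfo=timezone.utc),
        datetime(2025, 1, 15, 12, 50, 0, tzinfo=timezone.utc),
    ],
    "id": ["id_001", "id_002", "id_003", "id_004", "id_005"],
})
table.append(sample_data)
# Filtering on a boolean literal raises NotImplementedError (output below)
pl.scan_iceberg(table).filter(pl.lit(False)).explain()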
python dataset: convert from arrow schema
expand_datasets(): python[IcebergDataset]: limit: None, project: all
IcebergDataset: to_dataset_scan(): snapshot ID: None, limit: None, projection: None, filter_columns: [], pyarrow_predicate: Some(<redacted>), iceberg_table_filter: Some(<redacted>), self._use_metadata_statistics: True
IcebergDataset: to_dataset_scan(): tbl.metadata.current_snapshot_id: 784574589744330117
IcebergDataset: to_dataset_scan(): begin path expansion
Traceback (most recent call last):
File "<redacted>/python3.12/site-packages/IPython/core/interactiveshell.py", line 3550, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-5-07bec56b4d3f>", line 1, in <module>
pl.scan_iceberg(table).filter(pl.lit(False)).explain()
File "<redacted>/python3.12/site-packages/polars/_utils/deprecation.py", line 97, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "<redacted>/python3.12/site-packages/polars/lazyframe/opt_flags.py", line 328, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "<redacted>/python3.12/site-packages/polars/lazyframe/frame.py", line 1384, in explain
return ldf.describe_optimized_plan()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<redacted>/python3.12/site-packages/polars/io/iceberg/dataset.py", line 85, in to_dataset_scan
scan_data := self._to_dataset_scan_impl(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<redacted>/python3.12/site-packages/polars/io/iceberg/dataset.py", line 254, in _to_dataset_scan_impl
for i, file_info in enumerate(scan.plan_files()):
^^^^^^^^^^^^^^^^^
File "<redacted>/python3.12/site-packages/pyiceberg/table/__init__.py", line 1924, in plan_files
if manifest_evaluators[manifest_file.partition_spec_id](manifest_file)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<redacted>/python3.12/site-packages/pyiceberg/typedef.py", line 72, in __missing__
val = self.default_factory(key)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "<redacted>/python3.12/site-packages/pyiceberg/table/__init__.py", line 1843, in _build_manifest_evaluator
return manifest_evaluator(spec, self.table_metadata.schema(), self.partition_filters[spec_id], self.case_sensitive)
~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
File "<redacted>/python3.12/site-packages/pyiceberg/typedef.py", line 72, in __missing__
val = self.default_factory(key)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "<redacted>/python3.12/site-packages/pyiceberg/table/__init__.py", line 1835, in _build_partition_projection
return project(self.row_filter)
^^^^^^^^^^^^^^^^^^^^^^^^
File "<redacted>/python3.12/site-packages/pyiceberg/expressions/visitors.py", line 811, in project
return visit(bind(self.schema, rewrite_not(expr), self.case_sensitive), self)
^^^^^^^^^^^^^^^^^
File "<redacted>/python3.12/site-packages/pyiceberg/expressions/visitors.py", line 430, in rewrite_not
return visit(expr, _RewriteNotVisitor())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/[email protected]/3.12.9/Frameworks/Python.framework/Versions/3.12/lib/python3.12/functools.py", line 912, in wrapper
return dispatch(args[0].__class__)(*args, **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<redacted>/python3.12/site-packages/pyiceberg/expressions/visitors.py", line 153, in visit
raise NotImplementedError(f"Cannot visit unsupported expression: {obj}")
NotImplementedError: Cannot visit unsupported expression: False