Table scan using functional filters
Feature Request / Improvement
Ciao @Fokko
Seems like table.scan() supports a limited set of filter conditions, and it fails when a user specifies a complex one.
In my case, I have this input query:
SELECT * FROM wind_energy_sensor_data where observed_at::date = '2020-03-03'
Once mapped into a list of fields and filters using sqlglot, I get:
{
"filter": "CAST(observed_at AS DATE) = '2020-03-03'",
"name": "wind_energy_sensor_data",
"projections":
[
"*",
"observed_at"
]
}
But then when I pass the filter to the table.scan() function, it raises this ParseException:
scan = table.scan(
^^^^^^^^^^^
File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 473, in scan
return DataScan(
^^^^^^^^^
File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 773, in __init__
super().__init__(table, row_filter, selected_fields, case_sensitive, snapshot_id, options, limit)
File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 629, in __init__
self.row_filter = _parse_row_filter(row_filter)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 603, in _parse_row_filter
return parser.parse(expr) if isinstance(expr, str) else expr
^^^^^^^^^^^^^^^^^^
File "/pip/pyiceberg/pyiceberg/expressions/parser.py", line 270, in parse
return boolean_expression.parse_string(expr, parse_all=True)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/pip/pyparsing/pyparsing/core.py", line 1197, in parse_string
raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: Expected expr, found '(' (at char 4), (line:1, col:5)
How hard is it to extend support for functional filters (as CAST() in my case)? And more importantly, is it something that makes sense to have, or pyiceberg (and in general iceberg) expect these types of conditions in a different format?
thanks, Luca
Buongiorno @bigluck, great seeing you here again. Looks like we can break it down into smaller chunks:
- For the stats evaluation it looks like we need to add the BoundTransform to the left-hand side. When the table is partitioned by
day(observed_at), then we can use it to do the optimization (the transform can be dropped from both sides). - We also would need logic to convert the expression into an Arrow one:
day(observed_at) => ds.field('observed_at').cast('date') == '2020-03-03' - Extend the
parser.pyto add theCAST(...)syntax.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'