iceberg-python icon indicating copy to clipboard operation
iceberg-python copied to clipboard

Table scan using functional filters

Open bigluck opened this issue 2 years ago • 3 comments

Feature Request / Improvement

Ciao @Fokko Seems like table.scan() supports a limited set of filter conditions, and it fails when a user specifies a complex one.

In my case, I have this input query:

SELECT * FROM wind_energy_sensor_data where observed_at::date = '2020-03-03'

Once mapped into a list of fields and filters using sqlglot, I get:

        {
            "filter": "CAST(observed_at AS DATE) = '2020-03-03'",
            "name": "wind_energy_sensor_data",
            "projections":
            [
                "*",
                "observed_at"
            ]
        }

But then when I pass the filter to the table.scan() function, it raises this ParseException:

     scan = table.scan(
            ^^^^^^^^^^^
   File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 473, in scan
     return DataScan(
            ^^^^^^^^^
   File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 773, in __init__
     super().__init__(table, row_filter, selected_fields, case_sensitive, snapshot_id, options, limit)
   File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 629, in __init__
     self.row_filter = _parse_row_filter(row_filter)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/pip/pyiceberg/pyiceberg/table/__init__.py", line 603, in _parse_row_filter
     return parser.parse(expr) if isinstance(expr, str) else expr
            ^^^^^^^^^^^^^^^^^^
   File "/pip/pyiceberg/pyiceberg/expressions/parser.py", line 270, in parse
     return boolean_expression.parse_string(expr, parse_all=True)[0]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/pip/pyparsing/pyparsing/core.py", line 1197, in parse_string
     raise exc.with_traceback(None)
 pyparsing.exceptions.ParseException: Expected expr, found '('  (at char 4), (line:1, col:5)

How hard is it to extend support for functional filters (as CAST() in my case)? And more importantly, is it something that makes sense to have, or pyiceberg (and in general iceberg) expect these types of conditions in a different format?

thanks, Luca

bigluck avatar Nov 29 '23 13:11 bigluck

Buongiorno @bigluck, great seeing you here again. Looks like we can break it down into smaller chunks:

  • For the stats evaluation it looks like we need to add the BoundTransform to the left-hand side. When the table is partitioned by day(observed_at), then we can use it to do the optimization (the transform can be dropped from both sides).
  • We also would need logic to convert the expression into an Arrow one: day(observed_at) => ds.field('observed_at').cast('date') == '2020-03-03'
  • Extend the parser.py to add the CAST(...) syntax.

Fokko avatar Nov 29 '23 21:11 Fokko

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar May 28 '24 00:05 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Jun 12 '24 00:06 github-actions[bot]

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Dec 10 '24 00:12 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Dec 25 '24 00:12 github-actions[bot]