iceberg-python icon indicating copy to clipboard operation
iceberg-python copied to clipboard

Optimize `plan_files` with filter in case whe it is fully evaluated on Iceberg metadata

Open srilman opened this issue 1 year ago • 2 comments

Feature Request / Improvement

I noticed that in DataScan.plan_files, when we apply filters at a partition and file metadata level, all we try to determine is whether the file has rows that never match the filter or some might match. However, we can also easily determine if all rows in the file match the filter. This can occur when a file can be fully determined on partitions or file metadata.

This could enable some additional optimizations in file scan planning (in order of complexity):

  • When the filter is always true for all output files, we can skip the row-level filter in DataScan.to_arrow
  • When the filter is determined to be always true at the partition filter level, we can skip filtering on the file level
  • We can split files between ROWS_MIGHT_MATCH and ROWS_ALWAYS_MATCH and do half-and-half (at both partition -> file and output side)
  • We can go even more extreme and partially evaluate / simplify filters based on metadata before passing to later steps. The evaluation would work like:
    • If a expression is determined to always be false on all rows in a file/partition, then replace with AlwaysFalse
    • Likewise, if an expression would always be true on all rows, replace with AlwaysTrue
    • Simplify as walking up the tree

srilman avatar Mar 02 '24 14:03 srilman

When the filter is always true for all output files, we can skip the row-level filter in DataScan.to_arrow

I have a somewhat working implementation of this optimization where I expand _InclusiveMetricsEvaluator to return 3 different values, ROWS_MIGHT_MATCH, ROWS_CANNOT_MATCH, and (the new one) ROWS_ALWAYS_MATCH. Updating the tests now, and will try to get a PR out.

srilman avatar Mar 02 '24 14:03 srilman

Hey @srilman Thanks for reaching out here. I'm aware of the potential optimization, but most query engines don't optimize to that level. I'm very curious about the PR. Feel free to ping me when you have a working example.

Fokko avatar Mar 06 '24 10:03 Fokko

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Sep 03 '24 00:09 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Sep 18 '24 00:09 github-actions[bot]