iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Parquet: Make row-group filters cooperate to filter

Open zhongyujiang opened this issue 1 year ago • 1 comments

This PR refactors three Parquet row-group filters into a form that computes residual expressions, allowing it to return a residual expression for the given row-groups. The residual computed by the previous filter can be passed to the next filter, allowing the three Parquet row-group filters to work together. This improves the handling of some OR condition queries.

For example:
Let's assume we have a query a = 'foo' OR b = 'bar', where column a is dictionary-encoded in a Parquet row-group, while column b is not entirely dictionary-encoded in all data pages but has a bloom filter. Therefore, a = 'foo' can only be evaluated by the dictionary filter, and b = 'bar' can only be evaluated by the bloom filter. In the current situation, even if both filters evaluate the expressions as ROWS_CANNOT_MATCH individually, because each filter can only evaluate one sub-expression, the final output would still be ROWS_MIGHT_MATCH (let's assume the metric filter evaluates both sub-expressions as ROWS_MIGHT_MATCH). After refactoring into the form of computing residuals, the dictionary filter will compute the residual for a = 'foo' OR b = 'bar' as b = 'bar'. Then this residual expression will be passed to the bloom filter and evaluated as Expressions.alwaysFalse(). As a result, the reading of this row-group can be skipped.

This is a revive of #6893, and can close #10029.

cc @cccs-jc @rdblue @huaxingao @amogh-jahagirdar @RussellSpitzer Could you please review this? Thanks!

zhongyujiang avatar Apr 07 '24 02:04 zhongyujiang

@zhongyujiang I'm really sorry for the delayed review on my part. I think this is an important improvement!

I will be taking a deeper look at the implementation this week. A while back, I did check this code out locally and run some tests on some samples to get confidence on correctness but of course we'll also want unit test coverage as much as reasonably possible.

amogh-jahagirdar avatar Jun 10 '24 06:06 amogh-jahagirdar

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Oct 25 '24 00:10 github-actions[bot]

Hey @amogh-jahagirdar @Fokko @nastra @danielcweeks, can you help review this when you have time? Thanks

zhongyujiang avatar Oct 25 '24 05:10 zhongyujiang