Support "pre-image" for pruning predicate evaluation
Which issue does this PR close?
- Closes #18320.
Rationale for this change
What changes are included in this PR?
Adding a new rule to expr_simplifier library - udf_preimage
This rule performs the optimization for the following operators:
- [x] Equal
- [x] NotEqual
- [x] Greater
- [x] GreaterEqual
- [x] Less
- [x] LessEqual
- [x] IsDistinctFrom
- [x] IsNotDistinctFrom
The rule currently supports optimization for date_part function given 'year' literal is passed as the interval parameter.
Are these changes tested?
Tested for most comparison operators and all possible datatypes in unit tests and sqllogictests. Unit tests are missing:
- IsDistinctFrom
- IsNotDistinctFrom
Sqllogictests are missing:
- NULL as the literal compared to in the predicate expression
- full test case for IsDistinctFrom/IsNotDistinctFrom
Are there any user-facing changes?
@alamb This works as is right now, but I like the idea of adding a new method ScalarUDFImpl::simplify_predicate() as a unified api for all scalar functions. We then can move this code in the DatePartFunc implementation of ScalarUDFImpl and create a simplification rule for scalar functions.
The rule will check if there is a scalar function is present in the predicate expression and match the corresponding ScalarUDFImpl::simplify_predicate() to simplify the expression.
@alamb @2010YOUY01 Thanks for your reviews, I've implemented most of the suggested changes, but there are few nits here and there. I will self-review to point them out.
A major question I have about the current logic: should we support preimage for in_list predicate expressions?
Currently inlist_simplifier simplifies expr IN (A, B, ...) --> (expr = A) OR (expr = B) OR (expr = C)
if (list.len() is 1) or (list.len() less than 3 and the left expression is a column expression)
https://github.com/apache/datafusion/blob/c7b339e0b016ace1d3c337f756eb684ce4bc57b5/datafusion/optimizer/src/simplify_expressions/inlist_simplifier.rs#L47-L55
Originally, I was following the unwrap_cast implementation and it's using a separate inlist simplification logic specifically for cast expressions:
https://github.com/apache/datafusion/blob/12cb4cae1ce9020ddfa2f9890f9f4e7c1a43fccb/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L1913-L1961
I think that using existing inlist_simplifier would be more elegant and save space, but it would let other expressions to be simplified like this expr IN (A, B, ...) --> (expr = A) OR (expr = B) OR (expr = C)
I believe we could improve the logic here:
if !list.is_empty()
&& (
// For lists with only 1 value we allow more complex expressions to be simplified
// e.g SUBSTR(c1, 2, 3) IN ('1') -> SUBSTR(c1, 2, 3) = '1'
// for more than one we avoid repeating this potentially expensive
// expressions
list.len() == 1
|| list.len() <= THRESHOLD_INLINE_INLIST
&& expr.try_as_col().is_some()
|| list.len() <= THRESHOLD_INLINE_INLIST
&& matches!(**expr, Expr::ScalarFunction(ScalarFunction { func, args }))
&& func.preimage_support() // by adding a method to `ScalarUDFImpl` and `ScalarUDF`
)
This way, only scalar functions that explicitly support preimage will be simplified here.
Please let me know what you think about this approach.
Thank you again for pushing it forward! @sdf-jkl
To make the PRs easier to review, we typically want them to be small and focused, rather than being comprehensive.
I suggest we can
- Update the PR writeup for the idea to implement (I think it's this one? https://github.com/apache/datafusion/pull/18789#discussion_r2561194219)
- List the applicable cases in the writeup (like this one ab9b444 and others)
- In the initial PR only include the simplest applicable case, for example year extraction re-writing.
And we can iterate faster for this PR, and add other similar optimizations in the future.
Thank you again for pushing it forward! @sdf-jkl
To make the PRs easier to review, we typically want them to be small and focused, rather than being comprehensive.
I suggest we can
- Update the PR writeup for the idea to implement (I think it's this one? Support "pre-image" for pruning predicate evaluation #18789 (comment))
- List the applicable cases in the writeup (like this one ab9b444 and others)
- In the initial PR only include the simplest applicable case, for example year extraction re-writing.
And we can iterate faster for this PR, and add other similar optimizations in the future.
@2010YOUY01 Thanks for the heads up, this makes sense.
I will work on adding the sqllogictests to finish this PR and create separate issues for other improvements.
@alamb @2010YOUY01 I updated sqllogictests to include these changes:
The tests check:
- Simple optimization where col in LHS of the comparison
- Check that date_part is out of the explain statements for the same queries
- Simple optimizations, but col in RHS
- Check that different interval types for date_part are not optimized
- Check simple optimizations for different datatypes
- Check explain statements for the different datatype queries above
The tests do not check:
- NULL as the literal to compare to in the expression (null is not currently supported for this optimization)
- Full null-handling cases for
is distinct from/is not distinct fromdue to missing tests above - Edge cases?
What can be done in the future PRs is:
- add
inlistsupport - support
NULLas a literal to compare to for this optimization + improve sqllogictests - other suggestions?
@2010YOUY01 Should I restore this PR to a previous simpler commit to close and move the new commits to a separate branch?
@alamb @2010YOUY01 following up on this.
Thanks @sdf-jkl -- I'll try and review this later today
starting to check this one out
BTW if you want my active notes (aka I tried a few things to refine the API) you can see what I was trying in this commit: https://github.com/alamb/datafusion/commit/49057e929b5350dd821daf2083e0b51214fe4e86
Hey @alamb please check when you're available!
Thanks -- I'll try and find some time over the next day or two