datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Support "pre-image" for pruning predicate evaluation

Open sdf-jkl opened this issue 1 month ago • 10 comments

Which issue does this PR close?

  • Closes #18320.

Rationale for this change

What changes are included in this PR?

Adding a new rule to expr_simplifier library - udf_preimage

This rule performs the optimization for the following operators:

  • [x] Equal
  • [x] NotEqual
  • [x] Greater
  • [x] GreaterEqual
  • [x] Less
  • [x] LessEqual
  • [x] IsDistinctFrom
  • [x] IsNotDistinctFrom

The rule currently supports optimization for date_part function given 'year' literal is passed as the interval parameter.

Are these changes tested?

Tested for most comparison operators and all possible datatypes in unit tests and sqllogictests. Unit tests are missing:

  • IsDistinctFrom
  • IsNotDistinctFrom

Sqllogictests are missing:

  • NULL as the literal compared to in the predicate expression
  • full test case for IsDistinctFrom/IsNotDistinctFrom

Are there any user-facing changes?

sdf-jkl avatar Nov 17 '25 21:11 sdf-jkl

@alamb This works as is right now, but I like the idea of adding a new method ScalarUDFImpl::simplify_predicate() as a unified api for all scalar functions. We then can move this code in the DatePartFunc implementation of ScalarUDFImpl and create a simplification rule for scalar functions.

The rule will check if there is a scalar function is present in the predicate expression and match the corresponding ScalarUDFImpl::simplify_predicate() to simplify the expression.

sdf-jkl avatar Nov 17 '25 21:11 sdf-jkl

@alamb @2010YOUY01 Thanks for your reviews, I've implemented most of the suggested changes, but there are few nits here and there. I will self-review to point them out.

A major question I have about the current logic: should we support preimage for in_list predicate expressions?

Currently inlist_simplifier simplifies expr IN (A, B, ...) --> (expr = A) OR (expr = B) OR (expr = C) if (list.len() is 1) or (list.len() less than 3 and the left expression is a column expression) https://github.com/apache/datafusion/blob/c7b339e0b016ace1d3c337f756eb684ce4bc57b5/datafusion/optimizer/src/simplify_expressions/inlist_simplifier.rs#L47-L55

Originally, I was following the unwrap_cast implementation and it's using a separate inlist simplification logic specifically for cast expressions: https://github.com/apache/datafusion/blob/12cb4cae1ce9020ddfa2f9890f9f4e7c1a43fccb/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L1913-L1961

I think that using existing inlist_simplifier would be more elegant and save space, but it would let other expressions to be simplified like this expr IN (A, B, ...) --> (expr = A) OR (expr = B) OR (expr = C)

I believe we could improve the logic here:

if !list.is_empty()
                && (
                    // For lists with only 1 value we allow more complex expressions to be simplified
                    // e.g SUBSTR(c1, 2, 3) IN ('1') -> SUBSTR(c1, 2, 3) = '1'
                    // for more than one we avoid repeating this potentially expensive
                    // expressions
                    list.len() == 1
                        || list.len() <= THRESHOLD_INLINE_INLIST
                            && expr.try_as_col().is_some()
                        || list.len() <= THRESHOLD_INLINE_INLIST
                            && matches!(**expr, Expr::ScalarFunction(ScalarFunction { func, args })) 
                            && func.preimage_support() // by adding a method to `ScalarUDFImpl` and `ScalarUDF`
                )

This way, only scalar functions that explicitly support preimage will be simplified here.

Please let me know what you think about this approach.

sdf-jkl avatar Nov 21 '25 21:11 sdf-jkl

Thank you again for pushing it forward! @sdf-jkl

To make the PRs easier to review, we typically want them to be small and focused, rather than being comprehensive.

I suggest we can

  1. Update the PR writeup for the idea to implement (I think it's this one? https://github.com/apache/datafusion/pull/18789#discussion_r2561194219)
  2. List the applicable cases in the writeup (like this one ab9b444 and others)
  3. In the initial PR only include the simplest applicable case, for example year extraction re-writing.

And we can iterate faster for this PR, and add other similar optimizations in the future.

2010YOUY01 avatar Dec 02 '25 03:12 2010YOUY01

Thank you again for pushing it forward! @sdf-jkl

To make the PRs easier to review, we typically want them to be small and focused, rather than being comprehensive.

I suggest we can

  1. Update the PR writeup for the idea to implement (I think it's this one? Support "pre-image" for pruning predicate evaluation #18789 (comment))
  2. List the applicable cases in the writeup (like this one ab9b444 and others)
  3. In the initial PR only include the simplest applicable case, for example year extraction re-writing.

And we can iterate faster for this PR, and add other similar optimizations in the future.

@2010YOUY01 Thanks for the heads up, this makes sense.

I will work on adding the sqllogictests to finish this PR and create separate issues for other improvements.

sdf-jkl avatar Dec 02 '25 15:12 sdf-jkl

@alamb @2010YOUY01 I updated sqllogictests to include these changes:

The tests check:

  • Simple optimization where col in LHS of the comparison
  • Check that date_part is out of the explain statements for the same queries
  • Simple optimizations, but col in RHS
  • Check that different interval types for date_part are not optimized
  • Check simple optimizations for different datatypes
  • Check explain statements for the different datatype queries above

The tests do not check:

  • NULL as the literal to compare to in the expression (null is not currently supported for this optimization)
  • Full null-handling cases for is distinct from/is not distinct from due to missing tests above
  • Edge cases?

sdf-jkl avatar Dec 02 '25 20:12 sdf-jkl

What can be done in the future PRs is:

  • add inlist support
  • support NULL as a literal to compare to for this optimization + improve sqllogictests
  • other suggestions?

@2010YOUY01 Should I restore this PR to a previous simpler commit to close and move the new commits to a separate branch?

sdf-jkl avatar Dec 02 '25 20:12 sdf-jkl

@alamb @2010YOUY01 following up on this.

sdf-jkl avatar Dec 08 '25 14:12 sdf-jkl

Thanks @sdf-jkl -- I'll try and review this later today

alamb avatar Dec 08 '25 17:12 alamb

starting to check this one out

alamb avatar Dec 09 '25 20:12 alamb

BTW if you want my active notes (aka I tried a few things to refine the API) you can see what I was trying in this commit: https://github.com/alamb/datafusion/commit/49057e929b5350dd821daf2083e0b51214fe4e86

alamb avatar Dec 09 '25 22:12 alamb

Hey @alamb please check when you're available!

sdf-jkl avatar Dec 15 '25 18:12 sdf-jkl

Thanks -- I'll try and find some time over the next day or two

alamb avatar Dec 16 '25 12:12 alamb