datafusion
datafusion copied to clipboard
Support simplification that requires multiple applications of constant folding / simplification
Is your feature request related to a problem or challenge? Please describe what you are trying to do. The constant folding and constant evaluation in constant_folding.rs(and added to in https://github.com/apache/arrow-datafusion/pull/1153) can not fold certain types of expressions where it takes
For example there is an expression like this in https://github.com/apache/arrow-datafusion/pull/1153
now() < cast(to_timestamp(...) as int) + 5000000000
It is entirely evaluateable at query time, however, as formulated in #1153 it will not be folded because it requires the constant evaluator to run after the simplifier (because the simplifier can fill in now()` and then the evaluator will fill in the rest).
However, there are other types of exprs such as (true or false) != col
which need to have the constant evaluator run before the simplifier to be fully simplified.
It is a more general problem where running either the Simplifier
or ConstEvaluator
could allow a second pass of the other to proceed
Describe the solution you'd like In general this is typically handled with some sort of 'fixed point' algorithm where there is a loop that runs the two passes until the Expr is not changed. It may also be possible to combine the Simplifier pass with the constant rewriter.
Describe alternatives you've considered I think there are two possible approaches I can think if:
- applying constant eval / simplifier in a loop until no changes are detected
- Combining the
ConstantEval
andSimplify
steps into a single pass
Additional context https://github.com/apache/arrow-datafusion/pull/1153
Is this still relevant after #1175 (PR #1176) ? If you think it is, do you have an example in mind ? I can't come up with one 😃
@rdettai I think it is still relevant (though the now()
example will work after #1175)
One (silly) example
(expr != NULL) OR (5 > 10)
We need to apply at least one pass twice to fully simplify
(expr != NULL) OR (5 > 10)
(expr != NULL) OR FALSE # after constant evaluation
NULL OR FALSE # after simplify
NULL # after second constant evaluation
In this this case if we applied simplify first and then constant evaluation the expression would be folded completely
However, for other examples if you apply simplification first that is insufficient
expr != (5 > NULL)
expr != (5 > NULL) # after simplification
expr != NULL # after constant evaluation
(would need a second call to simplify now)
Great thanks! This is very clear now. So if I understand correctly, this problem has two solutions:
- walk the tree once, and at each node apply either expression folding or constant evaluation
- walk the tree multiple times, alternating between expression folding and constant evaluation
The solution 1 (similar to what @pjmore proposed in #1128) would likely be more expensive in terms of space because more state needs no be tracked in the subtrees, and more critically would result in a much more complex algorithm. So we have opted for solution 2, which will be easier to implement and test, and the time overhead of walking the tree multiple times will likely be negligible as expression trees are usually rather small (and will grow even smaller at each iteration).
(I think this is already what you mentioned in https://github.com/apache/arrow-datafusion/pull/1128#pullrequestreview-782535004, but I thought a recap wouldn't hurt 😄)
(I think this is already what you mentioned in #1128 (review), but I thought a recap wouldn't hurt 😄)
👍 it is a nice summary;
I am interested in working on this and #3770 as I have a use case for a capability to simplify deeply nested expressions.
E.g.
(((col - 10) + 10) *100) / 100
Simplifies to just
col
Also something like
WHERE ((x<1 or y<2) and z<3) and false
is just
WHERE false
Given the age of these discussions, I am wondering if any of the context or recommended approaches may have changed. @alamb any new thoughts on this?
Given the age of these discussions, I am wondering if any of the context or recommended approaches may have changed. @alamb any new thoughts on this?
I think algorithms for such order dependent simplifications look something like:
do {
expr = simplify(expr)
} until expr is unchanged OR some fixed number of iterations is reached
Perhaps we can use the API that @peter-toth added in https://github.com/apache/arrow-datafusion/pull/8891 to check if/when the expr is unchanged.
The rewrite API for a long time did not have a way to know when the expression was actually rewritten (and thus when to terminate the iteration). I think we have it now
PR for this is ready for review https://github.com/apache/datafusion/pull/10358.
Maybe as a follow up feature we should expose the maximum number of iterations as a configuration parameter?