datafusion
datafusion copied to clipboard
consolidate all the `InList` rewrites into a single pass
Thank you @guojidan -- this looks good to me. I think as a follow on PR it would be nice to consolidate all the `InList` rewrites into a single pass (rather than `ShortenInListSimplifier` and `InListSimplifier` but this PR seems like a good step in that direction
Thanks again
Originally posted by @alamb in https://github.com/apache/arrow-datafusion/pull/9037#pullrequestreview-1861422186
I am not sure if @guojidan plans to work on this one. If not I think it would make a good first issue for someone
I don't have much time recently because the Spring Festival holiday has arrived. So If everyone is interested, let's implement this 😄
@guojidan @alamb , would love to help here.thx. 😄
Hello @alamb @guojidan , I was working through a PR for this . I have a question on ShortenInListSimplifier
what is the significance of THRESHOLD_INLINE_INLIST
? and why is it limited to 3? Thank you.
I think it is a heuristic that avoids pathalogical cases when there are large numbers constants in the IN
list (sometimes machine generated SQL can have 1000s of elements)
For example:
https://github.com/apache/arrow-datafusion/blob/afb169cd069e0227fb0ef6d39f44d5eabbdc21a2/datafusion/optimizer/src/simplify_expressions/inlist_simplifier.rs#L53-L59
(it would be great if you had a chance to add some comments with rationale)
Thank you @alamb , I will add a comment. I attempted to do the in list simplifier in one pass . I am not really sure if its possible in a single pass. Here is the test which is failing if i try that .
// c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) AND c1 IN (3,6) -> c1 = 3 OR c1 = 6
After the first pass it simplifies to
c1 IN (3,6)
After the ShortenInListSimplifier
it simplifies to
c1 = 3 OR c1 = 6
If i try to combile the ShortenInListSimplifier
with the InListSimplifier
, in the first pass itself
c1 IN (3,6)
is getting simplified to c1 = 3 OR c1 = 6
according to the ShortenInListSimplifier
which is breaking the test . Additionaly i attempted a benchmark , the additional pass seems to be adding around 6% overhead. There are additional tests too which are failing too if we try a single pass although this is the one which majorly stood out.
If i try to combile the ShortenInListSimplifier with the InListSimplifier , in the first pass itself
Ths sounds like the logic for the two simplifications is being invoked in a different order (or maybe only one is being invoked? 🤔 ). Could we perhaps invoke both simplications but invoke the code to simplify first and then the code to shorten?
Additionaly i attempted a benchmark , the additional pass seems to be adding around 6% overhead.
That is good data and more reason I think to consolidate the passes. Thank you
hi @manoj-inukolunu , ShortenInListSimplifier
should be executed after InListSimplifier
has executed. one Expr will be executed recursively at each Simplifier
example:
c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) AND c1 IN (3,6)
InListSimplifier
- c1 IN (1,2,3,4,5,6) -> c1 IN (1,2,3,4,5,6)
- c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
- c1 IN (3,6) -> c1 IN (3,6)
- c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
- c1 IN (1,3,5,6) AND c1 IN (3,6)
- c1 IN (3,6)
ShortenInListSimplifier
8. c1 = 3 OR c1 = 6
if we combile the ShortenInListSimplifier
with the InListSimplifier
, it will become below workflow:
- c1 IN (1,2,3,4,5,6) -> c1 IN (1,2,3,4,5,6)
- c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
- c1 IN (3,6) -> c1 = 3 OR c1 = 6
- c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
- c1 IN (1,3,5,6) AND c1 = 3 OR c1 = 6
maybe you can view older version inlist simplifier logical(before my pr), I can got more inspiration
So, the key of this issue is how to determine the time to execute ShortenInListSimplifier
, Separate execution is the simplest 🤔
@manoj-inukolunu @guojidan Is anyone working on this currently? If not, I will take a a look on this issue.
@manoj-inukolunu @guojidan Is anyone working on this currently? If not, I will take a a look on this issue.
I am not working on this, thank you
FYI @jayzhan211 I had started playing with this at some point but didn't finish
One approach might be to port individual parts of the inlist rewrite in multiple PRs (the entire logic was quite tricky, so doing it incrementally would likely be easier to code and easier to review)
port individual parts of the inlist rewrite in multiple PRs
@alamb Are you talking about removing cloned or moving short inlist into orlist simplifier
I'm looking into the issue of moving short inlist, and I found that it may not able to move into the inlist simplifier. short inlist into orlist simplifier
should be started after the other 3 rules have been done recursively But, it does not seem there is a way to enforce the order in one f_up
function. Do you have an idea to merge these rules?
@alamb Are you talking about removing cloned or moving short inlist into orlist simplifier
I was thinking in general to simplify the problem by moving as many rules as possible into the large match
statement here
https://github.com/apache/arrow-datafusion/blob/81b0a011705b17a09f494f550a5382b0c3414597/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L630
short inlist into orlist simplifier should be started after the other 3 rules have been done recursively But, it does not seem there is a way to enforce the order in one f_up function.
I agree -- f_up gets run on each Expr of the tree so you can't express "apply three rules after this other"
So maybe we could move the three rules first into the large match
and then figure out what to do with the short inlist into orlist simplifier
rule