datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

consolidate all the `InList` rewrites into a single pass

Open guojidan opened this issue 1 year ago • 7 comments

          Thank you @guojidan  -- this looks good to me. I think as a follow on PR it would be nice to consolidate all the `InList` rewrites into a single pass (rather than `ShortenInListSimplifier` and `InListSimplifier` but this PR seems like a good step in that direction

Thanks again

Originally posted by @alamb in https://github.com/apache/arrow-datafusion/pull/9037#pullrequestreview-1861422186

guojidan avatar Feb 06 '24 01:02 guojidan

I am not sure if @guojidan plans to work on this one. If not I think it would make a good first issue for someone

alamb avatar Feb 06 '24 02:02 alamb

I don't have much time recently because the Spring Festival holiday has arrived. So If everyone is interested, let's implement this 😄

guojidan avatar Feb 06 '24 02:02 guojidan

@guojidan @alamb , would love to help here.thx. 😄

manoj-inukolunu avatar Feb 06 '24 05:02 manoj-inukolunu

Hello @alamb @guojidan , I was working through a PR for this . I have a question on ShortenInListSimplifier what is the significance of THRESHOLD_INLINE_INLIST ? and why is it limited to 3? Thank you.

manoj-inukolunu avatar Feb 11 '24 12:02 manoj-inukolunu

I think it is a heuristic that avoids pathalogical cases when there are large numbers constants in the IN list (sometimes machine generated SQL can have 1000s of elements)

For example:

https://github.com/apache/arrow-datafusion/blob/afb169cd069e0227fb0ef6d39f44d5eabbdc21a2/datafusion/optimizer/src/simplify_expressions/inlist_simplifier.rs#L53-L59

(it would be great if you had a chance to add some comments with rationale)

alamb avatar Feb 11 '24 12:02 alamb

Thank you @alamb , I will add a comment. I attempted to do the in list simplifier in one pass . I am not really sure if its possible in a single pass. Here is the test which is failing if i try that .

// c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) AND c1 IN (3,6) -> c1 = 3 OR c1 = 6

After the first pass it simplifies to

c1 IN (3,6)

After the ShortenInListSimplifier it simplifies to

 c1 = 3 OR c1 = 6

If i try to combile the ShortenInListSimplifier with the InListSimplifier , in the first pass itself

c1 IN (3,6)

is getting simplified to c1 = 3 OR c1 = 6 according to the ShortenInListSimplifier which is breaking the test . Additionaly i attempted a benchmark , the additional pass seems to be adding around 6% overhead. There are additional tests too which are failing too if we try a single pass although this is the one which majorly stood out.

manoj-inukolunu avatar Feb 14 '24 13:02 manoj-inukolunu

If i try to combile the ShortenInListSimplifier with the InListSimplifier , in the first pass itself

Ths sounds like the logic for the two simplifications is being invoked in a different order (or maybe only one is being invoked? 🤔 ). Could we perhaps invoke both simplications but invoke the code to simplify first and then the code to shorten?

Additionaly i attempted a benchmark , the additional pass seems to be adding around 6% overhead.

That is good data and more reason I think to consolidate the passes. Thank you

alamb avatar Feb 20 '24 07:02 alamb

hi @manoj-inukolunu , ShortenInListSimplifier should be executed after InListSimplifier has executed. one Expr will be executed recursively at each Simplifier example:

c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) AND c1 IN (3,6)

InListSimplifier

  1. c1 IN (1,2,3,4,5,6) -> c1 IN (1,2,3,4,5,6)
  2. c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
  3. c1 IN (3,6) -> c1 IN (3,6)
  4. c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
  5. c1 IN (1,3,5,6) AND c1 IN (3,6)
  6. c1 IN (3,6)

ShortenInListSimplifier 8. c1 = 3 OR c1 = 6

if we combile the ShortenInListSimplifier with the InListSimplifier, it will become below workflow:

  1. c1 IN (1,2,3,4,5,6) -> c1 IN (1,2,3,4,5,6)
  2. c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
  3. c1 IN (3,6) -> c1 = 3 OR c1 = 6
  4. c1 IN (1,2,3,4,5,6) AND c1 IN (1,3,5,6) -> c1 IN (1,3,5,6)
  5. c1 IN (1,3,5,6) AND c1 = 3 OR c1 = 6

maybe you can view older version inlist simplifier logical(before my pr), I can got more inspiration

guojidan avatar Feb 28 '24 07:02 guojidan

So, the key of this issue is how to determine the time to execute ShortenInListSimplifier, Separate execution is the simplest 🤔

guojidan avatar Feb 28 '24 07:02 guojidan

@manoj-inukolunu @guojidan Is anyone working on this currently? If not, I will take a a look on this issue.

jayzhan211 avatar Mar 15 '24 03:03 jayzhan211

@manoj-inukolunu @guojidan Is anyone working on this currently? If not, I will take a a look on this issue.

I am not working on this, thank you

guojidan avatar Mar 15 '24 04:03 guojidan

FYI @jayzhan211 I had started playing with this at some point but didn't finish

One approach might be to port individual parts of the inlist rewrite in multiple PRs (the entire logic was quite tricky, so doing it incrementally would likely be easier to code and easier to review)

alamb avatar Mar 15 '24 14:03 alamb

port individual parts of the inlist rewrite in multiple PRs

@alamb Are you talking about removing cloned or moving short inlist into orlist simplifier

I'm looking into the issue of moving short inlist, and I found that it may not able to move into the inlist simplifier. short inlist into orlist simplifier should be started after the other 3 rules have been done recursively But, it does not seem there is a way to enforce the order in one f_up function. Do you have an idea to merge these rules?

jayzhan211 avatar Mar 15 '24 14:03 jayzhan211

@alamb Are you talking about removing cloned or moving short inlist into orlist simplifier

I was thinking in general to simplify the problem by moving as many rules as possible into the large match statement here

https://github.com/apache/arrow-datafusion/blob/81b0a011705b17a09f494f550a5382b0c3414597/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L630

short inlist into orlist simplifier should be started after the other 3 rules have been done recursively But, it does not seem there is a way to enforce the order in one f_up function.

I agree -- f_up gets run on each Expr of the tree so you can't express "apply three rules after this other"

So maybe we could move the three rules first into the large match and then figure out what to do with the short inlist into orlist simplifier rule

alamb avatar Mar 15 '24 14:03 alamb