materialize icon indicating copy to clipboard operation
materialize copied to clipboard

Join-Dependent Predicate Duplication

Open ggevay opened this issue 2 years ago • 5 comments

There are cases where a traditional predicate pushdown is not possible, but it's still possible to create an extra filter on some join inputs, and thereby reduce the data volume that goes into the join.

For example:

SELECT * FROM t1, t2 WHERE t1.f1 = t2.f1 AND ((t1.f2 = 3 AND t2.f2 = 4) OR (t1.f2 = 5 AND t2.f2 = 6));

Here, we can pre-filter t1 with t1.f2 = 3 OR t1.f2 = 5, and t2 with t2.f2 = 4 OR t2.f2 = 6.

Similar situations occur in TPC-H Q07, Q19, and chbench Q07, Q19.

Motivation

  • This PR adds a known-desirable feature: #14501, and #7409 (partially).

Tips for reviewer

In TPC-H Q19 and chbench Q19 some of the newly created filters get lifted to after the join by join_implementation.rs, which doesn't looks so good at first glance. However:

  • I presume that the rendering might still be able to push those to the join closure that is just after that input, which means that we'll filter with them before joining in other inputs.
  • Even if they end up in the final closure, evaluating the expression might still be slightly faster, because the lifted filters (fortunately) come before the original predicate, and they are simpler than the original predicate.

Checklist

  • [x] This PR has adequate test coverage / QA involvement has been duly considered.

  • [ ] This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-protobuf label.

  • [x] This PR includes the following user-facing behavior changes:

    • Performance improvement, see above.

ggevay avatar Sep 08 '22 20:09 ggevay

I presume that the rendering might still be able to push those to the join closure that is just after that input, which means that we'll filter with them before joining in other inputs.

You don't need to presume. You can create EXPLAIN PHYSICAL PLAN AS TEXT or EXPLAIN PHYSICAL PLAN AS JSON tests to see if the filters are in the join closure that is just after that input.

wangandi avatar Sep 18 '22 23:09 wangandi

@ggevay do you mind if I test this tomorrow?

philip-stoev avatar Sep 20 '22 11:09 philip-stoev

@philip-stoev That would be great, thx!

ggevay avatar Sep 20 '22 11:09 ggevay

Thanks, @wangandi! I addressed the comments.

I presume that the rendering might still be able to push those to the join closure that is just after that input, which means that we'll filter with them before joining in other inputs.

You don't need to presume. You can create EXPLAIN PHYSICAL PLAN AS TEXT or EXPLAIN PHYSICAL PLAN AS JSON tests to see if the filters are in the join closure that is just after that input.

Hm, good idea! I added a test with the physical plan of a delta join.

ggevay avatar Sep 20 '22 13:09 ggevay

Item No 1. This query has a Distinct and a Union that are missing in the prior commit.

materialize=> EXPLAIN  SELECT * FROM t2 AS a1 left JOIN t2 AS a2 USING ( f1 ) WHERE a1 . f1 = 1 AND a2 . f1 = 2  OR a1 .f2 = 3 AND a1 .f2 = 4;
%0 = Let l0 =
| Get materialize.public.t2 (u4)
| ArrangeBy (#0)

%1 = Let l1 =
| Join %0 %0 (= #0 #2)
| | implementation = DeltaQuery
| |   delta %0 %0.(#0)
| |   delta %0 %0.(#0)
| Filter (#0) IS NOT NULL
| Project (#0, #1, #3)

%2 =
| Get %1 (l1)
| Filter (((#0 = 2) AND (#0 = 1)) OR ((#1 = 3) AND (#1 = 4)))

%3 =
| Get %1 (l1)
| Map (#0 = 1), ((#1 = 3) AND (#1 = 4))
| Filter (#3 OR #4), (#4 OR (#3 AND null))
| Project (#0, #1)
| Distinct group=(#0, #1)
| Negate

%4 =
| Get materialize.public.t2 (u4)
| Map (#0 = 1), ((#1 = 3) AND (#1 = 4))
| Filter (#2 OR #3), (#3 OR (#2 AND null))
| Project (#0, #1)
| Distinct group=(#0, #1) <- THIS HERE

%5 =
| Union %3 %4 <- AND THIS HERE

%6 =
| Get materialize.public.t2 (u4)
| ArrangeBy (#1, #0)

%7 =
| Join %5 %6 (= #0 #2) (= #1 #3)
| | implementation = Differential %5 %6.(#1, #0)
| Map (#0 = 1), ((#1 = 3) AND (#1 = 4)), null
| Filter (#4 OR #5), (#5 OR (#4 AND null))
| Project (#0, #1, #6)

%8 =
| Union %2 %7

Dataset in https://github.com/MaterializeInc/materialize/issues/14905

philip-stoev avatar Sep 21 '22 13:09 philip-stoev

Item No 2. This query has a Map operator in this branch:

materialize=> EXPLAIN  SELECT * FROM t2 a1  JOIN t1  USING ( f1  )  WHERE (f1  IN (  1, 2 ,3 ,4 ,5 ,6 ,7 ,8 ,9 , 10  )  AND a1 .f2  IN (  6  ,  4  ))  OR f1  IN (  0    )   ;
%0 =
| Get materialize.public.t2 (u4)
| ArrangeBy (#0)

%1 =
| Get materialize.public.t1 (u1)
| ArrangeBy (#0)

%2 =
| Join %0 %1 (= #0 #2)
| | implementation = DeltaQuery
| |   delta %0 %1.(#0)
| |   delta %1 %0.(#0)
| Map (#0 = 0), (#0 = 2), (#0 = 3), (#0 = 4), (#0 = 5), (#0 = 6), (#0 = 7), (#0 = 8), (#0 = 9), (#0 = 10), (#0 = 1)
| Filter (#0) IS NOT NULL, (#4 OR #5 OR #6 OR #7 OR #8 OR #9 OR #10 OR #11 OR #12 OR #13 OR #14), (#4 OR ((#5 OR #6 OR #7 OR #8 OR #9 OR #10 OR #11 OR #12 OR #13 OR #14) AND ((#1 = 4) OR (#1 = 6))))
| Project (#0, #1, #3)

So I was wondering:

  • why are we not joining the "join-with-constant-source" trick that kicks in in other cases
  • With the Map all the comparisons between f1 and all the individual items in the IN list are performed, right? In this case, there is no short-cut that will avoid computing the rest of the comparisons if a given one evaluates to True.

philip-stoev avatar Sep 23 '22 08:09 philip-stoev

Thank you very much for the thorough testing @philip-stoev!

Item No 1.: This is actually an issue not in this PR, but in the literal constraints detection, so I will open a separate PR to fix it.

Item No 2.: At the moment, we are doing the "join-with-constant-source" trick only when the filter (or actually the MFP) is right after a Get. Predicate pushdown makes this to be the case:

Project (#0, #1, #3)
  Join on=(#0 = #2)
    Filter (#0) IS NOT NULL AND ((#1 = 4) OR (#1 = 6)) AND ((#0 = 2) OR (#0 = 3) OR (#0 = 4) OR (#0 = 5) OR (#0 = 6) OR (#0 = 7) OR (#0 = 8) OR (#0 = 9) OR (#0 = 10) OR (#0 = 1))
      Get materialize.public.t2
    Filter (#0) IS NOT NULL AND ((#0 = 2) OR (#0 = 3) OR (#0 = 4) OR (#0 = 5) OR (#0 = 6) OR (#0 = 7) OR (#0 = 8) OR (#0 = 9) OR (#0 = 10) OR (#0 = 1))
      Get materialize.public.t1

But then join_implementation lifts the filters, because it wants to use the existing indexes for a delta join, and if there are extra filters there, then it cannot use the existing index. So after join_implementation, the filter is no longer right after the Get.

Project (#0, #1, #3)
  Project (#0..=#3)
    Filter (#0) IS NOT NULL AND ((#0 = 2) OR (#0 = 3) OR (#0 = 4) OR (#0 = 5) OR (#0 = 6) OR (#0 = 7) OR (#0 = 8) OR (#0 = 9) OR (#0 = 10) OR (#0 = 1)) AND ((#1 = 4) OR (#1 = 6)) AND (#2) IS NOT NULL AND ((#2 = 2) OR (#2 = 3) OR (#2 = 4) OR (#2 = 5) OR (#2 = 6) OR (#2 = 7) OR (#2 = 8) OR (#2 = 9) OR (#2 = 10) OR (#2 = 1))
      Map ()
        Join on=(#0 = #2) type=delta
          ArrangeBy keys=[[#0]]
            Get materialize.public.t2
          ArrangeBy keys=[[#0]]
            Get materialize.public.t1

I wrote a comment now about this here: https://github.com/MaterializeInc/materialize/issues/14500#issuecomment-1257179631

ggevay avatar Sep 25 '22 12:09 ggevay