hyperspace [WIP] New version of join index rule

What changes were proposed in this pull request?

Introducing a new optimizer rule JoinIndexRuleV2 for optimizing join rules. This rule is introduced to overcome the limitations of the original JoinIndexRule.

/**
 * Join Index Rule V2. This rule tries to optimizes both sides of a shuffle based join
 * independently. The optimization works by replacing data sources with bucketed indexes which
 * match the join predicate partitioning.
 *
 * Algorithm:
 * 1. Identify whether this join node can be optimized:
 *    a. We support only equi-joins in CNF forms. Also make sure the join columns are directly
 *       picked from scan nodes.
 *    b. This join is not a broadcast hash join. To check this, we independently check the left
 *       and right sides of the join and make sure their size is less than
 *       "spark.sql.autoBroadcastJoinThreshold"
 * 2. Independently check left and right sides of the join for available indexes. If an index
 *    is picked, the shuffle on that side will be eliminated.
 */

Why was this change introduced?

JoinIndexRule:

works on both sides of the join at the same time. It doesn't work independently on both sides, limiting the possibility of shuffle elimination on just one side wherever possible.
works only on the lowest level join. for broadcast joins, it doesn't improve performance much.
cannot intelligently find a higher level shuffle based join to optimize.

JoinIndexRuleV2:

works on either sides of the join independently. It can pick indexes to eliminate shuffle on just one side also.
optimizes shuffle based joins (e.g. SortMergeJoin), disregards broadcast join.
can work on higher level joins if one side of the join can be replaced by index and can eliminate shuffle.

E.g.
      SMJ
    /    \
  t1       BHJ
          /   \
        t2    t3
t1 can be replaced with t1Index to improve the SMJ. t2 can also be replaced with t2Index to improve the SMJ by propagating it's partitioning information up the BHJ.

To enable JoinIndexRuleV2: set spark conf "spark.hyperspace.enableJoinRuleV2" to true.

Does this PR introduce any user-facing change?

no

How was this patch tested?

unit tests added

Notes From Comments:

Question: is there any case that JoinRule v2 cannot cover but JoinRule v1 can? Could we replace v1 with v2?

Thanks @sezruby ,

V1 rule works on both sides of a join simultaneously so it can find better, more compatible indexes on both sides. V2 works independently on either sides and so may not find the most compatible pair of indexes.

There's one case where this affects: multi-column join query:

SELECT t1.a, t1.b, t2.c, t2.d
FROM t1, t2
WHERE t1.a = t2.c AND t1.b = t2.d

V1 will find an index where the columns are compatible.

T1Index: IndexColumns: a,b
T2Index: IndexColumns: c,d

V2 may choose non-matching indexes like this

T1Index: IndexColumns: a,b
T2Index: IndexColumns: d,c      >> note the reversed order of index  columns

THere's no guarantee that for multi-column indexes, the most compatible index pair is selected

Aug 14 '20 21:08 apoorvedave1

Question: is there any case that JoinRule v2 cannot cover but JoinRule v1 can? Could we replace v1 with v2?

Aug 27 '20 05:08 sezruby

Question: is there any case that JoinRule v2 cannot cover but JoinRule v1 can? Could we replace v1 with v2?

V1 rule works on both sides of a join simultaneously so it can find better, more compatible indexes on both sides. V2 works independently on either sides and so may not find the most compatible pair of indexes.

There's one case where this affects: multi-column join query:

SELECT t1.a, t1.b, t2.c, t2.d
FROM t1, t2
WHERE t1.a = t2.c AND t1.b = t2.d

V1 will find an index where the columns are compatible.

T1Index: IndexColumns: a,b
T2Index: IndexColumns: c,d

V2 may choose non-matching indexes like this

T1Index: IndexColumns: a,b
T2Index: IndexColumns: d,c      >> note the reversed order of index  columns

THere's no guarantee that for multi-column indexes, the most compatible index pair is selected

Aug 27 '20 15:08 apoorvedave1

can work on higher level joins if one side of the join can be replaced by index and can eliminate shuffle.

Could you update this with a concrete example for others?

Nov 03 '20 02:11 imback82

Could you fix the build error? Let's merge this PR for the release. Thanks!

Nov 06 '20 00:11 sezruby