gpdb icon indicating copy to clipboard operation
gpdb copied to clipboard

Derive combined hashed spec outer join master

Open THANATOSLAVA opened this issue 3 years ago • 3 comments

Derive Combined Hashed Spec For Outer Joins

Issue: Outer join operations enforce unnecessary data redistribution, causing execution of ORCA plans to be much slower than that of planner plans.

Root cause: Unlike inner join operators, which derive hashed distribution spec from both relations; outer join operators only derive hashed distribution spec from the outer relation. The missing hashed distribution spec from the inner relation caused unnecessary data movement.

Solution: Derive combined hashed spec for outer join operations from both relations. Eg. Outer join of 10 relations will deliver a combined hashed spec with 10 equivalent specs.

Implementation:

  1. [CPhysicalLeftOuterHashJoin] -- Override PdsDerive (distribution spec derivation) in CPhysicalJoin. Add a case where both outer and inner relations are hash distributed. Return combined distribution spec if both outer and inner relations are hash distributed. Since NULLs are only added to unmatched rows, set Nullscolocated to false for all equivalent distribution specs of the inner relation.
  2. [CPhysicalHashJoin] -- In matching a hashed distribution spec in inner join operations, set Nulls Colocation to true. In matching a hashed distribution spec in outer join operations, set Nulls Colocation to false only if the join condition isn't null-aware. This reflects the Nulls Colocation property required for / delivered by the outer relation in hash join operations.
  3. [CDistributionSpecHashed] -- (1) Rewrite Combine function for hash distribution spec with linked list concatenation. (2) Rewrite Copy function with recursion to ensure deep copy. (3) Add Copy function to allow fNullsColocated configuration. (4) Enforce nulls colocation for hash redistribution. This is necessary when the non-null hash distribution request is not met. (5) Make ComputeEquivHashExprs recursive to compute hash expression for all equivalent hashed specs
  4. [CPhysicalFullMergeJoin] -- Fix PdsDerive (distribution spec derivation). In full joins, both tables are o uter tables. The join output is hash distributed by non-NULL join keys.
  5. [CPhysical*Join] -- Add is_null_aware member to all the classes using the AddHashOrMergeJoinAlternative template. If the join is null-aware, nulls colocation has to be set true in deriving/requesting hash distribution specs. If the join isn't null-aware, nulls colocation can be set false.
  6. [CXformUtils] -- Check if the join condition is composed of equality predicates only. The output is passed to AddHashOrMergeJoinAlternative for determination of join condition null-awareness.
  7. [CDistributionSpecTest] -- Add function test for hash spec combination and copy.
  8. [regress] -- Test hashed distribution spec derivation and motion enforcement in outer join with INDF join condition
  9. [minidump] -- MDP plan shape update: LOJNonNullRejectingPredicates, LOJReorderWithSimplePredicate, Remove -Distinct-From-Subquery. The rest are SpaceSize and scan order changes. Add LeftJoinNullsNotColocated. Added user examples to verify inner join matches the outer relation's Nulls Colocation.

THANATOSLAVA avatar Aug 31 '22 21:08 THANATOSLAVA

Why I don't see new minidump test cases?

gpopt avatar Aug 31 '22 22:08 gpopt

Why I don't see new minidump test cases?

Probably has to do with cherry-pick. Here's from cmakelist --

- DPv2GreedyOnly DPv2MinCardOnly DPv2QueryOnly LOJ-PushDown LeftJoinDPv2JoinOrder;
+ DPv2GreedyOnly DPv2MinCardOnly DPv2QueryOnly LOJ-PushDown LeftJoinDPv2JoinOrder
+ LeftJoinNullsNotColocated InnerJoinBroadcastTableHashSpec LeftJoinBroadcastTableHashSpec InnerJoinReplicatedTableHashSpec;

THANATOSLAVA avatar Sep 06 '22 16:09 THANATOSLAVA

I left some quick comments from skimming the code.

Nit - I appreciate your commit messages, but can we keep them to an 80 character line limit?

I fixed the format. Somehow it showed up as code blocks in the first place...

THANATOSLAVA avatar Sep 09 '22 17:09 THANATOSLAVA

Please rebase on the latest master and make sure CI pipeline is green.

gpopt avatar Sep 22 '22 19:09 gpopt