[SPARK-40177][SQL] Simplify condition of form (a==b) || (a==null&&b==null) to a<=>b
What changes were proposed in this pull request?
A new case is added to Boolean simplification that converts conditions of the form (a==b) || (a==null&&b==null) to a<=>b.
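As a rough illustration of why the rewrite is safe in a join or filter condition, the sketch below models SQL three-valued logic in plain Scala, with `None` standing in for NULL. It is not Spark's implementation, and every name in it (`NullSafeEqualitySketch`, `sqlEq`, `original`, `accepts`, and so on) is hypothetical. Under this model the two forms can differ in value when exactly one side is NULL (the original form yields NULL, `<=>` yields false), but both are non-true, so a join or filter that keeps a row only when the predicate is TRUE treats them the same.

```scala
// A minimal sketch, assuming SQL three-valued logic; None models NULL.
// All names here are hypothetical and not part of Spark.
object NullSafeEqualitySketch {
  type SqlBool = Option[Boolean]

  // SQL `=`: the result is NULL if either operand is NULL.
  def sqlEq(a: Option[Int], b: Option[Int]): SqlBool =
    for (x <- a; y <- b) yield x == y

  // SQL `<=>`: never NULL; two NULLs compare as equal.
  def nullSafeEq(a: Option[Int], b: Option[Int]): Boolean = a == b

  // isnull(...) always returns a definite true or false.
  def isNull(a: Option[Int]): SqlBool = Some(a.isEmpty)

  // Three-valued AND: false dominates, otherwise NULL is contagious.
  def and(p: SqlBool, q: SqlBool): SqlBool = (p, q) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  // Three-valued OR: true dominates, otherwise NULL is contagious.
  def or(p: SqlBool, q: SqlBool): SqlBool = (p, q) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // The pattern targeted by the rewrite: (a = b) || (isnull(a) && isnull(b)).
  def original(a: Option[Int], b: Option[Int]): SqlBool =
    or(sqlEq(a, b), and(isNull(a), isNull(b)))

  // A join or filter accepts a row only when the predicate is TRUE.
  def accepts(p: SqlBool): Boolean = p.contains(true)

  def main(args: Array[String]): Unit = {
    val values: Seq[Option[Int]] = Seq(None, Some(1), Some(2))
    for (a <- values; b <- values)
      println(s"a=$a b=$b original=${original(a, b)} nullSafe=${nullSafeEq(a, b)}")
  }
}
```

The comparison is made through `accepts` rather than on the raw values because the equivalence only holds where NULL and false are treated alike, as in join and filter conditions.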
Why are the changes needed?
If a join condition has the form key1==key2 || (key1==null && key2==null), the join is executed as a Broadcast Nested Loop Join (BNLJ), because the condition does not qualify as an equi-join condition. BNLJ takes more time than a sort merge join or a broadcast hash join. Converting the condition to key1<=>key2 lets the join execute as a broadcast hash join or a sort merge join, which improves the performance of queries whose join conditions match this pattern.
Sample query:
val dfAns = df.join(df1, (df("v")===df1("x") or (isnull(df("v")) and isnull(df1("x")))), "leftanti")
Plan before change
OptimizedPlan:
Join LeftAnti, ((v#1 = x#15) || (isnull(v#1) && isnull(x#15)))
:- LocalRelation [g#0, v#1, o#2, x#3]
+- LocalRelation [x#15]
dfAns.queryExecution.executedPlan:
*(1) BroadcastNestedLoopJoin BuildRight, LeftAnti, ((v#256 = x#270) || (isnull(v#256) && isnull(x#270)))
:- LocalTableScan [g#255, v#256, o#257, x#258]
+- BroadcastExchange IdentityBroadcastMode, [id=#91]
   +- LocalTableScan [x#270]
Plan after change
OptimizedPlan:
Join LeftAnti, (v#29 <=> x#79)
:- LocalRelation [g#28, v#29, o#30, x#31]
+- LocalRelation [x#79]
ExecutedPlan:
*(1) BroadcastHashJoin [coalesce(v#29, 0), isnull(v#29)], [coalesce(x#71, 0), isnull(x#71)], LeftAnti, BuildRight
:- LocalTableScan [g#28, v#29, o#30, x#31]
+- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(coalesce(input[0, int, false], 0), isnull(input[0, int, false]))), [id=#57]
   +- LocalTableScan [x#71]
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests run
Can one of the admins verify this patch?
Gentle ping @cloud-fan @srowen: could you please help verify this patch?
cc @sigmod @wangyum
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!