hive HIVE-28480: Disable SMB on partition hash generator mismatch across join branches…

… in previous RS

What changes were proposed in this pull request?

As SMB replaces last RS op from the joining branches and the JOIN op with MERGEJOIN, we need to ensure the RS before these RS, in both branches, are partitioning using same hash generator.

Hash code generator differs based on ReducerTraits.UNIFORM i.e. ReduceSinkOperator#computeMurmurHash() or ReduceSinkOperator#computeHashCode(), leading to different hash code for same value.

Skip SMB join in such cases.

Why are the changes needed?

It fixes bug that results in incorrect join result. Example query and behavior explanation is added in https://issues.apache.org/jira/browse/HIVE-28480

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Tested via auto_sortmerge_join_18.q added in this PR

Aug 24 '24 21:08 himanshu-mishra

The patch seems reasonable. LGTM

Aug 26 '24 06:08 ngsg

@okumin if this looks good, please merge the PR

Aug 27 '24 19:08 himanshu-mishra

@himanshu-mishra Sorry, I don't have the right to merge a PR as I am not a committer. You need approval of any committer who is ideally an expert of SMB join.

Aug 28 '24 00:08 okumin

@kasakrisz can you please review or add reviewer

Aug 28 '24 07:08 himanshu-mishra

This is a good candidate to fix this issue, but have you checked why the reducer traits are different?

Consider the sample query

SELECT * FROM (
    SELECT k, COUNT(DISTINCT v), SUM(v)
    FROM t_asj_18 GROUP BY k
) a LEFT JOIN (
    SELECT k, COUNT(v)
    FROM t_asj_18 GROUP BY k
) b ON a.k = b.k;

In this query the aggregation k, count(distinct v).. group by k leads to group by and hence following RS operator's keys to be k, v while keeping partition column k, as we are aggregating on k. In such cases where key columns and partition columns differ, the UNIFORM trait is removed refer commit. Note that the group by keys and partition columns differ only with distinct aggregate, removing that would lead to both being a thereby having UNIFORM trait.