hive icon indicating copy to clipboard operation
hive copied to clipboard

HIVE-28480: Disable SMB on partition hash generator mismatch across join branches…

Open himanshu-mishra opened this issue 1 year ago • 5 comments

… in previous RS

What changes were proposed in this pull request?

As SMB replaces last RS op from the joining branches and the JOIN op with MERGEJOIN, we need to ensure the RS before these RS, in both branches, are partitioning using same hash generator.

Hash code generator differs based on ReducerTraits.UNIFORM i.e. ReduceSinkOperator#computeMurmurHash() or ReduceSinkOperator#computeHashCode(), leading to different hash code for same value.

Skip SMB join in such cases.

Why are the changes needed?

It fixes bug that results in incorrect join result. Example query and behavior explanation is added in https://issues.apache.org/jira/browse/HIVE-28480

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Tested via auto_sortmerge_join_18.q added in this PR

himanshu-mishra avatar Aug 24 '24 21:08 himanshu-mishra

The patch seems reasonable. LGTM

ngsg avatar Aug 26 '24 06:08 ngsg

@okumin if this looks good, please merge the PR

himanshu-mishra avatar Aug 27 '24 19:08 himanshu-mishra

@himanshu-mishra Sorry, I don't have the right to merge a PR as I am not a committer. You need approval of any committer who is ideally an expert of SMB join.

okumin avatar Aug 28 '24 00:08 okumin

@kasakrisz can you please review or add reviewer

himanshu-mishra avatar Aug 28 '24 07:08 himanshu-mishra

This is a good candidate to fix this issue, but have you checked why the reducer traits are different?

Consider the sample query

SELECT * FROM (
    SELECT k, COUNT(DISTINCT v), SUM(v)
    FROM t_asj_18 GROUP BY k
) a LEFT JOIN (
    SELECT k, COUNT(v)
    FROM t_asj_18 GROUP BY k
) b ON a.k = b.k; 

In this query the aggregation k, count(distinct v).. group by k leads to group by and hence following RS operator's keys to be k, v while keeping partition column k, as we are aggregating on k. In such cases where key columns and partition columns differ, the UNIFORM trait is removed refer commit. Note that the group by keys and partition columns differ only with distinct aggregate, removing that would lead to both being a thereby having UNIFORM trait.

himanshu-mishra avatar Aug 30 '24 06:08 himanshu-mishra

@kasakrisz let me know if there are pending concerns.

himanshu-mishra avatar Sep 04 '24 09:09 himanshu-mishra

@kasakrisz let me know if there are pending concerns.

Sorry, I was occupied in the last few days. I left only one comment.

kasakrisz avatar Sep 05 '24 06:09 kasakrisz

@kasakrisz as you have approved the changes, can you help with merging the PR.

himanshu-mishra avatar Sep 06 '24 12:09 himanshu-mishra