[SPARK-45599][CORE] Use object equality in OpenHashSet
What changes were proposed in this pull request?
Change OpenHashSet to use object equality instead of cooperative equality when looking up keys.
Why are the changes needed?
This brings the behavior of OpenHashSet more in line with the semantics of java.util.HashSet, and fixes its behavior when comparing values for which equals and == return different results, like 0.0/-0.0 and NaN/NaN.
For example, in certain cases where both 0.0 and -0.0 are provided as keys to the set, lookups of one or the other key may return the wrong position in the set. This leads to the bug described in SPARK-45599 and summarized in this comment.
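To illustrate the underlying JVM semantics (a standalone sketch, not Spark code): primitive `==` treats `0.0` and `-0.0` as equal but distinct `NaN`s as unequal, while boxed `equals` (the object equality that `java.util.HashSet` relies on, and that this PR switches `OpenHashSet` to) does the opposite in both cases.

```java
import java.util.HashSet;

public class EqualitySemantics {
    public static void main(String[] args) {
        // Cooperative / primitive comparison: the two zeros are equal...
        System.out.println(0.0 == -0.0);                                   // true
        // ...but boxed object equality distinguishes them.
        System.out.println(Double.valueOf(0.0).equals(Double.valueOf(-0.0))); // false

        // NaN is the mirror image: never == itself, yet equals() itself.
        System.out.println(Double.NaN == Double.NaN);                      // false
        System.out.println(Double.valueOf(Double.NaN)
            .equals(Double.valueOf(Double.NaN)));                          // true

        // Consequently java.util.HashSet keeps 0.0 and -0.0 as two distinct keys.
        HashSet<Double> set = new HashSet<>();
        set.add(0.0);
        set.add(-0.0);
        System.out.println(set.size());                                    // 2
    }
}
```

A set that mixes the two notions of equality when hashing and probing can return the wrong position for one of these keys, which is the failure mode behind SPARK-45599.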
Does this PR introduce any user-facing change?
Yes, it resolves the bug described in SPARK-45599.
OpenHashSet is used widely under the hood, including in:
- `OpenHashMap`, which itself backs:
  - `TypedImperativeAggregate` aggregate functions like `percentile` and `mode`
  - many algorithms in ML and MLlib
- `SQLOpenHashSet`, which backs array functions like `array_union` and `array_distinct`
However, the user-facing changes should be limited to the kind of edge case described in SPARK-45599.
How was this patch tested?
New and existing unit tests. Of the new tests added in this PR, some simply validate that we have not changed existing SQL semantics, while others confirm that we have fixed the specific bug reported in SPARK-45599 along with any related incorrect behavior.
New tests failing on master but passing on this branch:
- Handling 0.0 and -0.0 in `OpenHashSet`
- Handling NaN in `OpenHashSet`
- Handling 0.0 and -0.0 in `OpenHashMap`
- Handling 0.0 and -0.0 when computing percentile
New tests passing both on master and this branch:
- Handling 0.0, -0.0, and NaN in `array_union`
- Handling 0.0, -0.0, and NaN in `array_distinct`
- Handling 0.0, -0.0, and NaN in `GROUP BY`
- Normalizing -0 and -0.0
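The "Normalizing -0 and -0.0" case can be sketched as follows (a simplified, hypothetical illustration of the idea behind Spark's floating-point normalization for group keys, not the actual implementation): before a double is used as a grouping key, fold `-0.0` into canonical `0.0` so both zeros land in the same group regardless of which equality notion the downstream hash structure uses.

```java
public class NormalizeZero {
    // Fold -0.0 into canonical 0.0; `d == 0.0` is true for both zeros.
    static double normalize(double d) {
        return d == 0.0 ? 0.0 : d;
    }

    public static void main(String[] args) {
        // After normalization, -0.0 and 0.0 have identical bit patterns,
        // so boxed equals/hashCode agree with primitive ==.
        System.out.println(Double.doubleToLongBits(normalize(-0.0))
                == Double.doubleToLongBits(0.0)); // true
    }
}
```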
Was this patch authored or co-authored using generative AI tooling?
No.
`core/testOnly org.apache.spark.util.collection.*` passes on my machine, but I haven't run the full test suite yet. I will monitor the full test results here on GitHub.
Build looks good.
@revans2 and @peter-toth - What do you think? (Tagging you since I noticed you are watching the linked JIRA ticket.)
Yes, I was watching this ticket because it is a correctness issue. The fix looks good to me.
cc @cloud-fan and @HeartSaVioR as it would be great to include the fix in 3.5.1.
I have no expertise in this area, so I will leave the decisions to others: which version(s) to merge the change into, and whether we want a safeguard, i.e. an escape hatch for the behavioral change. I can hold cutting the tag of RC1 for this.
> I can hold cutting the tag of RC1 for this.
Thanks @HeartSaVioR. BTW I don't think this should be a blocker of 3.5.1 as this is not a regression, but if we can find a quick fix for this issue it would be good to include it in the release.
@revans2 - Just to confirm, have you rerun your tests against this branch (the ones that exposed this bug) and do they pass now?
If we go in this direction and change `OpenHashSet`, do we still need `SQLOpenHashSet`?
`SQLOpenHashSet` also handles null differently. Not sure if `OpenHashSet` already covers it.
Yes, you are right. Probably the NaN special handling can be removed, though.
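For context on the null-handling point: a primitive-specialized open hash set has no slot for null, so a SQL-facing wrapper has to track null presence out of band. A minimal hypothetical sketch of that pattern (not Spark's actual `SQLOpenHashSet` code) looks like this:

```java
import java.util.HashSet;

// Hypothetical sketch: track "null was added" with a side flag, since the
// underlying value set (standing in for a primitive-keyed open hash set
// here) cannot represent a null key.
public class NullTrackingSet {
    private final HashSet<Double> values = new HashSet<>();
    private boolean containsNull = false; // side flag for the null "key"

    public void add(Double v) {
        if (v == null) {
            containsNull = true;
        } else {
            values.add(v);
        }
    }

    public boolean contains(Double v) {
        return v == null ? containsNull : values.contains(v);
    }
}
```

This is why switching `OpenHashSet` to object equality removes the need for the NaN workaround but not for the wrapper's separate null handling.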
Is there anything more we're waiting on here? (Perhaps just an extra confirmation from @revans2 that the original fuzz test he ran now passes?)
thanks, merging to master/3.5!
Hi, @cloud-fan and @nchammas .
This broke Apache Spark branch-3.5.
```
[error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:136:4: Invalid message filter:
[error] Unknown category: `other-non-cooperative-equals`
[error] @annotation.nowarn("cat=other-non-cooperative-equals")
[error]  ^
[error] one error found
```
Sorry but I reverted this from branch-3.5 to unblock the community PRs.
- https://github.com/apache/spark/commit/cbf25fb633f4bf2f83a6f6e39aafaa80bf47e160
Please make a backporting PR to branch-3.5 again after fixing the above annotation compilation. Thank you in advance.
@dongjoon-hyun - I think this warning category was added in Scala 2.13 and then backported to Scala 2.12. Trying to confirm right now over on https://github.com/scala/scala/pull/8120 and https://github.com/scala/scala/pull/10446.
I see that Spark 3.5 supports both Scala 2.12 and 2.13, so I will figure out some method that is compatible across both versions and open a PR against branch-3.5.
Backport to 3.5: #45296