spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-45599][CORE] Use object equality in OpenHashSet

Open nchammas opened this issue 1 year ago • 9 comments

What changes were proposed in this pull request?

Change OpenHashSet to use object equality instead of cooperative equality when looking up keys.

Why are the changes needed?

This brings the behavior of OpenHashSet more in line with the semantics of java.util.HashSet, and fixes its behavior when comparing values for which equals and == return different results, like 0.0/-0.0 and NaN/NaN.

For example, in certain cases where both 0.0 and -0.0 are provided as keys to the set, lookups of one or the other key may return the wrong position in the set. This leads to the bug described in SPARK-45599 and summarized in this comment.

Does this PR introduce any user-facing change?

Yes, it resolves the bug described in SPARK-45599.

OpenHashSet is used widely under the hood, including in:

  • OpenHashMap, which itself backs:
    • TypedImperativeAggregate
    • aggregate functions like percentile and mode
    • many algorithms in ML and MLlib
  • SQLOpenHashSet, which backs array functions like array_union and array_distinct

However, the user-facing changes should be limited to the kind of edge case described in SPARK-45599.

How was this patch tested?

New and existing unit tests. Of the new tests added in this PR, some simply validate that we have not changed existing SQL semantics, while others confirm that we have fixed the specific bug reported in SPARK-45599 along with any related incorrect behavior.

New tests failing on master but passing on this branch:

New tests passing both on master and this branch:

Was this patch authored or co-authored using generative AI tooling?

No.

nchammas avatar Feb 06 '24 02:02 nchammas

core/testOnly org.apache.spark.util.collection.* passes on my machine, but I haven't run the full test suite yet. I will monitor the full test results here on GitHub.

nchammas avatar Feb 06 '24 02:02 nchammas

Build looks good.

@revans2 and @peter-toth - What do you think? (Tagging you since I noticed you are watching the linked JIRA ticket.)

nchammas avatar Feb 06 '24 15:02 nchammas

Yes, I was watching this ticket because it is a correctness issue. The fix looks good to me.

cc @cloud-fan and @HeartSaVioR as it would be great to include the fix in 3.5.1.

peter-toth avatar Feb 06 '24 16:02 peter-toth

I have no expertise on this area so will leave the decision on merging the change to which version(s), and whether we want to safeguard or not, to introduce an escape-hatch on behavioral change. I can hold cutting the tag of RC1 for this.

HeartSaVioR avatar Feb 07 '24 04:02 HeartSaVioR

I can hold cutting the tag of RC1 for this.

Thanks @HeartSaVioR. BTW I don't think this should be a blocker of 3.5.1 as this is not a regression, but if we can find a quick fix for this issue it would be good to include it in the release.

peter-toth avatar Feb 07 '24 16:02 peter-toth

@revans2 - Just to confirm, have you rerun your tests against this branch (the ones that exposed this bug) and do they pass now?

nchammas avatar Feb 16 '24 05:02 nchammas

If we go this direction and change OpenHashSet do we still need SQLOpenHashSet?

peter-toth avatar Feb 16 '24 06:02 peter-toth

SQLOpenHashSet also handles null differently. Not sure if OpenHashSet already covers it.

cloud-fan avatar Feb 16 '24 06:02 cloud-fan

SQLOpenHashSet also handles null differently. Not sure if OpenHashSet already covers it.

Yes, you are right. Probably the NaN special handling can be remvoved though.

peter-toth avatar Feb 16 '24 06:02 peter-toth

Is there anything more we're waiting on here? (Perhaps just an extra confirmation from @revans2 that the original fuzz test he ran now passes?)

nchammas avatar Feb 22 '24 16:02 nchammas

thanks, merging to master/3.5!

cloud-fan avatar Feb 26 '24 08:02 cloud-fan

Hi, @cloud-fan and @nchammas .

This broke Apache Spark branch-3.5.

Screenshot 2024-02-27 at 08 38 03

[error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:136:4: Invalid message filter:
[error] Unknown category: `other-non-cooperative-equals`
[error]   @annotation.nowarn("cat=other-non-cooperative-equals")
[error]    ^
[error] one error found

dongjoon-hyun avatar Feb 27 '24 16:02 dongjoon-hyun

Sorry but I reverted this from branch-3.5 to unblock the community PRs.

  • https://github.com/apache/spark/commit/cbf25fb633f4bf2f83a6f6e39aafaa80bf47e160

Please make a backporting PR to branch-3.5 again after fixing the above annotation compilation. Thank you in advance.

dongjoon-hyun avatar Feb 27 '24 16:02 dongjoon-hyun

@dongjoon-hyun - I think this warning category was added in Scala 2.13 and then backported to Scala 2.12. Trying to confirm right now over on https://github.com/scala/scala/pull/8120 and https://github.com/scala/scala/pull/10446.

I see that Spark 3.5 supports both Scala 2.12 and 2.13, so I will figure out some method that is compatible across both versions and open a PR against branch-3.5.

nchammas avatar Feb 27 '24 19:02 nchammas

Backport to 3.5: #45296

nchammas avatar Feb 27 '24 23:02 nchammas