jdk 8328138: Optimize ArrayEquals on AArch64 & fix potential crash

Current implementation of ArrayEquals on AArch64 is quite complex, due to the variety of checks about alignment, tail processing, bus locking and so on. However, Modern Arm processors have eased such worries. Besides, we found crash when using lilliput. So we proposed to use a simple&straightforward flow of ArrayEquals. With this simplified ArrayEquals, we observed performance gains on the latest arm platforms(Neoverse N1&N2) Test case: org.openjdk.bench.java.util.ArraysEquals

1x vector length, 64-bit aligned array[0]

Test Case	N1	N2
testByteFalseBeginning	-21.42%	-13.37%
testByteFalseEnd	25.79%	27.45%
testByteFalseMid	16.64%	16.46%
testByteTrue	12.39%	24.66%
testCharFalseBeginning	-5.27%	-3.08%
testCharFalseEnd	29.29%	35.23%
testCharFalseMid	15.13%	19.34%
testCharTrue	21.63%	33.73%
Total	11.77%	17.55%

A key factor is to decide when we should utilize simd in array equals. An aggressive choice is to enable simd as long as array length exceeds vector length(8 words). The corresponding result is shown above, from which we can see performance regression in both testBeginning cases. To avoid such perf impact, we can set simd threshold to 3x vector length.

3x vector length, 64-bit aligned array[0]

	n1	n2
testByteFalseBeginning	8.28%	8.64%
testByteFalseEnd	6.38%	12.29%
testByteFalseMid	6.17%	7.96%
testByteTrue	-10.08%	3.06%
testCharFalseBeginning	-1.42%	7.23%
testCharFalseEnd	4.05%	13.48%
testCharFalseMid	8.79%	16.96%
testCharTrue	-5.66%	10.23%
Total	2.06%	9.98%

In addtion to perf improvement, we propose this patch to solve alignment issues in array equals. JDK-8139457 tries to relax alignment of array elements. On the other hand, this misalignment makes it an error to read the whole last word in array equals, in case that the array doesn't occupy the whole word and lilliput is enabled. A detailed explaination quoted from https://github.com/openjdk/jdk/pull/11044#issuecomment-1996771480

The root cause is that default behavior of MacroAssembler::arrays_equals will blindly load whole word before comparison. When the array[0] is aligned to 32-bit, the last word load will exceed the array limit and may touch the next word beyong object layout in heap memory. If the next word which doesn't belong to object self happens to be the boundary of pages and G1 heap regions, the segmentation fault will be triggered. Loading the last word blindly is benign for 64-bit aligned array because it is always inside the object self.

Our patch fixed this problem, and again we would like to show the perf improvement when misalignment.

1x vector length, 32-bit aligned array[0]

	N1	N2
testByteFalseBeginning	-12.96%	-17.50%
testByteFalseEnd	29.43%	32.19%
testByteFalseMid	24.30%	17.54%
testByteTrue	16.57%	24.40%
testCharFalseBeginning	2.40%	0.60%
testCharFalseEnd	32.14%	32.94%
testCharFalseMid	18.86%	17.60%
testCharTrue	25.38%	32.62%
Total	17.01%	17.54%

3x vector length, 32-bit aligned array[0]

	N1	N2
testByteFalseBeginning	10.95%	14.23%
testByteFalseEnd	13.83%	12.35%
testByteFalseMid	11.18%	10.31%
testByteTrue	-7.13%	6.61%
testCharFalseBeginning	0.38%	7.20%
testCharFalseEnd	2.84%	13.13%
testCharFalseMid	11.17%	15.40%
testCharTrue	-5.43%	9.52%
Total	4.72%	11.09%

Progress

[ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue

Issue

JDK-8328138: Optimize ArrayEquals on AArch64 & fix potential crash (Enhancement - P3)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18292/head:pull/18292
$ git checkout pull/18292

Update a local copy of the PR:
$ git checkout pull/18292
$ git pull https://git.openjdk.org/jdk.git pull/18292/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18292

View PR using the GUI difftool:
$ git pr show -t 18292

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18292.diff

Webrev

Link to Webrev Comment

Mar 14 '24 06:03 weixlu

:wave: Welcome back weixlu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Mar 14 '24 06:03 bridgekeeper[bot]

❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.

Mar 14 '24 06:03 openjdk[bot]

@weixlu The following label will be automatically applied to this pull request:

hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Mar 14 '24 06:03 openjdk[bot]

Webrevs

00: Full (20d3e747)

Mar 14 '24 06:03 mlbridge[bot]

The implementation here serves as a prototype. Anyone interested is welcome to test the performance gain, by turning on/off UseNewCode.

Mar 14 '24 06:03 weixlu

I get the following results on a Graviton2, the columns are different vector-length multipliers. There doesn't seem to be a clear winner. 1 or 2 may be preferable.

	1	2	3	4
testByteFalseBeginning	-8.36%	16.64%	16.58%	16.60%
testByteFalseEnd	30.76%	-0.01%	-0.01%	-0.01%
testByteFalseMid	23.42%	-0.01%	-0.01%	-0.35%
testByteTrue	11.53%	-23.09%	-23.09%	-23.08%
testCharFalseBeginning	-0.02%	0.00%	-2.30%	-0.06%
testCharFalseEnd	32.62%	32.44%	0.02%	0.02%
testCharFalseMid	21.03%	20.75%	7.14%	6.94%
testCharTrue	23.90%	23.90%	-10.95%	-10.88%

Mar 14 '24 12:03 rkennke

Same numbers on a Graviton 3

	1	2	3	4
testByteFalseBeginning	-17.72%	11.98%	14.92%	15.38%
testByteFalseEnd	43.48%	10.67%	9.97%	10.05%
testByteFalseMid	26.66%	0.16%	-0.61%	-0.66%
testByteTrue	29.84%	2.06%	2.61%	3.70%
testCharFalseBeginning	5.49%	7.30%	-7.13%	-14.30%
testCharFalseEnd	49.84%	49.46%	14.17%	14.18%
testCharFalseMid	30.88%	29.38%	13.75%	14.38%
testCharTrue	47.08%	47.58%	12.45%	12.41%

Looks better overall, vector-length * 2 looks best, IMO.

Mar 14 '24 15:03 rkennke

I'm not entirely convinced this will perform better in real code.

There's a lot of conditional branches at the end, and these need only to be used in the case of a very short array. ArraysEquals repeatedly tests the same array, leading to perfect branch prediction. In the real world this doesn't happen, so the test may be misleading.

Array comparison, in the case of anything except a very short array, can be done by a sequence of wide loads, with a single (probably misaligned) load at the end instead of that sequence of conditional branches.

Mar 15 '24 09:03 theRealAph

Hi, this patch looks interesting. All the arrays in the testcase used to test performance of this patch have the same length. Have you tested your patch with short (or variable ) array lengths? I just modified your testcase with a char array of length = 7 and the performance on an aarch64 machine with -XX:+UseNewCode is ~17% worse compared to the default and ~20% worse compared to the case with -XX:+UseSimpleArrayEquals turned on. I agree that it's not possible to gain better performance for each and every testcase but a lot of real world applications do contain a considerable amount of short strings/array compares. With a main comparison loop which processes more elements (64 in your case), it could add more compare-branch instructions as well for a short array leading to more number of instructions executed.

Mar 19 '24 15:03 Bhavana-Kilambi

@weixlu This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Apr 16 '24 22:04 bridgekeeper[bot]

@weixlu This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

May 15 '24 03:05 bridgekeeper[bot]

jdk jdk copied to clipboard

8328138: Optimize ArrayEquals on AArch64 & fix potential crash

Progress

Issue

Reviewing

Webrev

Webrevs

jdk
jdk copied to clipboard