jdk
jdk copied to clipboard
8328138: Optimize ArrayEquals on AArch64 & fix potential crash
Current implementation of ArrayEquals on AArch64 is quite complex, due to the variety of checks about alignment, tail processing, bus locking and so on. However, Modern Arm processors have eased such worries. Besides, we found crash when using lilliput. So we proposed to use a simple&straightforward flow of ArrayEquals. With this simplified ArrayEquals, we observed performance gains on the latest arm platforms(Neoverse N1&N2) Test case: org.openjdk.bench.java.util.ArraysEquals
1x vector length, 64-bit aligned array[0]
Test Case | N1 | N2 |
---|---|---|
testByteFalseBeginning | -21.42% | -13.37% |
testByteFalseEnd | 25.79% | 27.45% |
testByteFalseMid | 16.64% | 16.46% |
testByteTrue | 12.39% | 24.66% |
testCharFalseBeginning | -5.27% | -3.08% |
testCharFalseEnd | 29.29% | 35.23% |
testCharFalseMid | 15.13% | 19.34% |
testCharTrue | 21.63% | 33.73% |
Total | 11.77% | 17.55% |
A key factor is to decide when we should utilize simd in array equals. An aggressive choice is to enable simd as long as array length exceeds vector length(8 words). The corresponding result is shown above, from which we can see performance regression in both testBeginning cases. To avoid such perf impact, we can set simd threshold to 3x vector length.
3x vector length, 64-bit aligned array[0]
n1 | n2 | |
---|---|---|
testByteFalseBeginning | 8.28% | 8.64% |
testByteFalseEnd | 6.38% | 12.29% |
testByteFalseMid | 6.17% | 7.96% |
testByteTrue | -10.08% | 3.06% |
testCharFalseBeginning | -1.42% | 7.23% |
testCharFalseEnd | 4.05% | 13.48% |
testCharFalseMid | 8.79% | 16.96% |
testCharTrue | -5.66% | 10.23% |
Total | 2.06% | 9.98% |
In addtion to perf improvement, we propose this patch to solve alignment issues in array equals. JDK-8139457 tries to relax alignment of array elements. On the other hand, this misalignment makes it an error to read the whole last word in array equals, in case that the array doesn't occupy the whole word and lilliput is enabled. A detailed explaination quoted from https://github.com/openjdk/jdk/pull/11044#issuecomment-1996771480
The root cause is that default behavior of MacroAssembler::arrays_equals will blindly load whole word before comparison. When the array[0] is aligned to 32-bit, the last word load will exceed the array limit and may touch the next word beyong object layout in heap memory. If the next word which doesn't belong to object self happens to be the boundary of pages and G1 heap regions, the segmentation fault will be triggered. Loading the last word blindly is benign for 64-bit aligned array because it is always inside the object self.
Our patch fixed this problem, and again we would like to show the perf improvement when misalignment.
1x vector length, 32-bit aligned array[0]
N1 | N2 | |
---|---|---|
testByteFalseBeginning | -12.96% | -17.50% |
testByteFalseEnd | 29.43% | 32.19% |
testByteFalseMid | 24.30% | 17.54% |
testByteTrue | 16.57% | 24.40% |
testCharFalseBeginning | 2.40% | 0.60% |
testCharFalseEnd | 32.14% | 32.94% |
testCharFalseMid | 18.86% | 17.60% |
testCharTrue | 25.38% | 32.62% |
Total | 17.01% | 17.54% |
3x vector length, 32-bit aligned array[0]
N1 | N2 | |
---|---|---|
testByteFalseBeginning | 10.95% | 14.23% |
testByteFalseEnd | 13.83% | 12.35% |
testByteFalseMid | 11.18% | 10.31% |
testByteTrue | -7.13% | 6.61% |
testCharFalseBeginning | 0.38% | 7.20% |
testCharFalseEnd | 2.84% | 13.13% |
testCharFalseMid | 11.17% | 15.40% |
testCharTrue | -5.43% | 9.52% |
Total | 4.72% | 11.09% |
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8328138: Optimize ArrayEquals on AArch64 & fix potential crash (Enhancement - P3)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18292/head:pull/18292
$ git checkout pull/18292
Update a local copy of the PR:
$ git checkout pull/18292
$ git pull https://git.openjdk.org/jdk.git pull/18292/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 18292
View PR using the GUI difftool:
$ git pr show -t 18292
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18292.diff
Webrev
:wave: Welcome back weixlu! A progress list of the required criteria for merging this PR into master
will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.
@weixlu The following label will be automatically applied to this pull request:
-
hotspot
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
The implementation here serves as a prototype. Anyone interested is welcome to test the performance gain, by turning on/off UseNewCode.
I get the following results on a Graviton2, the columns are different vector-length multipliers. There doesn't seem to be a clear winner. 1 or 2 may be preferable.
1 | 2 | 3 | 4 | |
---|---|---|---|---|
testByteFalseBeginning | -8.36% | 16.64% | 16.58% | 16.60% |
testByteFalseEnd | 30.76% | -0.01% | -0.01% | -0.01% |
testByteFalseMid | 23.42% | -0.01% | -0.01% | -0.35% |
testByteTrue | 11.53% | -23.09% | -23.09% | -23.08% |
testCharFalseBeginning | -0.02% | 0.00% | -2.30% | -0.06% |
testCharFalseEnd | 32.62% | 32.44% | 0.02% | 0.02% |
testCharFalseMid | 21.03% | 20.75% | 7.14% | 6.94% |
testCharTrue | 23.90% | 23.90% | -10.95% | -10.88% |
Same numbers on a Graviton 3
1 | 2 | 3 | 4 | |
---|---|---|---|---|
testByteFalseBeginning | -17.72% | 11.98% | 14.92% | 15.38% |
testByteFalseEnd | 43.48% | 10.67% | 9.97% | 10.05% |
testByteFalseMid | 26.66% | 0.16% | -0.61% | -0.66% |
testByteTrue | 29.84% | 2.06% | 2.61% | 3.70% |
testCharFalseBeginning | 5.49% | 7.30% | -7.13% | -14.30% |
testCharFalseEnd | 49.84% | 49.46% | 14.17% | 14.18% |
testCharFalseMid | 30.88% | 29.38% | 13.75% | 14.38% |
testCharTrue | 47.08% | 47.58% | 12.45% | 12.41% |
Looks better overall, vector-length * 2 looks best, IMO.
I'm not entirely convinced this will perform better in real code.
There's a lot of conditional branches at the end, and these need only to be used in the case of a very short array. ArraysEquals repeatedly tests the same array, leading to perfect branch prediction. In the real world this doesn't happen, so the test may be misleading.
Array comparison, in the case of anything except a very short array, can be done by a sequence of wide loads, with a single (probably misaligned) load at the end instead of that sequence of conditional branches.
Hi, this patch looks interesting. All the arrays in the testcase used to test performance of this patch have the same length. Have you tested your patch with short (or variable ) array lengths? I just modified your testcase with a char array of length = 7 and the performance on an aarch64 machine with -XX:+UseNewCode is ~17% worse compared to the default and ~20% worse compared to the case with -XX:+UseSimpleArrayEquals turned on. I agree that it's not possible to gain better performance for each and every testcase but a lot of real world applications do contain a considerable amount of short strings/array compares. With a main comparison loop which processes more elements (64 in your case), it could add more compare-branch instructions as well for a short array leading to more number of instructions executed.
@weixlu This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@weixlu This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open
pull request command.