jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8328138: Optimize ArrayEquals on AArch64 & fix potential crash

Open weixlu opened this issue 11 months ago • 8 comments

Current implementation of ArrayEquals on AArch64 is quite complex, due to the variety of checks about alignment, tail processing, bus locking and so on. However, Modern Arm processors have eased such worries. Besides, we found crash when using lilliput. So we proposed to use a simple&straightforward flow of ArrayEquals. With this simplified ArrayEquals, we observed performance gains on the latest arm platforms(Neoverse N1&N2) Test case: org.openjdk.bench.java.util.ArraysEquals

1x vector length, 64-bit aligned array[0]

Test Case N1 N2
testByteFalseBeginning -21.42% -13.37%
testByteFalseEnd 25.79% 27.45%
testByteFalseMid 16.64% 16.46%
testByteTrue 12.39% 24.66%
testCharFalseBeginning -5.27% -3.08%
testCharFalseEnd 29.29% 35.23%
testCharFalseMid 15.13% 19.34%
testCharTrue 21.63% 33.73%
Total 11.77% 17.55%

A key factor is to decide when we should utilize simd in array equals. An aggressive choice is to enable simd as long as array length exceeds vector length(8 words). The corresponding result is shown above, from which we can see performance regression in both testBeginning cases. To avoid such perf impact, we can set simd threshold to 3x vector length.

3x vector length, 64-bit aligned array[0]

n1 n2
testByteFalseBeginning 8.28% 8.64%
testByteFalseEnd 6.38% 12.29%
testByteFalseMid 6.17% 7.96%
testByteTrue -10.08% 3.06%
testCharFalseBeginning -1.42% 7.23%
testCharFalseEnd 4.05% 13.48%
testCharFalseMid 8.79% 16.96%
testCharTrue -5.66% 10.23%
Total 2.06% 9.98%

In addtion to perf improvement, we propose this patch to solve alignment issues in array equals. JDK-8139457 tries to relax alignment of array elements. On the other hand, this misalignment makes it an error to read the whole last word in array equals, in case that the array doesn't occupy the whole word and lilliput is enabled. A detailed explaination quoted from https://github.com/openjdk/jdk/pull/11044#issuecomment-1996771480

The root cause is that default behavior of MacroAssembler::arrays_equals will blindly load whole word before comparison. When the array[0] is aligned to 32-bit, the last word load will exceed the array limit and may touch the next word beyong object layout in heap memory. If the next word which doesn't belong to object self happens to be the boundary of pages and G1 heap regions, the segmentation fault will be triggered. Loading the last word blindly is benign for 64-bit aligned array because it is always inside the object self.

Our patch fixed this problem, and again we would like to show the perf improvement when misalignment.

1x vector length, 32-bit aligned array[0]

N1 N2
testByteFalseBeginning -12.96% -17.50%
testByteFalseEnd 29.43% 32.19%
testByteFalseMid 24.30% 17.54%
testByteTrue 16.57% 24.40%
testCharFalseBeginning 2.40% 0.60%
testCharFalseEnd 32.14% 32.94%
testCharFalseMid 18.86% 17.60%
testCharTrue 25.38% 32.62%
Total 17.01% 17.54%

3x vector length, 32-bit aligned array[0]

N1 N2
testByteFalseBeginning 10.95% 14.23%
testByteFalseEnd 13.83% 12.35%
testByteFalseMid 11.18% 10.31%
testByteTrue -7.13% 6.61%
testCharFalseBeginning 0.38% 7.20%
testCharFalseEnd 2.84% 13.13%
testCharFalseMid 11.17% 15.40%
testCharTrue -5.43% 9.52%
Total 4.72% 11.09%

Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8328138: Optimize ArrayEquals on AArch64 & fix potential crash (Enhancement - P3)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18292/head:pull/18292
$ git checkout pull/18292

Update a local copy of the PR:
$ git checkout pull/18292
$ git pull https://git.openjdk.org/jdk.git pull/18292/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18292

View PR using the GUI difftool:
$ git pr show -t 18292

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18292.diff

Webrev

Link to Webrev Comment

weixlu avatar Mar 14 '24 06:03 weixlu

:wave: Welcome back weixlu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Mar 14 '24 06:03 bridgekeeper[bot]

❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.

openjdk[bot] avatar Mar 14 '24 06:03 openjdk[bot]

@weixlu The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Mar 14 '24 06:03 openjdk[bot]

Webrevs

mlbridge[bot] avatar Mar 14 '24 06:03 mlbridge[bot]

The implementation here serves as a prototype. Anyone interested is welcome to test the performance gain, by turning on/off UseNewCode.

weixlu avatar Mar 14 '24 06:03 weixlu

I get the following results on a Graviton2, the columns are different vector-length multipliers. There doesn't seem to be a clear winner. 1 or 2 may be preferable.

  1 2 3 4
testByteFalseBeginning -8.36% 16.64% 16.58% 16.60%
testByteFalseEnd 30.76% -0.01% -0.01% -0.01%
testByteFalseMid 23.42% -0.01% -0.01% -0.35%
testByteTrue 11.53% -23.09% -23.09% -23.08%
testCharFalseBeginning -0.02% 0.00% -2.30% -0.06%
testCharFalseEnd 32.62% 32.44% 0.02% 0.02%
testCharFalseMid 21.03% 20.75% 7.14% 6.94%
testCharTrue 23.90% 23.90% -10.95% -10.88%

rkennke avatar Mar 14 '24 12:03 rkennke

Same numbers on a Graviton 3

  1 2 3 4
testByteFalseBeginning -17.72% 11.98% 14.92% 15.38%
testByteFalseEnd 43.48% 10.67% 9.97% 10.05%
testByteFalseMid 26.66% 0.16% -0.61% -0.66%
testByteTrue 29.84% 2.06% 2.61% 3.70%
testCharFalseBeginning 5.49% 7.30% -7.13% -14.30%
testCharFalseEnd 49.84% 49.46% 14.17% 14.18%
testCharFalseMid 30.88% 29.38% 13.75% 14.38%
testCharTrue 47.08% 47.58% 12.45% 12.41%

Looks better overall, vector-length * 2 looks best, IMO.

rkennke avatar Mar 14 '24 15:03 rkennke

I'm not entirely convinced this will perform better in real code.

There's a lot of conditional branches at the end, and these need only to be used in the case of a very short array. ArraysEquals repeatedly tests the same array, leading to perfect branch prediction. In the real world this doesn't happen, so the test may be misleading.

Array comparison, in the case of anything except a very short array, can be done by a sequence of wide loads, with a single (probably misaligned) load at the end instead of that sequence of conditional branches.

theRealAph avatar Mar 15 '24 09:03 theRealAph

Hi, this patch looks interesting. All the arrays in the testcase used to test performance of this patch have the same length. Have you tested your patch with short (or variable ) array lengths? I just modified your testcase with a char array of length = 7 and the performance on an aarch64 machine with -XX:+UseNewCode is ~17% worse compared to the default and ~20% worse compared to the case with -XX:+UseSimpleArrayEquals turned on. I agree that it's not possible to gain better performance for each and every testcase but a lot of real world applications do contain a considerable amount of short strings/array compares. With a main comparison loop which processes more elements (64 in your case), it could add more compare-branch instructions as well for a short array leading to more number of instructions executed.

Bhavana-Kilambi avatar Mar 19 '24 15:03 Bhavana-Kilambi

@weixlu This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Apr 16 '24 22:04 bridgekeeper[bot]

@weixlu This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

bridgekeeper[bot] avatar May 15 '24 03:05 bridgekeeper[bot]