8322770: Implement C2 VectorizedHashCode on AArch64
Hello,
Please review the following PR for JDK-8322770 Implement C2 VectorizedHashCode on AArch64. It follows previous work done in https://github.com/openjdk/jdk/pull/16629 and https://github.com/openjdk/jdk/pull/10847 for RISC-V and x86 respectively.
The code to calculate a hash code consists of two parts: a vectorized loop of Neon instructions that processes 4 or 8 elements per iteration, depending on the data type, and a fully unrolled scalar "loop" that processes up to 7 tail elements.
At the time of writing I don't see potential benefits from providing an SVE/SVE2 implementation, but it could be added as a follow-up or independently later if required.
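For reference, the scalar recurrence being vectorized and the identity one 4-element Neon iteration exploits can be sketched in Java as follows (illustrative only; these are not names or code from the patch):

// Scalar definition: result = 31^n * seed + 31^(n-1)*a[0] + ... + 31^0*a[n-1]
static int scalarHash(int result, int[] a) {
    for (int v : a) {
        result = 31 * result + v;
    }
    return result;
}

// One 4-element vector iteration folds a block into the running hash:
//   h' = 31^4*h + 31^3*a[i] + 31^2*a[i+1] + 31*a[i+2] + a[i+3]
static int hashBlockOf4(int h, int[] a, int i) {
    return 31 * 31 * 31 * 31 * h
         + 31 * 31 * 31 * a[i]
         + 31 * 31 * a[i + 1]
         + 31 * a[i + 2]
         + a[i + 3];
}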
Performance
Neoverse N1
--------------------------------------------------------------------------------------------
Version Baseline This patch
--------------------------------------------------------------------------------------------
Benchmark (size) Mode Cnt Score Error Score Error Units
--------------------------------------------------------------------------------------------
ArraysHashCode.bytes 1 avgt 15 1.249 ± 0.060 1.247 ± 0.062 ns/op
ArraysHashCode.bytes 10 avgt 15 8.754 ± 0.028 4.387 ± 0.015 ns/op
ArraysHashCode.bytes 100 avgt 15 98.596 ± 0.051 26.655 ± 0.097 ns/op
ArraysHashCode.bytes 10000 avgt 15 10150.578 ± 1.352 2649.962 ± 216.744 ns/op
ArraysHashCode.chars 1 avgt 15 1.286 ± 0.062 1.246 ± 0.054 ns/op
ArraysHashCode.chars 10 avgt 15 8.731 ± 0.002 5.344 ± 0.003 ns/op
ArraysHashCode.chars 100 avgt 15 98.632 ± 0.048 23.023 ± 0.142 ns/op
ArraysHashCode.chars 10000 avgt 15 10150.658 ± 3.374 2410.504 ± 8.872 ns/op
ArraysHashCode.ints 1 avgt 15 1.189 ± 0.005 1.187 ± 0.001 ns/op
ArraysHashCode.ints 10 avgt 15 8.730 ± 0.002 5.676 ± 0.001 ns/op
ArraysHashCode.ints 100 avgt 15 98.559 ± 0.016 24.378 ± 0.006 ns/op
ArraysHashCode.ints 10000 avgt 15 10148.752 ± 1.336 2419.015 ± 0.492 ns/op
ArraysHashCode.multibytes 1 avgt 15 1.037 ± 0.001 1.037 ± 0.001 ns/op
ArraysHashCode.multibytes 10 avgt 15 5.481 ± 0.001 3.136 ± 0.001 ns/op
ArraysHashCode.multibytes 100 avgt 15 50.950 ± 0.006 15.277 ± 0.007 ns/op
ArraysHashCode.multibytes 10000 avgt 15 5335.181 ± 0.692 1340.850 ± 4.291 ns/op
ArraysHashCode.multichars 1 avgt 15 1.038 ± 0.001 1.037 ± 0.001 ns/op
ArraysHashCode.multichars 10 avgt 15 5.480 ± 0.001 3.783 ± 0.001 ns/op
ArraysHashCode.multichars 100 avgt 15 50.955 ± 0.006 13.890 ± 0.018 ns/op
ArraysHashCode.multichars 10000 avgt 15 5338.597 ± 0.853 1335.599 ± 0.652 ns/op
ArraysHashCode.multiints 1 avgt 15 1.042 ± 0.001 1.043 ± 0.001 ns/op
ArraysHashCode.multiints 10 avgt 15 5.526 ± 0.001 3.866 ± 0.001 ns/op
ArraysHashCode.multiints 100 avgt 15 50.917 ± 0.005 14.918 ± 0.026 ns/op
ArraysHashCode.multiints 10000 avgt 15 5348.365 ± 5.836 1287.685 ± 1.083 ns/op
ArraysHashCode.multishorts 1 avgt 15 1.036 ± 0.001 1.037 ± 0.001 ns/op
ArraysHashCode.multishorts 10 avgt 15 5.480 ± 0.001 3.783 ± 0.001 ns/op
ArraysHashCode.multishorts 100 avgt 15 50.975 ± 0.034 13.890 ± 0.015 ns/op
ArraysHashCode.multishorts 10000 avgt 15 5338.790 ± 1.276 1337.034 ± 1.600 ns/op
ArraysHashCode.shorts 1 avgt 15 1.187 ± 0.001 1.187 ± 0.001 ns/op
ArraysHashCode.shorts 10 avgt 15 8.731 ± 0.002 5.342 ± 0.001 ns/op
ArraysHashCode.shorts 100 avgt 15 98.544 ± 0.013 23.017 ± 0.141 ns/op
ArraysHashCode.shorts 10000 avgt 15 10148.275 ± 1.119 2408.041 ± 1.478 ns/op
Neoverse N2, Neoverse V1
Performance metrics have been collected for these cores as well. They are similar to the results above and can be posted upon request.
Test
Full jtreg passed on AArch64 and x86.
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8322770: Implement C2 VectorizedHashCode on AArch64 (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18487/head:pull/18487
$ git checkout pull/18487
Update a local copy of the PR:
$ git checkout pull/18487
$ git pull https://git.openjdk.org/jdk.git pull/18487/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 18487
View PR using the GUI difftool:
$ git pr show -t 18487
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18487.diff
Webrev
Hi @mikabl-arm, welcome to this OpenJDK project and thanks for contributing!
We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.
If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user mikabl-arm" as summary for the issue.
If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.
@mikabl-arm This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8322770: Implement C2 VectorizedHashCode on AArch64
Reviewed-by: aph, adinn
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 165 new commits pushed to the master branch:
- 52ba72823be0c969ab873ead2863ec48f883210b: 8327114: Attach in Linux may have wrong behaviour when pid == ns_pid (Kubernetes debug container)
- 988a531b097ccbd699d233059d73f41cae24dc5b: 8340181: Shenandoah: Cleanup ShenandoahRuntime stubs
- 822a773873c42ea27a6be90da92b2b2c9fb8caee: 8340605: Open source several AWT PopupMenu tests
- 6514aef8403fa5fc09e5c064a783ff0f1fccd0cf: 8340419: ZGC: Create an UseLargePages adaptation of TestAllocateHeapAt.java
- ae4d2f15901bf02efceaac26ee4aa3ae666bf467: 8340621: Open source several AWT List tests
- dd56990962d58e4f482773f67bc43383d7748536: 8340639: Open source few more AWT List tests
- ade17ecb6cb5125d048401a878b557e5afefc08c: 8340560: Open Source several AWT/2D font and rendering tests
- 73ebb848fdb66861e912ea747c039ddd1f7a5f48: 8340721: Clarify special case handling of unboxedType and getWildcardType
- ed140f5d5e2dec1217e2efbee815d84306de0563: 8341101: [ARM32] Error: ShouldNotReachHere() in TemplateInterpreterGenerator::generate_math_entry after 8338694
- 082125d61e4b7e0fd53528c0271ca8be621f242b: 8340404: CharsetProvider specification updates
- ... and 155 more: https://git.openjdk.org/jdk/compare/4ff17c14a572a59b60d728c3626f430932eecea6...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph, @adinn) but any other Committer may sponsor as well.
➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).
@mikabl-arm The following label will be automatically applied to this pull request:
hotspot
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
/covered
Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!
Just a trivial note: this change also improves the calculation of String.hashCode(). For instance, on V1
Benchmark size Improvement
StringHashCode.Algorithm.defaultLatin1 1 -2.86%
StringHashCode.Algorithm.defaultLatin1 10 45.84%
StringHashCode.Algorithm.defaultLatin1 100 79.43%
StringHashCode.Algorithm.defaultLatin1 10000 79.16%
StringHashCode.Algorithm.defaultUTF16 1 -1.57%
StringHashCode.Algorithm.defaultUTF16 10 41.83%
StringHashCode.Algorithm.defaultUTF16 100 80.01%
StringHashCode.Algorithm.defaultUTF16 10000 78.44%
SVE can give notable additional speedup only for very long strings (>1k).
Webrevs
- 20: Full - Incremental (1dbb1ddf)
- 19: Full - Incremental (7d1c6b77)
- 18: Full - Incremental (b55d4baa)
- 17: Full - Incremental (03849d62)
- 16: Full - Incremental (6f2bec34)
- 15: Full - Incremental (142fa5d0)
- 14: Full - Incremental (a28bbcd3)
- 13: Full - Incremental (b56be377)
- 12: Full - Incremental (091eecc5)
- 11: Full - Incremental (66b07903)
- 10: Full - Incremental (132baf86)
- 09: Full - Incremental (a824a742)
- 08: Full - Incremental (f5918cca)
- 07: Full - Incremental (6b8eb78c)
- 06: Full - Incremental (03821dfd)
- 05: Full - Incremental (bfa93695)
- 04: Full - Incremental (eb9708c9)
- 03: Full - Incremental (7ddae523)
- 02: Full - Incremental (8e9f8d0c)
- 01: Full (4c6812f6)
- 00: Full (f1920301)
Why are you adding across lanes every time around the loop? You could maintain all of the lanes and then merge the lanes in the tail.
@theRealAph , thank you for the suggestion. That's because the current result (hash sum) has to be multiplied by 31^4 between iterations, where 4 is the number of elements handled per iteration. It's possible to multiply all lanes of the vmultiplication register by 31^4 with MUL (vector) or MUL (by element) on each loop iteration and merge them just once at the end, as you suggested. I tried this approach before and it showed worse performance on the benchmarks compared to the following sequence used in this PR:
addv(vmultiplication, Assembler::T4S, vmultiplication); // sum the four S lanes into lane 0
umov(addend, vmultiplication, Assembler::S, 0);         // move lane 0 to a GP register; sign-extension isn't necessary
maddw(result, result, pow4, addend);                    // result = result * 31^4 + addend (pow4 holds 31^4)
I can re-check and post the performance numbers here on request.
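(For illustration, the alternative discussed above, keeping per-lane accumulators and merging the lanes only once after the loop, corresponds to the following scalar Java model; the name and helper are hypothetical, not code from either version:)

// Each of the 4 lanes accumulates every 4th element; all lanes are scaled by
// 31^4 per iteration and merged with the 31^3..31^0 weights only at the end.
static int keepLanesModel(int result, int[] a, int from, int count) { // count is a multiple of 4
    final int POW4 = 31 * 31 * 31 * 31;
    int[] lane = new int[4];   // models the four S lanes of the Neon accumulator
    int resultScale = 1;       // 31^(number of elements consumed so far)
    for (int i = from; i < from + count; i += 4) {
        for (int l = 0; l < 4; l++) {
            lane[l] = lane[l] * POW4 + a[i + l];   // MUL (vector) + ADD per iteration
        }
        resultScale *= POW4;
    }
    // Single cross-lane merge, instead of addv/umov/maddw on every iteration.
    int merged = lane[0] * 31 * 31 * 31 + lane[1] * 31 * 31 + lane[2] * 31 + lane[3];
    return result * resultScale + merged;
}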
Please do. Please also post the code.
@theRealAph , you may find the performance numbers and the code in https://github.com/mikabl-arm/jdk/commit/f844b116f1a01653f127238d3a258cd2da4e1aca
OK, thanks. I think I see the problem. Unfortunately I've come to the end of my working day, but I'll try to get back to you as soon as possible next week.
In addition, doing only one vector per iteration is very wasteful. A high-performance AArch64 implementation can issue four multiply-accumulate vector instructions per cycle, with a 3-clock latency. By only issuing a single multiply-accumulate per iteration you're leaving a lot of performance on the table. I'd try to make the bulk width 16, and measure from there.
You only need one load, add, and multiply per iteration. You don't need to add across columns until the end.
This is an example of how to do it. The full thing is at https://gist.github.com/theRealAph/cbc85299d6cd24101d46a06c12a97ce6.
public static int vectorizedHashCode(int result, int[] a, int fromIndex, int length) {
    if (length < WIDTH) {
        return hashCode(result, a, fromIndex, length);
    }
    int offset = fromIndex;
    int[] sum = new int[WIDTH];
    sum[WIDTH - 1] = result;
    int[] temp = new int[WIDTH];
    int remaining = length;
    while (remaining >= WIDTH * 2) {
        vmult(sum, sum, n31powerWIDTH);
        vload(temp, a, offset);
        vadd(sum, sum, temp);
        offset += WIDTH;
        remaining -= WIDTH;
    }
    vmult(sum, sum, n31powerWIDTH);
    vload(temp, a, offset);
    vadd(sum, sum, temp);
    vmult(sum, sum, n31powersToWIDTH);
    offset += WIDTH;
    remaining -= WIDTH;
    result = vadd(sum);
    return hashCode(result, a, fromIndex + offset, remaining);
}
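For readers following along, the excerpt relies on a few helpers that are defined in the linked gist but not shown here. Minimal Java sketches consistent with how they are used above (these are assumptions about their shape, not the gist's exact code):

static final int WIDTH = 8;   // assumed vector width in int lanes

// n31powerWIDTH: 31^WIDTH replicated into every lane.
// n31powersToWIDTH: [31^(WIDTH-1), ..., 31, 1], for the final per-lane weighting.
static final int[] n31powerWIDTH = new int[WIDTH];
static final int[] n31powersToWIDTH = new int[WIDTH];
static {
    int p = 1;
    for (int i = WIDTH - 1; i >= 0; i--) {
        n31powersToWIDTH[i] = p;
        p *= 31;
    }
    java.util.Arrays.fill(n31powerWIDTH, p);   // p == 31^WIDTH at this point
}

static void vload(int[] dst, int[] src, int offset) {     // LD1
    System.arraycopy(src, offset, dst, 0, WIDTH);
}

static void vmult(int[] dst, int[] x, int[] y) {          // MUL (vector)
    for (int i = 0; i < WIDTH; i++) dst[i] = x[i] * y[i];
}

static void vadd(int[] dst, int[] x, int[] y) {           // ADD (vector)
    for (int i = 0; i < WIDTH; i++) dst[i] = x[i] + y[i];
}

static int vadd(int[] x) {                                // ADDV, cross-lane sum
    int sum = 0;
    for (int v : x) sum += v;
    return sum;
}

static int hashCode(int result, int[] a, int fromIndex, int length) {  // scalar tail
    for (int i = fromIndex; i < fromIndex + length; i++) {
        result = 31 * result + a[i];
    }
    return result;
}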
@theRealAph , looks reasonable, thank you for providing the listing! I'll get back to you on this once I have updated performance numbers.
A high-performance AArch64 implementation can issue four multiply-accumulate vector instructions per cycle, with a 3-clock latency.
@theRealAph , hmph, could you elaborate on what spec you refer to here?
That's not so much a spec, more Dougall's measured Apple M1 performance: https://dougallj.github.io/applecpu/measurements/firestorm/UMLAL_v_4S.html. Other high-end AArch64 designs can't do that, but they won't suffer by going wider. We should be able to sustain pipelined 4 int-wide elements/cycle.
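(As a rough latency-hiding calculation, assuming the numbers above: accumulators needed ≈ latency × issue rate = 3 cycles × 4 MACs/cycle = 12 independent vector accumulators ≈ 48 int lanes in flight, which is why a much wider bulk width than a single 4-lane accumulator is needed to keep such a core busy.)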
You only need one load, add, and multiply per iteration. You don't need to add across columns until the end.
@theRealAph , I've tried to follow the suggested approach, please find the patch and results in https://github.com/mikabl-arm/jdk/commit/e352f30d89e99417231ae7bb66b325c68a76eef9 .
So far I haven't been able to see any performance benefits compared to the implementation from this PR.
Yeah, true. I can see why that's happening from prof perfnorm:
4.30% ↗ 0x0000ffff70b3cdec: mul v1.4s, v1.4s, v3.4s
0.45% │ 0x0000ffff70b3cdf0: ld1 {v0.4s}, [x1], #16
81.54% │ 0x0000ffff70b3cdf4: add v1.4s, v1.4s, v0.4s
4.83% │ 0x0000ffff70b3cdf8: subs w2, w2, #4
3.55% ╰ 0x0000ffff70b3cdfc: b.hs #0xffff70b3cdec
ArraysHashCode.ints:IPC 1024 avgt 1.395 insns/clk
This is 1.4 insns/clk on a machine that can run 8 insns/clk. Because we're doing one load, then the MAC, then another load after the MAC, then a MAC that depends on the load: we stall the whole core waiting for the next load. Everything is serialized. Neoverse looks the same as Apple M1 here.
I guess the real question here is what we want. x86's engineers get this:
Benchmark (size) Mode Cnt Score Error Units
ArraysHashCode.ints 1 avgt 5 0.834 ± 0.001 ns/op
ArraysHashCode.ints 10 avgt 5 5.500 ± 0.016 ns/op
ArraysHashCode.ints 100 avgt 5 20.330 ± 0.103 ns/op
ArraysHashCode.ints 10000 avgt 5 1365.347 ± 1.045 ns/op
(And that's on my desktop box from 2018, an inferior piece of hardware.)
This is how they do it:
↗ 0x00007f0634c21c17: imul ebx,r11d
0.02% │ 0x00007f0634c21c1b: vmovdqu ymm12,YMMWORD PTR [rdi+rsi*4]
│ 0x00007f0634c21c20: vmovdqu ymm2,YMMWORD PTR [rdi+rsi*4+0x20]
5.36% │ 0x00007f0634c21c26: vmovdqu ymm0,YMMWORD PTR [rdi+rsi*4+0x40]
│ 0x00007f0634c21c2c: vmovdqu ymm1,YMMWORD PTR [rdi+rsi*4+0x60]
0.05% │ 0x00007f0634c21c32: vpmulld ymm8,ymm8,ymm3
11.12% │ 0x00007f0634c21c37: vpaddd ymm8,ymm8,ymm12
4.97% │ 0x00007f0634c21c3c: vpmulld ymm9,ymm9,ymm3
15.09% │ 0x00007f0634c21c41: vpaddd ymm9,ymm9,ymm2
5.16% │ 0x00007f0634c21c45: vpmulld ymm10,ymm10,ymm3
15.51% │ 0x00007f0634c21c4a: vpaddd ymm10,ymm10,ymm0
5.44% │ 0x00007f0634c21c4e: vpmulld ymm11,ymm11,ymm3
16.39% │ 0x00007f0634c21c53: vpaddd ymm11,ymm11,ymm1
4.80% │ 0x00007f0634c21c57: add esi,0x20
│ 0x00007f0634c21c5a: cmp esi,ecx
╰ 0x00007f0634c21c5c: jl 0x00007f0634c21c17
So, do we want to try to beat them on Arm, or not? They surely want to beat Arm.
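In the same sketch style as the earlier Java example, the structure of the unrolled x86 loop above (four independent accumulators, each scaled once per iteration, merged after the loop) looks roughly like this; the names and helpers are illustrative and the incoming seed handling is omitted:

// Computes the hash of a[from .. from+count) with seed 0; count is a multiple of 32.
// acc[0..3] model ymm8..ymm11; each lane is multiplied by 31^32 and gets one new
// element per iteration, so the loop carries four independent dependency chains.
static int unrolledBy4(int[] a, int from, int count) {
    final int LANES = 8;                   // one 256-bit register holds 8 ints
    final int POW32 = pow31(32);           // per-iteration scale factor
    int[][] acc = new int[4][LANES];
    for (int i = from; i < from + count; i += 4 * LANES) {
        for (int r = 0; r < 4; r++) {                      // vpmulld + vpaddd per register
            for (int l = 0; l < LANES; l++) {
                acc[r][l] = acc[r][l] * POW32 + a[i + r * LANES + l];
            }
        }
    }
    int result = 0;                        // merge once: element k of a 32-wide block
    for (int k = 0; k < 4 * LANES; k++) {  // carries the weight 31^(31-k)
        result += acc[k / LANES][k % LANES] * pow31(31 - k);
    }
    return result;
}

static int pow31(int n) {
    int p = 1;
    for (int i = 0; i < n; i++) p *= 31;
    return p;
}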
Hi, is this one stuck? What you have today is definitely an improvement, even though it's not as good as what we have for x86. I guess we could commit this and leave widening the arithmetic for a later enhancement if you have no time to work on it.
Hi @theRealAph , following your suggestions I've got this working for ints and can confirm that it improves the performance. I don't have enough time at the moment to finish it for shorts and bytes though. I can update the patch with current results on Monday and we could decide how to proceed with this PR after that. Sounds good?
Hi,
Yes, that's right.
Just a note so this isn't missed later: the implementation might be affected by https://bugs.openjdk.org/browse/JDK-8139457
I'm finishing up a patch, hopefully I'll push it later today.
Hi @theRealAph ! You may find the latest version here: https://github.com/mikabl-arm/jdk/commit/b3db421c795f683db1a001853990026bafc2ed4b . I gave a short explanation in the commit message, feel free to ask for more details if required.
Unfortunately, it still contains critical bugs and I won't be able to look into the issue before next week at best. Until it's fixed, it's not possible to run the benchmarks. That said, I expect it to improve performance on longer integer arrays, based on a benchmark I've written in C++ and assembly. The results aren't comparable to the JMH results, so I won't post them here.
OK. One small thing, I think it's possible to rearrange things a bit to use mlav, which may help performance. No need for that until the code is correct, though.
@mikabl-arm This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Hi @mikabl-arm , the improvements for larger sizes look impressive, good work! Any timeline for getting it merged?
Hi @snadampal ! Glad that you find the change useful :smile:
Thanks to @nick-arm I've made some progress on fixing the existing issues, so I'm looking forward to updating the PR before next Tuesday.
Hi @theRealAph ! This took a while, but please find a fixed version here: https://github.com/mikabl-arm/jdk/tree/285826-vmul
Here are performance numbers collected for Neoverse V2 compared to the common baseline and the latest state of this PR:
| d2ea6b1e657 | f19203015fb | 5504227bfe3 |
| baseline | PR | 285826-vmul |
----------------------------------------------------------|---------------------------------------|------------------|------
Benchmark (size) Mode Cnt | Score Error | Score Error | Score Error | Units
----------------------------------------------------------|---------------------------------------|------------------|------
ArraysHashCode.bytes 1 avgt 15 | 0.859 ± 0.166 | 0.720 ± 0.103 | 0.732 ± 0.105 | ns/op
ArraysHashCode.bytes 10 avgt 15 | 4.440 ± 0.013 | 2.262 ± 0.009 | 3.454 ± 0.057 | ns/op
ArraysHashCode.bytes 100 avgt 15 | 78.642 ± 0.119 | 15.997 ± 0.023 | 12.753 ± 0.072 | ns/op
ArraysHashCode.bytes 10000 avgt 15 | 9248.961 ± 11.332 | 1879.905 ± 11.609 | 1345.014 ± 1.947 | ns/op
ArraysHashCode.chars 1 avgt 15 | 0.695 ± 0.036 | 0.694 ± 0.035 | 0.682 ± 0.036 | ns/op
ArraysHashCode.chars 10 avgt 15 | 4.436 ± 0.015 | 2.428 ± 0.034 | 3.352 ± 0.031 | ns/op
ArraysHashCode.chars 100 avgt 15 | 78.660 ± 0.113 | 14.508 ± 0.075 | 11.784 ± 0.088 | ns/op
ArraysHashCode.chars 10000 avgt 15 | 9253.807 ± 13.660 | 2010.053 ± 3.549 | 1344.716 ± 1.936 | ns/op
ArraysHashCode.ints 1 avgt 15 | 0.635 ± 0.022 | 0.640 ± 0.022 | 0.640 ± 0.022 | ns/op
ArraysHashCode.ints 10 avgt 15 | 4.424 ± 0.006 | 2.752 ± 0.012 | 3.388 ± 0.004 | ns/op
ArraysHashCode.ints 100 avgt 15 | 78.680 ± 0.120 | 14.794 ± 0.131 | 11.090 ± 0.055 | ns/op
ArraysHashCode.ints 10000 avgt 15 | 9249.520 ± 13.305 | 1997.441 ± 3.299 | 1340.916 ± 1.843 | ns/op
ArraysHashCode.multibytes 1 avgt 15 | 0.566 ± 0.023 | 0.563 ± 0.021 | 0.554 ± 0.012 | ns/op
ArraysHashCode.multibytes 10 avgt 15 | 2.679 ± 0.018 | 1.798 ± 0.038 | 1.973 ± 0.021 | ns/op
ArraysHashCode.multibytes 100 avgt 15 | 36.934 ± 0.055 | 9.118 ± 0.018 | 12.712 ± 0.026 | ns/op
ArraysHashCode.multibytes 10000 avgt 15 | 4861.700 ± 6.563 | 1005.809 ± 2.260 | 721.366 ± 1.570 | ns/op
ArraysHashCode.multichars 1 avgt 15 | 0.557 ± 0.016 | 0.552 ± 0.001 | 0.563 ± 0.021 | ns/op
ArraysHashCode.multichars 10 avgt 15 | 2.700 ± 0.018 | 1.840 ± 0.024 | 1.978 ± 0.008 | ns/op
ArraysHashCode.multichars 100 avgt 15 | 36.932 ± 0.054 | 8.633 ± 0.020 | 8.678 ± 0.052 | ns/op
ArraysHashCode.multichars 10000 avgt 15 | 4859.462 ± 6.693 | 1063.788 ± 3.057 | 752.857 ± 5.262 | ns/op
ArraysHashCode.multiints 1 avgt 15 | 0.574 ± 0.023 | 0.554 ± 0.011 | 0.559 ± 0.017 | ns/op
ArraysHashCode.multiints 10 avgt 15 | 2.707 ± 0.028 | 1.907 ± 0.031 | 1.992 ± 0.036 | ns/op
ArraysHashCode.multiints 100 avgt 15 | 36.942 ± 0.056 | 9.141 ± 0.013 | 8.174 ± 0.029 | ns/op
ArraysHashCode.multiints 10000 avgt 15 | 4872.540 ± 7.479 | 1187.393 ± 12.083 | 785.256 ± 9.472 | ns/op
ArraysHashCode.multishorts 1 avgt 15 | 0.558 ± 0.016 | 0.555 ± 0.012 | 0.566 ± 0.022 | ns/op
ArraysHashCode.multishorts 10 avgt 15 | 2.696 ± 0.015 | 1.854 ± 0.027 | 1.983 ± 0.009 | ns/op
ArraysHashCode.multishorts 100 avgt 15 | 36.930 ± 0.051 | 8.652 ± 0.011 | 8.681 ± 0.039 | ns/op
ArraysHashCode.multishorts 10000 avgt 15 | 4863.966 ± 6.736 | 1068.627 ± 1.902 | 760.280 ± 5.150 | ns/op
ArraysHashCode.shorts 1 avgt 15 | 0.665 ± 0.058 | 0.644 ± 0.022 | 0.636 ± 0.023 | ns/op
ArraysHashCode.shorts 10 avgt 15 | 4.431 ± 0.006 | 2.432 ± 0.024 | 3.332 ± 0.026 | ns/op
ArraysHashCode.shorts 100 avgt 15 | 78.630 ± 0.103 | 14.521 ± 0.077 | 11.783 ± 0.093 | ns/op
ArraysHashCode.shorts 10000 avgt 15 | 9249.908 ± 12.039 | 2010.461 ± 2.548 | 1344.441 ± 1.818 | ns/op
StringHashCode.Algorithm.defaultLatin1 1 avgt 15 | 0.770 ± 0.001 | 0.770 ± 0.001 | 0.770 ± 0.001 | ns/op
StringHashCode.Algorithm.defaultLatin1 10 avgt 15 | 4.305 ± 0.009 | 2.260 ± 0.009 | 3.433 ± 0.015 | ns/op
StringHashCode.Algorithm.defaultLatin1 100 avgt 15 | 78.355 ± 0.102 | 16.140 ± 0.038 | 12.767 ± 0.023 | ns/op
StringHashCode.Algorithm.defaultLatin1 10000 avgt 15 | 9269.665 ± 13.817 | 1893.354 ± 3.677 | 1345.571 ± 1.930 | ns/op
StringHashCode.Algorithm.defaultUTF16 1 avgt 15 | 0.736 ± 0.100 | 0.653 ± 0.083 | 0.690 ± 0.101 | ns/op
StringHashCode.Algorithm.defaultUTF16 10 avgt 15 | 4.280 ± 0.018 | 2.374 ± 0.021 | 3.394 ± 0.010 | ns/op
StringHashCode.Algorithm.defaultUTF16 100 avgt 15 | 78.312 ± 0.118 | 14.603 ± 0.103 | 11.837 ± 0.016 | ns/op
StringHashCode.Algorithm.defaultUTF16 10000 avgt 15 | 9249.562 ± 13.113 | 2011.717 ± 4.097 | 1344.715 ± 1.896 | ns/op
StringHashCode.cached N/A avgt 15 | 0.539 ± 0.027 | 0.525 ± 0.018 | 0.525 ± 0.018 | ns/op
StringHashCode.empty N/A avgt 15 | 0.861 ± 0.163 | 0.670 ± 0.079 | 0.694 ± 0.093 | ns/op
StringHashCode.notCached N/A avgt 15 | 0.698 ± 0.108 | 0.648 ± 0.024 | 0.637 ± 0.023 | ns/op
There are several known issues:
- [x] For arrays shorter than the number of elements processed by a single iteration of the Neon loop, performance is not optimal, though still better than the baseline's.
- [x] The intrinsic takes 364 bytes in the worst case (for BYTE/BOOLEAN types), which may either significantly increase code size or limit inlining opportunities.
- [ ] As mentioned before, the implementation might be affected by https://bugs.openjdk.org/browse/JDK-8139457 .
To address the first two we could implement the vectorized part of the algorithm as a separate stub method. Please let me know if this sounds like the right approach or you have other suggestions.