
8322770: Implement C2 VectorizedHashCode on AArch64

Open mikabl-arm opened this issue 1 year ago • 26 comments

Hello,

Please review the following PR for JDK-8322770 Implement C2 VectorizedHashCode on AArch64. It follows previous work done in https://github.com/openjdk/jdk/pull/16629 and https://github.com/openjdk/jdk/pull/10847 for RISC-V and x86 respectively.

The code to calculate a hash code consists of two parts: a vectorized loop of Neon instructions that processes 4 or 8 elements per iteration depending on the data type, and a fully unrolled scalar "loop" that processes up to 7 tail elements.

At the time of writing I don't see a potential benefit from providing an SVE/SVE2 implementation, but one could be added as a follow-up or independently later if required.
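
For illustration, here is a minimal scalar Java sketch of the arithmetic the intrinsic performs for int arrays (4 elements per Neon iteration). This is a model only, not the actual stub code, and the method and constant names are invented for the sketch:

    // Scalar model of the intrinsic's structure: a 4-element main loop plus a scalar tail.
    static int hashCodeSketch(int result, int[] a) {
        final int POW31_4 = 31 * 31 * 31 * 31; // multiplier applied to the running hash once per 4-element block
        int i = 0;
        // Main loop: corresponds to one Neon iteration folding 4 elements into the running hash.
        for (; i + 4 <= a.length; i += 4) {
            int block = a[i] * 31 * 31 * 31 + a[i + 1] * 31 * 31 + a[i + 2] * 31 + a[i + 3];
            result = result * POW31_4 + block;
        }
        // Scalar tail: the stub fully unrolls this part (up to 7 elements for bytes, at most 3 for ints).
        for (; i < a.length; i++) {
            result = 31 * result + a[i];
        }
        return result;
    }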

Performance

Neoverse N1

  --------------------------------------------------------------------------------------------
  Version                                            Baseline           This patch
  --------------------------------------------------------------------------------------------
  Benchmark                   (size)  Mode  Cnt      Score    Error     Score     Error  Units
  --------------------------------------------------------------------------------------------
  ArraysHashCode.bytes             1  avgt   15      1.249 ±  0.060     1.247 ±   0.062  ns/op
  ArraysHashCode.bytes            10  avgt   15      8.754 ±  0.028     4.387 ±   0.015  ns/op
  ArraysHashCode.bytes           100  avgt   15     98.596 ±  0.051    26.655 ±   0.097  ns/op
  ArraysHashCode.bytes         10000  avgt   15  10150.578 ±  1.352  2649.962 ± 216.744  ns/op
  ArraysHashCode.chars             1  avgt   15      1.286 ±  0.062     1.246 ±   0.054  ns/op
  ArraysHashCode.chars            10  avgt   15      8.731 ±  0.002     5.344 ±   0.003  ns/op
  ArraysHashCode.chars           100  avgt   15     98.632 ±  0.048    23.023 ±   0.142  ns/op
  ArraysHashCode.chars         10000  avgt   15  10150.658 ±  3.374  2410.504 ±   8.872  ns/op
  ArraysHashCode.ints              1  avgt   15      1.189 ±  0.005     1.187 ±   0.001  ns/op
  ArraysHashCode.ints             10  avgt   15      8.730 ±  0.002     5.676 ±   0.001  ns/op
  ArraysHashCode.ints            100  avgt   15     98.559 ±  0.016    24.378 ±   0.006  ns/op
  ArraysHashCode.ints          10000  avgt   15  10148.752 ±  1.336  2419.015 ±   0.492  ns/op
  ArraysHashCode.multibytes        1  avgt   15      1.037 ±  0.001     1.037 ±   0.001  ns/op
  ArraysHashCode.multibytes       10  avgt   15      5.481 ±  0.001     3.136 ±   0.001  ns/op
  ArraysHashCode.multibytes      100  avgt   15     50.950 ±  0.006    15.277 ±   0.007  ns/op
  ArraysHashCode.multibytes    10000  avgt   15   5335.181 ±  0.692  1340.850 ±   4.291  ns/op
  ArraysHashCode.multichars        1  avgt   15      1.038 ±  0.001     1.037 ±   0.001  ns/op
  ArraysHashCode.multichars       10  avgt   15      5.480 ±  0.001     3.783 ±   0.001  ns/op
  ArraysHashCode.multichars      100  avgt   15     50.955 ±  0.006    13.890 ±   0.018  ns/op
  ArraysHashCode.multichars    10000  avgt   15   5338.597 ±  0.853  1335.599 ±   0.652  ns/op
  ArraysHashCode.multiints         1  avgt   15      1.042 ±  0.001     1.043 ±   0.001  ns/op
  ArraysHashCode.multiints        10  avgt   15      5.526 ±  0.001     3.866 ±   0.001  ns/op
  ArraysHashCode.multiints       100  avgt   15     50.917 ±  0.005    14.918 ±   0.026  ns/op
  ArraysHashCode.multiints     10000  avgt   15   5348.365 ±  5.836  1287.685 ±   1.083  ns/op
  ArraysHashCode.multishorts       1  avgt   15      1.036 ±  0.001     1.037 ±   0.001  ns/op
  ArraysHashCode.multishorts      10  avgt   15      5.480 ±  0.001     3.783 ±   0.001  ns/op
  ArraysHashCode.multishorts     100  avgt   15     50.975 ±  0.034    13.890 ±   0.015  ns/op
  ArraysHashCode.multishorts   10000  avgt   15   5338.790 ±  1.276  1337.034 ±   1.600  ns/op
  ArraysHashCode.shorts            1  avgt   15      1.187 ±  0.001     1.187 ±   0.001  ns/op
  ArraysHashCode.shorts           10  avgt   15      8.731 ±  0.002     5.342 ±   0.001  ns/op
  ArraysHashCode.shorts          100  avgt   15     98.544 ±  0.013    23.017 ±   0.141  ns/op
  ArraysHashCode.shorts        10000  avgt   15  10148.275 ±  1.119  2408.041 ±   1.478  ns/op

Neoverse N2, Neoverse V1

Performance metrics have been collected for these cores as well. They are similar to the results above and can be posted upon request.

Test

Full jtreg passed on AArch64 and x86.


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8322770: Implement C2 VectorizedHashCode on AArch64 (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18487/head:pull/18487
$ git checkout pull/18487

Update a local copy of the PR:
$ git checkout pull/18487
$ git pull https://git.openjdk.org/jdk.git pull/18487/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18487

View PR using the GUI difftool:
$ git pr show -t 18487

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18487.diff

Webrev

Link to Webrev Comment

mikabl-arm avatar Mar 26 '24 13:03 mikabl-arm

Hi @mikabl-arm, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user mikabl-arm" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

bridgekeeper[bot] avatar Mar 26 '24 13:03 bridgekeeper[bot]

@mikabl-arm This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8322770: Implement C2 VectorizedHashCode on AArch64

Reviewed-by: aph, adinn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 165 new commits pushed to the master branch:

  • 52ba72823be0c969ab873ead2863ec48f883210b: 8327114: Attach in Linux may have wrong behaviour when pid == ns_pid (Kubernetes debug container)
  • 988a531b097ccbd699d233059d73f41cae24dc5b: 8340181: Shenandoah: Cleanup ShenandoahRuntime stubs
  • 822a773873c42ea27a6be90da92b2b2c9fb8caee: 8340605: Open source several AWT PopupMenu tests
  • 6514aef8403fa5fc09e5c064a783ff0f1fccd0cf: 8340419: ZGC: Create an UseLargePages adaptation of TestAllocateHeapAt.java
  • ae4d2f15901bf02efceaac26ee4aa3ae666bf467: 8340621: Open source several AWT List tests
  • dd56990962d58e4f482773f67bc43383d7748536: 8340639: Open source few more AWT List tests
  • ade17ecb6cb5125d048401a878b557e5afefc08c: 8340560: Open Source several AWT/2D font and rendering tests
  • 73ebb848fdb66861e912ea747c039ddd1f7a5f48: 8340721: Clarify special case handling of unboxedType and getWildcardType
  • ed140f5d5e2dec1217e2efbee815d84306de0563: 8341101: [ARM32] Error: ShouldNotReachHere() in TemplateInterpreterGenerator::generate_math_entry after 8338694
  • 082125d61e4b7e0fd53528c0271ca8be621f242b: 8340404: CharsetProvider specification updates
  • ... and 155 more: https://git.openjdk.org/jdk/compare/4ff17c14a572a59b60d728c3626f430932eecea6...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@theRealAph, @adinn) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

openjdk[bot] avatar Mar 26 '24 14:03 openjdk[bot]

@mikabl-arm The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Mar 26 '24 14:03 openjdk[bot]

/covered

mikabl-arm avatar Mar 26 '24 14:03 mikabl-arm

Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

bridgekeeper[bot] avatar Mar 26 '24 14:03 bridgekeeper[bot]

Just a trivial note: this change also improves the calculation of String.hashCode(). For instance, on V1

Benchmark                                size   Improvement
StringHashCode.Algorithm.defaultLatin1      1        -2.86%
StringHashCode.Algorithm.defaultLatin1     10        45.84%
StringHashCode.Algorithm.defaultLatin1    100        79.43%
StringHashCode.Algorithm.defaultLatin1  10000        79.16%
StringHashCode.Algorithm.defaultUTF16       1        -1.57%
StringHashCode.Algorithm.defaultUTF16      10        41.83%
StringHashCode.Algorithm.defaultUTF16     100        80.01%
StringHashCode.Algorithm.defaultUTF16   10000        78.44%

SVE can give notable additional speedup only for very long strings (>1k).

dchuyko avatar Mar 28 '24 11:03 dchuyko

Why are you adding across lanes every time around the loop? You could maintain all of the lanes and then merge the lanes in the tail.

theRealAph avatar Apr 16 '24 09:04 theRealAph

Why are you adding across lanes every time around the loop? You could maintain all of the lanes and then merge the lanes in the tail.

@theRealAph , thank you for the suggestion. That's because the current result (hash sum) has to be multiplied by 31^4 between iterations, where 4 is the number of elements handled per iteration. It is possible to multiply all lanes of the vmultiplication register by 31^4 with MUL (vector) or MUL (by element) on each loop iteration and merge them only once at the end, as you suggested. However, I tried this approach before and it shows worse performance on the benchmarks than the following sequence used in this PR:

    addv(vmultiplication, Assembler::T4S, vmultiplication); // reduce the four lanes into a single 32-bit sum
    umov(addend, vmultiplication, Assembler::S, 0);         // move lane 0 to a GPR; sign-extension isn't necessary
    maddw(result, result, pow4, addend);                    // result = result * pow4 + addend, where pow4 holds 31^4

I can re-check and post the performance numbers here upon request.

mikabl-arm avatar Apr 16 '24 10:04 mikabl-arm

I can re-check and post the performance numbers here upon request.

Please do. Please also post the code.

theRealAph avatar Apr 16 '24 12:04 theRealAph

I can re-check and post the performance numbers here upon request.

Please do. Please also post the code.

@theRealAph , you may find the performance numbers and the code in https://github.com/mikabl-arm/jdk/commit/f844b116f1a01653f127238d3a258cd2da4e1aca

mikabl-arm avatar Apr 18 '24 16:04 mikabl-arm

I can re-check and post the performance numbers here upon request.

Please do. Please also post the code.

@theRealAph , you may find the performance numbers and the code in mikabl-arm@f844b11

OK, thanks. I think I see the problem. Unfortunately I've come to the end of my working day, but I'll try to get back to you as soon as possible next week.

theRealAph avatar Apr 19 '24 17:04 theRealAph

In addition, doing only one vector per iteration is very wasteful. A high-performance AArch64 implementation can issue four multiply-accumulate vector instructions per cycle, with a 3-clock latency. By only issuing a single multiply-accumulate per iteration you're leaving a lot of performance on the table. I'd try to make the bulk width 16, and measure from there.

theRealAph avatar Apr 22 '24 12:04 theRealAph

You only need one load, add, and multiply per iteration. You don't need to add across columns until the end.

This is an example of how to do it. The full thing is at https://gist.github.com/theRealAph/cbc85299d6cd24101d46a06c12a97ce6.

    public static int vectorizedHashCode(int result, int[] a, int fromIndex, int length) {
        if (length < WIDTH) {
            return hashCode(result, a, fromIndex, length);
        }
        int offset = fromIndex;
        int[] sum = new int[WIDTH];
        sum[WIDTH - 1] = result;
        int[] temp = new int[WIDTH];
        int remaining = length;
        while (remaining >= WIDTH * 2) {
            vmult(sum, sum, n31powerWIDTH);
            vload(temp, a, offset);
            vadd(sum, sum, temp);
            offset += WIDTH;
            remaining -= WIDTH;
        }
        vmult(sum, sum, n31powerWIDTH);
        vload(temp, a, offset);
        vadd(sum, sum, temp);
        vmult(sum, sum, n31powersToWIDTH);
        offset += WIDTH;
        remaining -= WIDTH;
        result = vadd(sum);
        return hashCode(result, a, fromIndex + offset, remaining);
    }

theRealAph avatar Apr 22 '24 12:04 theRealAph

You only need one load, add, and multiply per iteration. You don't need to add across columns until the end.

This is an example of how to do it. The full thing is at https://gist.github.com/theRealAph/cbc85299d6cd24101d46a06c12a97ce6.

@theRealAph , looks reasonable, thank you for providing the listing! I'll get back to you on this once I have updated performance numbers.

mikabl-arm avatar Apr 22 '24 14:04 mikabl-arm

A high-performance AArch64 implementation can issue four multiply-accumulate vector instructions per cycle, with a 3-clock latency.

@theRealAph , hmph, could you elaborate on which spec you are referring to here?

mikabl-arm avatar Apr 22 '24 14:04 mikabl-arm

A high-performance AArch64 implementation can issue four multiply-accumulate vector instructions per cycle, with a 3-clock latency.

@theRealAph , hmph, could you elaborate on which spec you are referring to here?

That's not so much a spec, more Dougall's measured Apple M1 performance: https://dougallj.github.io/applecpu/measurements/firestorm/UMLAL_v_4S.html. Other high-end AArch64 designs can't do that, but they won't suffer by going wider. We should be able to sustain pipelined 4 int-wide elements/cycle.

theRealAph avatar Apr 22 '24 15:04 theRealAph

You only need one load, add, and multiply per iteration. You don't need to add across columns until the end.

@theRealAph , I've tried to follow the suggested approach; please find the patch and results in https://github.com/mikabl-arm/jdk/commit/e352f30d89e99417231ae7bb66b325c68a76eef9 .

So far I haven't been able to see any performance benefit compared to the implementation from this PR.

mikabl-arm avatar Apr 24 '24 13:04 mikabl-arm

You only need one load, add, and multiply per iteration. You don't need to add across columns until the end.

@theRealAph , I've tried to follow the suggested approach; please find the patch and results in mikabl-arm@e352f30 .

So far I haven't been able to see any performance benefit compared to the implementation from this PR.

Yeah, true. I can see why that's happening from prof perfnorm:

   4.30%  ↗   0x0000ffff70b3cdec:   mul		v1.4s, v1.4s, v3.4s
   0.45%  │   0x0000ffff70b3cdf0:   ld1		{v0.4s}, [x1], #16
  81.54%  │   0x0000ffff70b3cdf4:   add		v1.4s, v1.4s, v0.4s
   4.83%  │   0x0000ffff70b3cdf8:   subs		w2, w2, #4
   3.55%  ╰   0x0000ffff70b3cdfc:   b.hs		#0xffff70b3cdec

 ArraysHashCode.ints:IPC               1024  avgt          1.395          insns/clk

This is 1.4 insns/clk on a machine that can run 8 insns/clk. Because we're doing one load, then the MAC, then another load after the MAC, then a MAC that depends on the load: we stall the whole core waiting for the next load. Everything is serialized. Neoverse looks the same as Apple M1 here.

I guess the real question here is what we want. x86's engineers get this:

Benchmark            (size)  Mode  Cnt     Score    Error  Units
ArraysHashCode.ints       1  avgt    5     0.834 ±  0.001  ns/op
ArraysHashCode.ints      10  avgt    5     5.500 ±  0.016  ns/op
ArraysHashCode.ints     100  avgt    5    20.330 ±  0.103  ns/op
ArraysHashCode.ints   10000  avgt    5  1365.347 ±  1.045  ns/op

(And that's on my desktop box from 2018, an inferior piece of hardware.)

This is how they do it:

          ↗  0x00007f0634c21c17:   imul   ebx,r11d
   0.02%  │  0x00007f0634c21c1b:   vmovdqu ymm12,YMMWORD PTR [rdi+rsi*4]
          │  0x00007f0634c21c20:   vmovdqu ymm2,YMMWORD PTR [rdi+rsi*4+0x20]
   5.36%  │  0x00007f0634c21c26:   vmovdqu ymm0,YMMWORD PTR [rdi+rsi*4+0x40]
          │  0x00007f0634c21c2c:   vmovdqu ymm1,YMMWORD PTR [rdi+rsi*4+0x60]
   0.05%  │  0x00007f0634c21c32:   vpmulld ymm8,ymm8,ymm3
  11.12%  │  0x00007f0634c21c37:   vpaddd ymm8,ymm8,ymm12
   4.97%  │  0x00007f0634c21c3c:   vpmulld ymm9,ymm9,ymm3
  15.09%  │  0x00007f0634c21c41:   vpaddd ymm9,ymm9,ymm2
   5.16%  │  0x00007f0634c21c45:   vpmulld ymm10,ymm10,ymm3
  15.51%  │  0x00007f0634c21c4a:   vpaddd ymm10,ymm10,ymm0
   5.44%  │  0x00007f0634c21c4e:   vpmulld ymm11,ymm11,ymm3
  16.39%  │  0x00007f0634c21c53:   vpaddd ymm11,ymm11,ymm1
   4.80%  │  0x00007f0634c21c57:   add    esi,0x20
          │  0x00007f0634c21c5a:   cmp    esi,ecx
          ╰  0x00007f0634c21c5c:   jl     0x00007f0634c21c17

So, do we want to try to beat them on Arm, or not? They surely want to beat Arm.
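
To make the difference concrete, here is a scalar Java model (illustrative only, not HotSpot code) of the four-independent-accumulators pattern; the accumulators stand in for the four vector registers, and pow and all other names are invented for this sketch. Because each accumulator depends only on its own previous value, the four multiply-accumulates per iteration can be issued in parallel instead of forming one serial chain.

    // Four independent dependency chains, merged once after the loop.
    static int hashCodeFourChains(int result, int[] a) {
        final int P1 = 31, P2 = 31 * 31, P3 = 31 * 31 * 31, P4 = 31 * 31 * 31 * 31;
        int h0 = 0, h1 = 0, h2 = 0, h3 = 0;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            h0 = h0 * P4 + a[i];     // independent of h1..h3, so these four
            h1 = h1 * P4 + a[i + 1]; // updates can execute in the same cycle
            h2 = h2 * P4 + a[i + 2]; // on a wide out-of-order core
            h3 = h3 * P4 + a[i + 3];
        }
        // Merge once: positions within each 4-element block carry decreasing powers of 31.
        int merged = h0 * P3 + h1 * P2 + h2 * P1 + h3;
        result = result * pow(P4, i / 4) + merged;
        for (; i < a.length; i++) { // scalar tail
            result = 31 * result + a[i];
        }
        return result;
    }

    static int pow(int base, int exp) { // wrapping int power, consistent with Java int overflow
        int r = 1;
        for (int k = 0; k < exp; k++) r *= base;
        return r;
    }

This mirrors the structure of the x86 loop above (four ymm accumulators updated independently each iteration) and the lane-carrying Java example linked earlier in the thread.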

theRealAph avatar Apr 24 '24 16:04 theRealAph

Hi, is this one stuck? What you have today is definitely an improvement, even though it's not as good as what we have for x86. I guess we could commit this and leave widening the arithmetic for a later enhancement if you have no time to work on it.

theRealAph avatar May 10 '24 07:05 theRealAph

Hi @theRealAph , following your suggestions I've got this working for ints and can confirm that it improves the performance. I don't have enough time at the moment to finish it for shorts and bytes though. I can update the patch with current results on Monday and we could decide how to proceed with this PR after that. Sounds good?

mikabl-arm avatar May 10 '24 12:05 mikabl-arm

Hi,

I can update the patch with current results on Monday and we could decide how to proceed with this PR after that. Sounds good?

Yes, that's right.

theRealAph avatar May 10 '24 12:05 theRealAph

Just a note so it isn't missed later: the implementation might be affected by https://bugs.openjdk.org/browse/JDK-8139457

mikabl-arm avatar May 14 '24 10:05 mikabl-arm

I'm finishing up a patch, hopefully I'll push it later today.

mikabl-arm avatar May 14 '24 10:05 mikabl-arm

Hi @theRealAph ! You may find the latest version here: https://github.com/mikabl-arm/jdk/commit/b3db421c795f683db1a001853990026bafc2ed4b . I gave a short explanation in the commit message, feel free to ask for more details if required.

Unfortunately, it still contains critical bugs and I won't be able to look into them before next week at best. Until they are fixed, it's not possible to run the benchmarks. That said, I expect it to improve performance on longer integer arrays, based on a benchmark I've written in C++ and assembly. The results aren't comparable to the JMH results, so I won't post them here.

mikabl-arm avatar May 15 '24 15:05 mikabl-arm

Hi @theRealAph ! You may find the latest version here: mikabl-arm@b3db421 . I gave a short explanation in the commit message, feel free to ask for more details if required.

Unfortunately, it still contains critical bugs and I won't be able to look into them before next week at best. Until they are fixed, it's not possible to run the benchmarks. That said, I expect it to improve performance on longer integer arrays, based on a benchmark I've written in C++ and assembly. The results aren't comparable to the JMH results, so I won't post them here.

OK. One small thing, I think it's possible to rearrange things a bit to use mlav, which may help performance. No need for that until the code is correct, though.

theRealAph avatar May 16 '24 12:05 theRealAph

@mikabl-arm This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Jun 13 '24 15:06 bridgekeeper[bot]

Hi @mikabl-arm , the improvements for larger sizes look impressive, good work! Any timeline for getting it merged?

snadampal avatar Jun 14 '24 16:06 snadampal

Hi @snadampal ! Glad that you find the change useful :smile:

Thanks to @nick-arm I've made some progress fixing the existing issues, so I'm looking forward to updating the PR before next Tuesday.

mikabl-arm avatar Jun 27 '24 16:06 mikabl-arm

Hi @theRealAph ! This took a while, but please find a fixed version here: https://github.com/mikabl-arm/jdk/tree/285826-vmul

Here are performance numbers collected for Neoverse V2 compared to the common baseline and the latest state of this PR:

                                                          |    d2ea6b1e657    |    f19203015fb    |    5504227bfe3   |
                                                          |     baseline      |        PR         |    285826-vmul   |
----------------------------------------------------------|-------------------|-------------------|------------------|------
Benchmark                               (size)  Mode  Cnt |    Score    Error |    Score    Error |    Score   Error | Units
----------------------------------------------------------|-------------------|-------------------|------------------|------
ArraysHashCode.bytes                         1  avgt   15 |    0.859 ±  0.166 |    0.720 ±  0.103 |    0.732 ± 0.105 | ns/op
ArraysHashCode.bytes                        10  avgt   15 |    4.440 ±  0.013 |    2.262 ±  0.009 |    3.454 ± 0.057 | ns/op
ArraysHashCode.bytes                       100  avgt   15 |   78.642 ±  0.119 |   15.997 ±  0.023 |   12.753 ± 0.072 | ns/op
ArraysHashCode.bytes                     10000  avgt   15 | 9248.961 ± 11.332 | 1879.905 ± 11.609 | 1345.014 ± 1.947 | ns/op
ArraysHashCode.chars                         1  avgt   15 |    0.695 ±  0.036 |    0.694 ±  0.035 |    0.682 ± 0.036 | ns/op
ArraysHashCode.chars                        10  avgt   15 |    4.436 ±  0.015 |    2.428 ±  0.034 |    3.352 ± 0.031 | ns/op
ArraysHashCode.chars                       100  avgt   15 |   78.660 ±  0.113 |   14.508 ±  0.075 |   11.784 ± 0.088 | ns/op
ArraysHashCode.chars                     10000  avgt   15 | 9253.807 ± 13.660 | 2010.053 ±  3.549 | 1344.716 ± 1.936 | ns/op
ArraysHashCode.ints                          1  avgt   15 |    0.635 ±  0.022 |    0.640 ±  0.022 |    0.640 ± 0.022 | ns/op
ArraysHashCode.ints                         10  avgt   15 |    4.424 ±  0.006 |    2.752 ±  0.012 |    3.388 ± 0.004 | ns/op
ArraysHashCode.ints                        100  avgt   15 |   78.680 ±  0.120 |   14.794 ±  0.131 |   11.090 ± 0.055 | ns/op
ArraysHashCode.ints                      10000  avgt   15 | 9249.520 ± 13.305 | 1997.441 ±  3.299 | 1340.916 ± 1.843 | ns/op
ArraysHashCode.multibytes                    1  avgt   15 |    0.566 ±  0.023 |    0.563 ±  0.021 |    0.554 ± 0.012 | ns/op
ArraysHashCode.multibytes                   10  avgt   15 |    2.679 ±  0.018 |    1.798 ±  0.038 |    1.973 ± 0.021 | ns/op
ArraysHashCode.multibytes                  100  avgt   15 |   36.934 ±  0.055 |    9.118 ±  0.018 |   12.712 ± 0.026 | ns/op
ArraysHashCode.multibytes                10000  avgt   15 | 4861.700 ±  6.563 | 1005.809 ±  2.260 |  721.366 ± 1.570 | ns/op
ArraysHashCode.multichars                    1  avgt   15 |    0.557 ±  0.016 |    0.552 ±  0.001 |    0.563 ± 0.021 | ns/op
ArraysHashCode.multichars                   10  avgt   15 |    2.700 ±  0.018 |    1.840 ±  0.024 |    1.978 ± 0.008 | ns/op
ArraysHashCode.multichars                  100  avgt   15 |   36.932 ±  0.054 |    8.633 ±  0.020 |    8.678 ± 0.052 | ns/op
ArraysHashCode.multichars                10000  avgt   15 | 4859.462 ±  6.693 | 1063.788 ±  3.057 |  752.857 ± 5.262 | ns/op
ArraysHashCode.multiints                     1  avgt   15 |    0.574 ±  0.023 |    0.554 ±  0.011 |    0.559 ± 0.017 | ns/op
ArraysHashCode.multiints                    10  avgt   15 |    2.707 ±  0.028 |    1.907 ±  0.031 |    1.992 ± 0.036 | ns/op
ArraysHashCode.multiints                   100  avgt   15 |   36.942 ±  0.056 |    9.141 ±  0.013 |    8.174 ± 0.029 | ns/op
ArraysHashCode.multiints                 10000  avgt   15 | 4872.540 ±  7.479 | 1187.393 ± 12.083 |  785.256 ± 9.472 | ns/op
ArraysHashCode.multishorts                   1  avgt   15 |    0.558 ±  0.016 |    0.555 ±  0.012 |    0.566 ± 0.022 | ns/op
ArraysHashCode.multishorts                  10  avgt   15 |    2.696 ±  0.015 |    1.854 ±  0.027 |    1.983 ± 0.009 | ns/op
ArraysHashCode.multishorts                 100  avgt   15 |   36.930 ±  0.051 |    8.652 ±  0.011 |    8.681 ± 0.039 | ns/op
ArraysHashCode.multishorts               10000  avgt   15 | 4863.966 ±  6.736 | 1068.627 ±  1.902 |  760.280 ± 5.150 | ns/op
ArraysHashCode.shorts                        1  avgt   15 |    0.665 ±  0.058 |    0.644 ±  0.022 |    0.636 ± 0.023 | ns/op
ArraysHashCode.shorts                       10  avgt   15 |    4.431 ±  0.006 |    2.432 ±  0.024 |    3.332 ± 0.026 | ns/op
ArraysHashCode.shorts                      100  avgt   15 |   78.630 ±  0.103 |   14.521 ±  0.077 |   11.783 ± 0.093 | ns/op
ArraysHashCode.shorts                    10000  avgt   15 | 9249.908 ± 12.039 | 2010.461 ±  2.548 | 1344.441 ± 1.818 | ns/op
StringHashCode.Algorithm.defaultLatin1       1  avgt   15 |    0.770 ±  0.001 |    0.770 ±  0.001 |    0.770 ± 0.001 | ns/op
StringHashCode.Algorithm.defaultLatin1      10  avgt   15 |    4.305 ±  0.009 |    2.260 ±  0.009 |    3.433 ± 0.015 | ns/op
StringHashCode.Algorithm.defaultLatin1     100  avgt   15 |   78.355 ±  0.102 |   16.140 ±  0.038 |   12.767 ± 0.023 | ns/op
StringHashCode.Algorithm.defaultLatin1   10000  avgt   15 | 9269.665 ± 13.817 | 1893.354 ±  3.677 | 1345.571 ± 1.930 | ns/op
StringHashCode.Algorithm.defaultUTF16        1  avgt   15 |    0.736 ±  0.100 |    0.653 ±  0.083 |    0.690 ± 0.101 | ns/op
StringHashCode.Algorithm.defaultUTF16       10  avgt   15 |    4.280 ±  0.018 |    2.374 ±  0.021 |    3.394 ± 0.010 | ns/op
StringHashCode.Algorithm.defaultUTF16      100  avgt   15 |   78.312 ±  0.118 |   14.603 ±  0.103 |   11.837 ± 0.016 | ns/op
StringHashCode.Algorithm.defaultUTF16    10000  avgt   15 | 9249.562 ± 13.113 | 2011.717 ±  4.097 | 1344.715 ± 1.896 | ns/op
StringHashCode.cached                      N/A  avgt   15 |    0.539 ±  0.027 |    0.525 ±  0.018 |    0.525 ± 0.018 | ns/op
StringHashCode.empty                       N/A  avgt   15 |    0.861 ±  0.163 |    0.670 ±  0.079 |    0.694 ± 0.093 | ns/op
StringHashCode.notCached                   N/A  avgt   15 |    0.698 ±  0.108 |    0.648 ±  0.024 |    0.637 ± 0.023 | ns/op

There are several known issues:

  • [x] For arrays shorter than the number of elements processed by a single iteration of the Neon loop, performance is not optimal, though still better than the baseline's.
  • [x] The intrinsic takes 364 bytes in the worst case (for BYTE/BOOLEAN types), which may either significantly increase code size or limit inlining opportunities.
  • [ ] As mentioned before, the implementation might be affected by https://bugs.openjdk.org/browse/JDK-8139457 .

To address the first two, we could implement the vectorized part of the algorithm as a separate stub method. Please let me know if this sounds like the right approach or if you have other suggestions.

mikabl-arm avatar Jul 05 '24 17:07 mikabl-arm