jdk
jdk copied to clipboard
8317721: RISC-V: Implement CRC32 intrinsic
Hi everyone! Please review this port of AArch64 _updateBytesCRC32, _updateByteBufferCRC32 and _updateCRC32 intrinsics. This patch introduces only the plain (non-vectorized, no Zbc) version.
Correctness checks
Tier 1/2 tests are ok.
Performance results on T-Head board
Results for enabled intrinsic:
Used test is test/micro/org/openjdk/bench/java/util/TestCRC32.java
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 24 | 3730.929 | 37.773 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 24 | 2126.673 | 2.032 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 24 | 1134.330 | 6.714 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 24 | 584.017 | 2.267 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 24 | 151.173 | 0.346 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 24 | 19.113 | 0.008 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 24 | 4.647 | 0.022 | ops/ms |
Results for disabled intrinsic:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 15 | 798.365 | 35.486 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 15 | 677.756 | 46.619 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 15 | 552.781 | 27.143 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 15 | 429.304 | 12.518 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 15 | 166.738 | 0.935 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 15 | 25.060 | 0.034 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 15 | 6.196 | 0.030 | ops/ms |
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8317721: RISC-V: Implement CRC32 intrinsic (Enhancement - P4)
Reviewers
- Vladimir Kempik (@VladimirKempik - Committer) ⚠️ Review applies to 64abc7b4
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17046/head:pull/17046
$ git checkout pull/17046
Update a local copy of the PR:
$ git checkout pull/17046
$ git pull https://git.openjdk.org/jdk.git pull/17046/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 17046
View PR using the GUI difftool:
$ git pr show -t 17046
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17046.diff
Webrev
/cc hotspot-compiler
:wave: Welcome back ArsenyBochkarev! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@ArsenyBochkarev
The hotspot-compiler label was successfully added.
Webrevs
- 12: Full - Incremental (204e6caa)
- 11: Full - Incremental (95752910)
- 10: Full - Incremental (36b96465)
- 09: Full - Incremental (64abc7b4)
- 08: Full - Incremental (654b25b7)
- 07: Full - Incremental (d640af0e)
- 06: Full - Incremental (857dc20d)
- 05: Full (f5e2c52e)
- 04: Full - Incremental (b4616379)
- 03: Full - Incremental (046d5530)
- 02: Full - Incremental (a59481b4)
- 01: Full - Incremental (f7a4f0c7)
- 00: Full (d8d6968f)
Performance comparison for disabling/enabling Zba on StarFive VisionFive 2 board:
-XX:-UseZba:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 3563.320 | 3.326 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 1928.837 | 2.234 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1005.273 | 1.953 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 512.550 | 1.718 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 130.396 | 0.341 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 16.319 | 0.073 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 3.913 | 0.011 | ops/ms |
-XX:+UseZba:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 4206.654 | 0.547 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2308.843 | 3.565 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1214.727 | 0.305 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 623.173 | 0.651 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 158.965 | 0.376 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 19.934 | 0.055 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 4.730 | 0.007 | ops/ms |
-XX:-UseCRC32Intrinsics:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 1390.530 | 42.217 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 1109.742 | 24.201 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 805.345 | 12.155 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 520.965 | 5.651 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 169.591 | 0.747 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 22.624 | 0.139 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 5.430 | 0.016 | ops/ms |
Thanks! (not a review, just an ack)
There are two other version we probably need also, using carry-less-multiplication. There is a scalar clmul in Zbc and there is a vclmul in vector.
Performance measurements on the same benchmark, on T-Head board. I used the -XX:TieredStopAtLevel=1 flag:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 3617.860 | 17.463 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2253.626 | 4.739 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1242.516 | 83.245 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 687.339 | 1.712 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 183.016 | 0.258 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 23.368 | 0.133 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 5.640 | 0.023 | ops/ms |
Current results with no such flag:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 4270.963 | 47.669 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2448.523 | 72.738 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1360.692 | 30.246 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 709.729 | 7.413 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 183.903 | 1.051 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 23.320 | 0.038 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 5.651 | 0.017 | ops/ms |
Performance comparison for disabling/enabling Zba on StarFive VisionFive 2 board:
-XX:-UseZba:
Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 512.550 1.718 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 130.396 0.341 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 16.319 0.073 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 3.913 0.011 ops/ms
-XX:+UseZba:
Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 623.173 0.651 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 158.965 0.376 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 19.934 0.055 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 4.730 0.007 ops/ms
-XX:-UseCRC32Intrinsics:
Benchmark (count) Mode Cnt Score Error Units
CRC32.TestCRC32.testCRC32Update 512 thrpt 12 520.965 5.651 ops/ms
CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 169.591 0.747 ops/ms
CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 22.624 0.139 ops/ms
CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 5.430 0.016 ops/ms
Seems there is regression when count >= 512, especially when count >= 2048. And I suppose that big message is common case for CRC32 usage?
Performance comparison for disabling/enabling Zba on StarFive VisionFive 2 board:
-XX:-UseZba:Benchmark (count) Mode Cnt Score Error Units CRC32.TestCRC32.testCRC32Update 512 thrpt 12 512.550 1.718 ops/ms CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 130.396 0.341 ops/ms CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 16.319 0.073 ops/ms CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 3.913 0.011 ops/ms
-XX:+UseZba:Benchmark (count) Mode Cnt Score Error Units CRC32.TestCRC32.testCRC32Update 512 thrpt 12 623.173 0.651 ops/ms CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 158.965 0.376 ops/ms CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 19.934 0.055 ops/ms CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 4.730 0.007 ops/ms
-XX:-UseCRC32Intrinsics:Benchmark (count) Mode Cnt Score Error Units CRC32.TestCRC32.testCRC32Update 512 thrpt 12 520.965 5.651 ops/ms CRC32.TestCRC32.testCRC32Update 2048 thrpt 12 169.591 0.747 ops/ms CRC32.TestCRC32.testCRC32Update 16384 thrpt 12 22.624 0.139 ops/ms CRC32.TestCRC32.testCRC32Update 65536 thrpt 12 5.430 0.016 ops/msSeems there is regression when
count >= 512, especially whencount >= 2048. And I suppose that big message is common case for CRC32 usage?
Hmm, I don't know about common CRC32 count parameter sizes, maybe others know? :sweat: Or maybe anyone knows if there are other ways to optimize such plain version of intrinsic more?
Hello again everyone! I was able to optimize regressions for most cases on big amount of data by partially unrolling the big loop and disposing from loop counter (previously in len register). Results for -XX:+UseZba of StarFive VisionFive2 board:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 4215.728 | 3.972 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2607.882 | 1.627 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1364.899 | 8.857 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 704.316 | 3.222 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 180.738 | 0.474 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 22.722 | 0.059 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 5.327 | 0.019 | ops/ms |
while the results for -XX:-UseCRC32Intrinsics are here
Same the performance trend is that: the larger the data size, the closer the performance gap. when size is 65536, there seems a little perf regression.
So I wonder how it will behave when the size is bigger than 65536, and whether we need to consider the size bigger than 65536 depends on what's the expected regular data size of java CRC32, are the larger data size (equal or larger than 65536) common cases?
@ArsenyBochkarev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@ArsenyBochkarev This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8317721: RISC-V: Implement CRC32 intrinsic
Reviewed-by: vkempik, rehn
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 1388 new commits pushed to the master branch:
- 3ca2bcd402042791d7460dd79ee16a3f88436b3e: 8335060: ClassCastException after JDK-8294960
- 747e1e47f576b0ca3ac97d1deea87418e67ff2d1: 8334295: CTW: update modules
- 0a6ffa57954ddf4f92205205a5a1bada813d127a: 8261242: [Linux] OSContainer::is_containerized() returns true when run outside a container
- 71e3798bf67cddef37a8b4e377c4bf21dbd01567: 8335308: compiler/uncommontrap/DeoptReallocFailure.java times out with SerialGC on Windows
- c7e9ebb4cfff56b7a977eb2942f563f96b3336bd: 8331732: [PPC64] Unify and optimize code which converts != 0 to 1
- 53242cdf9ef17c502ebd541e84370e7c158639c1: 8335283: Build failure due to 'no_sanitize' attribute directive ignored
- d9bcf061450ebfb7fe02b5a50c855db1d9178e5d: 8335217: Fix memory ordering in ClassLoaderData::ChunkedHandleList
- bb18498d71dddf49db9bdfac886aed9ae123651d: 8335349: jcmd VM.classloaders "fold" option should be optional
- 8350b1daedae8ef5785a7165e664b1d3149b18b7: 8335294: Fix simple -Wzero-as-null-pointer-constant warnings in gc code
- 5d866bf17d96bd0f0e4545d7eee5912eda2e3a94: 8335252: Reduce size of j.u.Formatter.Conversion#isValid
- ... and 1378 more: https://git.openjdk.org/jdk/compare/fb390d202c8bbbbb87ba48fd01387feb35a1b768...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@feilongjiang, @Hamlin-Li, @RealFYang, @VladimirKempik, @robehn) but any other Committer may sponsor as well.
➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).
@ArsenyBochkarev this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8317721
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
I managed to get some additional acceleration for cases with Zba enabled. Updated data for StarFive VisionFive2:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 4231.837 | 12.249 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2678.843 | 1.631 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1405.024 | 6.509 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 727.608 | 1.393 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 186.552 | 0.389 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 23.423 | 0.087 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 5.493 | 0.015 | ops/ms |
Results for disabled intrinsic are here
performance numbers on unmatched
Current data for HiFive Unmatched (no Zba here!):
Enabled intrinsic:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 3180.082 | ± 63.442 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 1936.728 | ± 17.332 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1019.500 | ± 5.038 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 527.775 | ± 2.059 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 135.190 | ± 0.279 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 16.996 | ± 0.066 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 3.877 | ± 0.011 | ops/ms |
Disabled intrinsic:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 992.300 | ± 17.666 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 818.234 | ± 9.767 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 605.509 | ± 14.685 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 402.414 | ± 4.331 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 134.390 | ± 1.399 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 18.619 | ± 0.104 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 4.229 | ± 0.020 | ops/ms |
Perhaps we should do this intrinsic Zba-exclusive?
More accel for -XX:+UseZba case (on StarFive VisionFive2):
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 4415.416 | 2.822 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2756.321 | 0.769 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1450.461 | 4.115 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 750.851 | 0.496 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 192.352 | 0.599 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 24.209 | 0.044 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 5.670 | 0.015 | ops/ms |
The results looks good now on VF2. Considering Zba is mandatory for RVA22U64 this can be accepted now. Can some reviewer take a look at this again please ?
Thanks for updating and continuous refinement.
Seems the performance gain (last column in the picture) introduced by intrinsic is getting less and less when the data size increasing. So IMHO, when data size is big enough, it brings performance regression rather than performance gain.
@Hamlin-Li I modified test/micro/org/openjdk/bench/java/util/TestCRC32.java a bit to see if the regression happens on increased data, and run it on VisionFive2 with Zba enabed:
Enabled intrinsic
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 131072 | thrpt | 40 | 2.841 | 0.001 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 262144 | thrpt | 40 | 1.420 | 0.001 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 524288 | thrpt | 40 | 0.709 | 0.001 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2097152 | thrpt | 40 | 0.176 | 0.001 | ops/ms |
Disabled intrinsic
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 131072 | thrpt | 40 | 2.729 | 0.003 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 262144 | thrpt | 40 | 1.367 | 0.001 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 524288 | thrpt | 40 | 0.684 | 0.001 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2097152 | thrpt | 40 | 0.170 | 0.001 | ops/ms |
| (count) | enabled/disabled |
|---|---|
| 131072 | 1,041040674 |
| 262144 | 1,038771031 |
| 524288 | 1.036549708 |
| 2097152 | 1,035294118 |
So since there are no regressions compared to C2-generated code with -XX:+UseZba, how about making the intrinsic Zba-exclusive?
Thanks for updating. Seems fine, but I'm not sure. Maybe see how others think about it.
Just FYI, as the trend of performance gain in this implementation is less and less as the data size grow larger, so I wonder if the CRC algorithm used in this implementation is optimal enough. Seems there're other more advanced algorithms which are supposed to bring more optimistic performance gains, and some of these algorithms are already implemented on other platforms in jdk.
@ArsenyBochkarev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Hello again everyone! I made two changes to this PR:
- Now the intrinsic is Zba-exclusive.
- I used the
slli+srliwcombination instead ofsrli+andiand rescheduled them a bit, making use oft6register.
Current results on StarFive VisionFive2 (with Zba) are:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 4644.129 | 9.566 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2911.927 | 12.866 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1538.630 | 5.463 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 799.100 | 3.216 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 205.947 | 0.236 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 25.880 | 0.069 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 6.022 | 0.020 | ops/ms |
Results for disabled intrinsic on VisionFive2 (taken from here)
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 1390.530 | 42.217 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 1109.742 | 24.201 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 805.345 | 12.155 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 520.965 | 5.651 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 169.591 | 0.747 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 22.624 | 0.139 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 5.430 | 0.016 | ops/ms |
@ArsenyBochkarev This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Hi all! Can anyone take another look, please?
Current numbers on VisionFive2: with intrinsic enabled:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 4644.129 | 9.566 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 2911.927 | 12.866 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 1538.630 | 5.463 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 799.100 | 3.216 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 205.947 | 0.236 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 25.880 | 0.069 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 6.022 | 0.020 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 131072 | thrpt | 40 | 3.030 | 0.001 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 262144 | thrpt | 40 | 1.514 | 0.001 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 524288 | thrpt | 40 | 0.757 | 0.001 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2097152 | thrpt | 40 | 0.188 | 0.001 | ops/ms |
and without intrinsic enabled:
| Benchmark | (count) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| CRC32.TestCRC32.testCRC32Update | 64 | thrpt | 12 | 1390.530 | 42.217 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 128 | thrpt | 12 | 1109.742 | 24.201 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 256 | thrpt | 12 | 805.345 | 12.155 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 512 | thrpt | 12 | 520.965 | 5.651 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2048 | thrpt | 12 | 169.591 | 0.747 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 16384 | thrpt | 12 | 22.624 | 0.139 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 65536 | thrpt | 12 | 5.430 | 0.016 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 131072 | thrpt | 40 | 2.729 | 0.003 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 262144 | thrpt | 40 | 1.367 | 0.001 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 524288 | thrpt | 40 | 0.684 | 0.001 | ops/ms |
| CRC32.TestCRC32.testCRC32Update | 2097152 | thrpt | 40 | 0.170 | 0.001 | ops/ms |
Thanks everyone for all the comments! Sanity checks for ./test/hotspot/jtreg/compiler/codegen/CRCTest.java and test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32.java are ok.
/integrate
@ArsenyBochkarev Your change (at version 204e6caa409edfe581d8f9b9018dd41e87c19346) is now ready to be sponsored by a Committer.
/sponsor
Going to push as commit 2f4f6cc34c10c5519c74abbce8d1715013b50d5d.
Since your change was applied there have been 1388 commits pushed to the master branch:
- 3ca2bcd402042791d7460dd79ee16a3f88436b3e: 8335060: ClassCastException after JDK-8294960
- 747e1e47f576b0ca3ac97d1deea87418e67ff2d1: 8334295: CTW: update modules
- 0a6ffa57954ddf4f92205205a5a1bada813d127a: 8261242: [Linux] OSContainer::is_containerized() returns true when run outside a container
- 71e3798bf67cddef37a8b4e377c4bf21dbd01567: 8335308: compiler/uncommontrap/DeoptReallocFailure.java times out with SerialGC on Windows
- c7e9ebb4cfff56b7a977eb2942f563f96b3336bd: 8331732: [PPC64] Unify and optimize code which converts != 0 to 1
- 53242cdf9ef17c502ebd541e84370e7c158639c1: 8335283: Build failure due to 'no_sanitize' attribute directive ignored
- d9bcf061450ebfb7fe02b5a50c855db1d9178e5d: 8335217: Fix memory ordering in ClassLoaderData::ChunkedHandleList
- bb18498d71dddf49db9bdfac886aed9ae123651d: 8335349: jcmd VM.classloaders "fold" option should be optional
- 8350b1daedae8ef5785a7165e664b1d3149b18b7: 8335294: Fix simple -Wzero-as-null-pointer-constant warnings in gc code
- 5d866bf17d96bd0f0e4545d7eee5912eda2e3a94: 8335252: Reduce size of j.u.Formatter.Conversion#isValid
- ... and 1378 more: https://git.openjdk.org/jdk/compare/fb390d202c8bbbbb87ba48fd01387feb35a1b768...master
Your commit was automatically rebased without conflicts.