jdk
jdk copied to clipboard
8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions
Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in macroAssembler_x86_poly.cpp.
- Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java.
- Would like to add an
InvalidKeyExceptioninPoly1305.java(see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please.
- Would like to add an
- Added a JMH perf test.
- JMH test had to use reflection (instead of existing
MacBench.java), since Poly1305 is not 'properly' registered with the provider.
- JMH test had to use reflection (instead of existing
Perf before:
Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.digest 64 thrpt 8 2961300.661 ± 110554.162 ops/s
Poly1305DigestBench.digest 256 thrpt 8 1791912.962 ± 86696.037 ops/s
Poly1305DigestBench.digest 1024 thrpt 8 637413.054 ± 14074.655 ops/s
Poly1305DigestBench.digest 16384 thrpt 8 48762.991 ± 390.921 ops/s
Poly1305DigestBench.digest 1048576 thrpt 8 769.872 ± 1.402 ops/s
and after:
Benchmark (dataSize) (provider) Mode Cnt Score Error Units
Poly1305DigestBench.digest 64 thrpt 8 2841243.668 ± 154528.057 ops/s
Poly1305DigestBench.digest 256 thrpt 8 1662003.873 ± 95253.445 ops/s
Poly1305DigestBench.digest 1024 thrpt 8 1770028.718 ± 100847.766 ops/s
Poly1305DigestBench.digest 16384 thrpt 8 765547.287 ± 25883.825 ops/s
Poly1305DigestBench.digest 1048576 thrpt 8 14508.458 ± 56.147 ops/s
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582
$ git checkout pull/10582
Update a local copy of the PR:
$ git checkout pull/10582
$ git pull https://git.openjdk.org/jdk pull/10582/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 10582
View PR using the GUI difftool:
$ git pr show -t 10582
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10582.diff
Hi @vpaprotsk, welcome to this OpenJDK project and thanks for contributing!
We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.
If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user vpaprotsk" as summary for the issue.
If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.
/covered
I am part of Intel Java Team
@vpaprotsk The following labels will be automatically applied to this pull request:
hotspotsecurity
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!
/label hotspot-compiler
@vpaprotsk
The hotspot-compiler label was successfully added.
Webrevs
- 21: Full - Incremental (08ea45e5)
- 20: Full - Incremental (56aed9b1)
- 19: Full - Incremental (dbdfd1dc)
- 18: Full - Incremental (cbf49380)
- 17: Full - Incremental (58488f42)
- 16: Full (8f5942d9)
- 15: Full (a26ac7db)
- 14: Full - Incremental (2a225e42)
- 13: Full - Incremental (835fbe3a)
- 12: Full - Incremental (196ee35b)
- 11: Full - Incremental (2176caf8)
- 10: Full - Incremental (abfc68f4)
- 09: Full - Incremental (8b1b40f7)
- 08: Full - Incremental (da560452)
- 07: Full (120247d5)
- 06: Full (38d9e83c)
- 05: Full - Incremental (78fd8fd7)
- 04: Full - Incremental (883be106)
- 03: Full (de7e138b)
- 02: Full - Incremental (f048f938)
- 01: Full - Incremental (6a60c128)
- 00: Full (7e070d9e)
I executed some quick testing and this fails with:
[2022-10-21T09:54:28,696Z] # A fatal error has been detected by the Java Runtime Environment:
[2022-10-21T09:54:28,696Z] #
[2022-10-21T09:54:28,696Z] # Internal Error (/opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S8380/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/5903b026-cdbd-4aa4-8433-6a45fb7ee593/runs/f75b29aa-40ef-46a5-b323-3a80aaa9aa6b/workspace/open/src/hotspot/cpu/x86/assembler_x86.cpp:5358), pid=2385300, tid=2385302
[2022-10-21T09:54:28,696Z] # Error: assert(vector_len == AVX_128bit ? VM_Version::supports_avx() : vector_len == AVX_256bit ? VM_Version::supports_avx2() : vector_len == AVX_512bit ? VM_Version::supports_avx512bw() : 0) failed
[2022-10-21T09:54:28,696Z] #
[2022-10-21T09:54:28,696Z] # JRE version: (20.0) (fastdebug build )
[2022-10-21T09:54:28,696Z] # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 20-internal-2022-10-21-0733397.tobias.hartmann.jdk2, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
[2022-10-21T09:54:28,696Z] # Problematic frame:
[2022-10-21T09:54:28,696Z] # V [libjvm.so+0x6e3bf0] Assembler::vpslldq(XMMRegister, XMMRegister, int, int)+0x190
Hi @TobiHartmann , thanks for looking. Could you share CPU Model and flags from hs_err please?
Test: jdk/incubator/vector/VectorMaxConversionTests.java#id1
Flags: -ea -esa -XX:UseAVX=3 -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting -XX:+UseZGC
CPU: Intel 8358 (all AVX512 features).
I think the problem is this subtest runs with -XX:+UseKNLSettingVectorMaxConversionTests.java#L50 which limits AVX512 features.
Call stack:
V [libjvm.so+0x6e3bf0] Assembler::vpslldq(XMMRegister, XMMRegister, int, int)+0x190 (assembler_x86.cpp:5358)
V [libjvm.so+0x152a23b] MacroAssembler::poly1305_process_blocks_avx512(Register, Register, Register, Register, Register, Register, Register, Register)+0xc7b (macroAssembler_x86_poly.cpp:590)
V [libjvm.so+0x152c23d] MacroAssembler::poly1305_process_blocks(Register, Register, Register, Register)+0x3ad (macroAssembler_x86_poly.cpp:849)
V [libjvm.so+0x192dc00] StubGenerator::generate_poly1305_processBlocks()+0x170 (stubGenerator_x86_64.cpp:2069)
V [libjvm.so+0x1936a89] StubGenerator::generate_initial()+0x419 (stubGenerator_x86_64.cpp:3798)
V [libjvm.so+0x1937b78] StubGenerator_generate(CodeBuffer*, int)+0xf8 (stubGenerator_x86_64.hpp:526)
V [libjvm.so+0x198e695] StubRoutines::initialize1() [clone .part.0]+0x155 (stubRoutines.cpp:229)
V [libjvm.so+0xfc4342] init_globals()+0x32 (init.cpp:123)
V [libjvm.so+0x1a7268f] Threads::create_vm(JavaVMInitArgs*, bool*)+0x37f
(Apologies, ignore the Stash: fetch limbs directly commit.. got git commit command mixed up.. will force-push a fix to the crash in a sec)
@vpaprotsk Please do not rebase or force-push to an active PR as it invalidates existing review comments. All changes will be squashed into a single commit automatically when integrating. See OpenJDK Developers’ Guide for more information.
Thanks @vnkozlov, was able to reproduce. @TobiHartmann, I added supports_avx512vlbw check to UsePolyIntrinsics.
Thanks, I'll re-run testing.
@ascarpino Could you please also take a look at this PR?
Regarding the updated numbers and master v. optimized-and-disabled, those are looking pretty good. Looks like the break-even point is at 64 bytes and gets better from there which I think addresses my concerns.
Thanks for moving the conversion of R and A from the java code into the intrinsic. That certainly reduced the footprint on the java code with regard to performance and code flow.
Revised numbers with getLimbs() interface change. Compared to previous version that got limbs in IR, change is within deviation.. (mostly -1%)
| datasize | master | optimized | disabled | opt/mst | dis/mst |
|---|---|---|---|---|---|
| 32 | 3218169 | 3651078 | 3116558 | 1.13 | 0.97 |
| 64 | 2858030 | 3407518 | 2824903 | 1.19 | 0.99 |
| 128 | 2396796 | 3357224 | 2394802 | 1.40 | 1.00 |
| 256 | 1780679 | 3050142 | 1751130 | 1.71 | 0.98 |
| 512 | 1168824 | 2938952 | 1148479 | 2.51 | 0.98 |
| 1024 | 648772.1 | 2728454 | 687016.7 | 4.21 | 1.06 |
| 2048 | 357009 | 2393507 | 392928.2 | 6.70 | 1.10 |
| 16384 | 48854.33 | 903175.4 | 52874.78 | 18.49 | 1.08 |
| 1048576 | 771.461 | 14951.24 | 840.792 | 19.38 | 1.09 |
@vpaprotsk This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions
Reviewed-by: sviswanathan, vlivanov
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 119 new commits pushed to the master branch:
- cd6a203a3e9e4e2f96f6c8649b10af92f8d9c27b: 8297348: make CONF=xxx should match if xxx is an exact match
- 817e039bb5300e95ba60749f237f1243f72f4eeb: 8297352: configure should check pandoc version
- 15e2e2852b7024cf9a6d58fd7ccb2474c1730e09: 8297353: Regenerated checked-in html files with new pandoc
- b366d17a94e5b16710fd915ef4cf04aaf911b455: 8294073: Performance improvement for message digest implementations
- 57f5cfdeb52b160e58968fb177b4432b3e079607: 8296399: crlNumExtVal might be null inside X509CRLSelector::match
- 0b04a99245795c223a01d1cbe66a46d20e480c53: 8297347: Problem list compiler/debug/TestStress*.java
- 0ac01485d3cf65b35a6ae7431dafccbca7e21eee: 8297342: make LOG=debug is too verbose
- d0a7938eb7637acd0b1b559963a939dde30f6dcf: 8286575: Document how properties in java.security are parsed
- 5c3345404d850cf01d9629b48015f1783a32bfc0: 8296472: Remove ObjectLocker around appendToClassPathForInstrumentation call
- 08008139cc05a8271e7163eca47d2bc59db2049b: 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods
- ... and 109 more: https://git.openjdk.org/jdk/compare/7357a1a379ed79c6754a8093eb108cd82062880a...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@jatin-bhateja, @sviswa7, @iwanowww) but any other Committer may sponsor as well.
➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).
The PR looks good to me. @ascarpino Please let us know if the Java side changes look good to you. @iwanowww Please let us know if the compiler side changes look good to you.
Try to get clean build, pull in https://github.com/openjdk/jdk/pull/11065
(Build finally passing!)
Hi @TobiHartmann you had mentioned there were some more tests to run? Looking to see what else needs fixing. Thanks.
@iwanowww thanks for the reviews! As you have time, let me know what else you see or if its good for approval? Don't want to switch too much to another intrinsic yet, one crypto algorithm is about what I can fit into my brain at a time.
Hi @TobiHartmann you had mentioned there were some more tests to run? Looking to see what else needs fixing. Thanks.
Sure, I re-submitted testing.
EDIT: I see that Vladimir already did that.
@iwanowww Answered your review comments, please take a look again? Thanks again!
@iwanowww Hope the extra tests passed? (Or do you have to re-run them on the latest patch again?)
The test results look good. (Had to wait until testing is complete.)
/integrate
@vpaprotsk Your change (at version 08ea45e5d4ab071a56dd1a50d517dc3ed4428553) is now ready to be sponsored by a Committer.
/sponsor