jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions

Open vpaprotsk opened this issue 3 years ago • 8 comments

Handcrafted x86_64 asm for Poly1305. Main optimization is to process 16 message blocks at a time. For more details, left a lot of comments in macroAssembler_x86_poly.cpp.

  • Added new KAT test for Poly1305 and a fuzz test to compare intrinsic and java.
    • Would like to add an InvalidKeyException in Poly1305.java (see commented out block in that file), but that conflicts with the KAT. I do think we should detect (R==0 || S ==0) so would like advice please.
  • Added a JMH perf test.
    • JMH test had to use reflection (instead of existing MacBench.java), since Poly1305 is not 'properly' registered with the provider.

Perf before:

Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
Poly1305DigestBench.digest          64              thrpt    8  2961300.661 ± 110554.162  ops/s
Poly1305DigestBench.digest         256              thrpt    8  1791912.962 ±  86696.037  ops/s
Poly1305DigestBench.digest        1024              thrpt    8   637413.054 ±  14074.655  ops/s
Poly1305DigestBench.digest       16384              thrpt    8    48762.991 ±    390.921  ops/s
Poly1305DigestBench.digest     1048576              thrpt    8      769.872 ±      1.402  ops/s

and after:

Benchmark                   (dataSize)  (provider)   Mode  Cnt        Score        Error  Units
Poly1305DigestBench.digest          64              thrpt    8  2841243.668 ± 154528.057  ops/s
Poly1305DigestBench.digest         256              thrpt    8  1662003.873 ±  95253.445  ops/s
Poly1305DigestBench.digest        1024              thrpt    8  1770028.718 ± 100847.766  ops/s
Poly1305DigestBench.digest       16384              thrpt    8   765547.287 ±  25883.825  ops/s
Poly1305DigestBench.digest     1048576              thrpt    8    14508.458 ±     56.147  ops/s

Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10582/head:pull/10582
$ git checkout pull/10582

Update a local copy of the PR:
$ git checkout pull/10582
$ git pull https://git.openjdk.org/jdk pull/10582/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10582

View PR using the GUI difftool:
$ git pr show -t 10582

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10582.diff

vpaprotsk avatar Oct 05 '22 21:10 vpaprotsk

Hi @vpaprotsk, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user vpaprotsk" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

bridgekeeper[bot] avatar Oct 05 '22 21:10 bridgekeeper[bot]

/covered

vpaprotsk avatar Oct 05 '22 21:10 vpaprotsk

I am part of Intel Java Team

vpaprotsk avatar Oct 05 '22 21:10 vpaprotsk

@vpaprotsk The following labels will be automatically applied to this pull request:

  • hotspot
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Oct 05 '22 21:10 openjdk[bot]

Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

bridgekeeper[bot] avatar Oct 05 '22 21:10 bridgekeeper[bot]

/label hotspot-compiler

vpaprotsk avatar Oct 05 '22 21:10 vpaprotsk

@vpaprotsk The hotspot-compiler label was successfully added.

openjdk[bot] avatar Oct 05 '22 21:10 openjdk[bot]

I executed some quick testing and this fails with:

[2022-10-21T09:54:28,696Z] # A fatal error has been detected by the Java Runtime Environment:
[2022-10-21T09:54:28,696Z] #
[2022-10-21T09:54:28,696Z] #  Internal Error (/opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S8380/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/5903b026-cdbd-4aa4-8433-6a45fb7ee593/runs/f75b29aa-40ef-46a5-b323-3a80aaa9aa6b/workspace/open/src/hotspot/cpu/x86/assembler_x86.cpp:5358), pid=2385300, tid=2385302
[2022-10-21T09:54:28,696Z] #  Error: assert(vector_len == AVX_128bit ? VM_Version::supports_avx() : vector_len == AVX_256bit ? VM_Version::supports_avx2() : vector_len == AVX_512bit ? VM_Version::supports_avx512bw() : 0) failed
[2022-10-21T09:54:28,696Z] #
[2022-10-21T09:54:28,696Z] # JRE version:  (20.0) (fastdebug build )
[2022-10-21T09:54:28,696Z] # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 20-internal-2022-10-21-0733397.tobias.hartmann.jdk2, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
[2022-10-21T09:54:28,696Z] # Problematic frame:
[2022-10-21T09:54:28,696Z] # V  [libjvm.so+0x6e3bf0]  Assembler::vpslldq(XMMRegister, XMMRegister, int, int)+0x190

TobiHartmann avatar Oct 21 '22 09:10 TobiHartmann

Hi @TobiHartmann , thanks for looking. Could you share CPU Model and flags from hs_err please?

vpaprotsk avatar Oct 21 '22 18:10 vpaprotsk

Test: jdk/incubator/vector/VectorMaxConversionTests.java#id1 Flags: -ea -esa -XX:UseAVX=3 -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting -XX:+UseZGC CPU: Intel 8358 (all AVX512 features).

I think the problem is this subtest runs with -XX:+UseKNLSettingVectorMaxConversionTests.java#L50 which limits AVX512 features.

Call stack:

V  [libjvm.so+0x6e3bf0]  Assembler::vpslldq(XMMRegister, XMMRegister, int, int)+0x190  (assembler_x86.cpp:5358)
V  [libjvm.so+0x152a23b]  MacroAssembler::poly1305_process_blocks_avx512(Register, Register, Register, Register, Register, Register, Register, Register)+0xc7b  (macroAssembler_x86_poly.cpp:590)
V  [libjvm.so+0x152c23d]  MacroAssembler::poly1305_process_blocks(Register, Register, Register, Register)+0x3ad  (macroAssembler_x86_poly.cpp:849)
V  [libjvm.so+0x192dc00]  StubGenerator::generate_poly1305_processBlocks()+0x170  (stubGenerator_x86_64.cpp:2069)
V  [libjvm.so+0x1936a89]  StubGenerator::generate_initial()+0x419  (stubGenerator_x86_64.cpp:3798)
V  [libjvm.so+0x1937b78]  StubGenerator_generate(CodeBuffer*, int)+0xf8  (stubGenerator_x86_64.hpp:526)
V  [libjvm.so+0x198e695]  StubRoutines::initialize1() [clone .part.0]+0x155  (stubRoutines.cpp:229)
V  [libjvm.so+0xfc4342]  init_globals()+0x32  (init.cpp:123)
V  [libjvm.so+0x1a7268f]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x37f  

vnkozlov avatar Oct 21 '22 18:10 vnkozlov

(Apologies, ignore the Stash: fetch limbs directly commit.. got git commit command mixed up.. will force-push a fix to the crash in a sec)

vpaprotsk avatar Oct 21 '22 20:10 vpaprotsk

@vpaprotsk Please do not rebase or force-push to an active PR as it invalidates existing review comments. All changes will be squashed into a single commit automatically when integrating. See OpenJDK Developers’ Guide for more information.

openjdk-notifier[bot] avatar Oct 21 '22 20:10 openjdk-notifier[bot]

Thanks @vnkozlov, was able to reproduce. @TobiHartmann, I added supports_avx512vlbw check to UsePolyIntrinsics.

vpaprotsk avatar Oct 21 '22 20:10 vpaprotsk

Thanks, I'll re-run testing.

TobiHartmann avatar Oct 24 '22 09:10 TobiHartmann

@ascarpino Could you please also take a look at this PR?

sviswa7 avatar Oct 24 '22 18:10 sviswa7

Regarding the updated numbers and master v. optimized-and-disabled, those are looking pretty good. Looks like the break-even point is at 64 bytes and gets better from there which I think addresses my concerns.

jnimeh avatar Nov 04 '22 14:11 jnimeh

Thanks for moving the conversion of R and A from the java code into the intrinsic. That certainly reduced the footprint on the java code with regard to performance and code flow.

ascarpino avatar Nov 04 '22 17:11 ascarpino

Revised numbers with getLimbs() interface change. Compared to previous version that got limbs in IR, change is within deviation.. (mostly -1%)

datasize master optimized disabled opt/mst dis/mst
32 3218169 3651078 3116558 1.13 0.97
64 2858030 3407518 2824903 1.19 0.99
128 2396796 3357224 2394802 1.40 1.00
256 1780679 3050142 1751130 1.71 0.98
512 1168824 2938952 1148479 2.51 0.98
1024 648772.1 2728454 687016.7 4.21 1.06
2048 357009 2393507 392928.2 6.70 1.10
16384 48854.33 903175.4 52874.78 18.49 1.08
1048576 771.461 14951.24 840.792 19.38 1.09

vpaprotsk avatar Nov 10 '22 03:11 vpaprotsk

@vpaprotsk This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8288047: Accelerate Poly1305 on x86_64 using AVX512 instructions

Reviewed-by: sviswanathan, vlivanov

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 119 new commits pushed to the master branch:

  • cd6a203a3e9e4e2f96f6c8649b10af92f8d9c27b: 8297348: make CONF=xxx should match if xxx is an exact match
  • 817e039bb5300e95ba60749f237f1243f72f4eeb: 8297352: configure should check pandoc version
  • 15e2e2852b7024cf9a6d58fd7ccb2474c1730e09: 8297353: Regenerated checked-in html files with new pandoc
  • b366d17a94e5b16710fd915ef4cf04aaf911b455: 8294073: Performance improvement for message digest implementations
  • 57f5cfdeb52b160e58968fb177b4432b3e079607: 8296399: crlNumExtVal might be null inside X509CRLSelector::match
  • 0b04a99245795c223a01d1cbe66a46d20e480c53: 8297347: Problem list compiler/debug/TestStress*.java
  • 0ac01485d3cf65b35a6ae7431dafccbca7e21eee: 8297342: make LOG=debug is too verbose
  • d0a7938eb7637acd0b1b559963a939dde30f6dcf: 8286575: Document how properties in java.security are parsed
  • 5c3345404d850cf01d9629b48015f1783a32bfc0: 8296472: Remove ObjectLocker around appendToClassPathForInstrumentation call
  • 08008139cc05a8271e7163eca47d2bc59db2049b: 8293584: CodeCache::old_nmethods_do incorrectly filters is_unloading nmethods
  • ... and 109 more: https://git.openjdk.org/jdk/compare/7357a1a379ed79c6754a8093eb108cd82062880a...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@jatin-bhateja, @sviswa7, @iwanowww) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

openjdk[bot] avatar Nov 11 '22 01:11 openjdk[bot]

The PR looks good to me. @ascarpino Please let us know if the Java side changes look good to you. @iwanowww Please let us know if the compiler side changes look good to you.

sviswa7 avatar Nov 11 '22 01:11 sviswa7

Try to get clean build, pull in https://github.com/openjdk/jdk/pull/11065

vpaprotsk avatar Nov 14 '22 17:11 vpaprotsk

(Build finally passing!)

Hi @TobiHartmann you had mentioned there were some more tests to run? Looking to see what else needs fixing. Thanks.

@iwanowww thanks for the reviews! As you have time, let me know what else you see or if its good for approval? Don't want to switch too much to another intrinsic yet, one crypto algorithm is about what I can fit into my brain at a time.

vpaprotsk avatar Nov 14 '22 21:11 vpaprotsk

Hi @TobiHartmann you had mentioned there were some more tests to run? Looking to see what else needs fixing. Thanks.

Sure, I re-submitted testing.

EDIT: I see that Vladimir already did that.

TobiHartmann avatar Nov 15 '22 06:11 TobiHartmann

@iwanowww Answered your review comments, please take a look again? Thanks again!

vpaprotsk avatar Nov 16 '22 22:11 vpaprotsk

@iwanowww Hope the extra tests passed? (Or do you have to re-run them on the latest patch again?)

vpaprotsk avatar Nov 21 '22 17:11 vpaprotsk

The test results look good. (Had to wait until testing is complete.)

iwanowww avatar Nov 21 '22 18:11 iwanowww

/integrate

vpaprotsk avatar Nov 21 '22 19:11 vpaprotsk

@vpaprotsk Your change (at version 08ea45e5d4ab071a56dd1a50d517dc3ed4428553) is now ready to be sponsored by a Committer.

openjdk[bot] avatar Nov 21 '22 19:11 openjdk[bot]

/sponsor

sviswa7 avatar Nov 21 '22 20:11 sviswa7