jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8329538: Accelerate P256 on x86_64 using Montgomery intrinsic

Open vpaprotsk opened this issue 10 months ago • 14 comments

Performance. Before:

Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score    Error  Units
SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3  6443.934 ±  6.491  ops/s
SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3  6152.979 ±  4.954  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3  1895.410 ± 36.979  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3  1878.955 ± 45.487  ops/s
Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3  1357.810 ± 26.584  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3  1352.119 ± 23.547  ops/s
Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
PolynomialP256Bench.benchMultiply          false  thrpt    3  1746.126 ± 10.970  ops/s

Performance, no intrinsic:

Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt     Score     Error  Units
SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3  6529.839 ±  42.420  ops/s
SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3  6199.747 ± 133.566  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3  1973.676 ±  54.071  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3  1932.127 ±  35.920  ops/s
Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3  1355.788 ± 29.858  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3  1346.523 ± 28.722  ops/s
Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
PolynomialP256Bench.benchMultiply           true  thrpt    3  1919.574 ± 10.591  ops/s

Performance, with intrinsics

Benchmark                        (algorithm)  (dataSize)  (keyLength)  (provider)   Mode  Cnt      Score     Error  Units
SignatureBench.ECDSA.sign    SHA256withECDSA        1024          256              thrpt    3  10384.591 ±  65.274  ops/s
SignatureBench.ECDSA.sign    SHA256withECDSA       16384          256              thrpt    3   9592.912 ± 236.411  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA        1024          256              thrpt    3   3479.494 ±  44.578  ops/s
SignatureBench.ECDSA.verify  SHA256withECDSA       16384          256              thrpt    3   3402.147 ±  26.772  ops/s
Benchmark                                            (algorithm)  (keyLength)  (kpgAlgorithm)  (provider)   Mode  Cnt     Score    Error  Units
o.o.b.j.c.full.KeyAgreementBench.EC.generateSecret          ECDH          256              EC              thrpt    3  2527.678 ± 64.791  ops/s
o.o.b.j.c.small.KeyAgreementBench.EC.generateSecret         ECDH          256              EC              thrpt    3  2541.258 ± 66.634  ops/s
Benchmark                          (isMontBench)   Mode  Cnt     Score    Error  Units
PolynomialP256Bench.benchMultiply           true  thrpt    3  3021.139 ± 98.289  ops/s

Summary on design (see code for 'ASCII art', references and details on math):

  • Added a new IntegerPolynomial field (MontgomeryIntegerPolynomialP256) with 52-bit limbs
    • getElement(*)/fromMontgomery() to convert numbers into/out of the field
  • ECOperations is the primary use of the new field
    • flattened some extra deep nested class hierarchy (also in prep for further other field optimizations)
    • forParameters()/multiply()/setSum() generates numbers in the new field
  • ProjectivePoint/Montgomery{Imm|M}utable.asAffine() to convert out of the new field
  • Added Fuzz Testing and KAT verified with OpenSSL

Progress

  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue
  • [x] Change must be properly reviewed (3 reviews required, with at least 1 Reviewer, 2 Authors)

Issue

  • JDK-8329538: Accelerate P256 on x86_64 using Montgomery intrinsic (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18583/head:pull/18583
$ git checkout pull/18583

Update a local copy of the PR:
$ git checkout pull/18583
$ git pull https://git.openjdk.org/jdk.git pull/18583/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18583

View PR using the GUI difftool:
$ git pr show -t 18583

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18583.diff

Webrev

Link to Webrev Comment

vpaprotsk avatar Apr 02 '24 15:04 vpaprotsk

:wave: Welcome back vpaprotsk! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Apr 02 '24 15:04 bridgekeeper[bot]

@vpaprotsk This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8329538: Accelerate P256 on x86_64 using Montgomery intrinsic

Reviewed-by: ihse, ascarpino, sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 8 new commits pushed to the master branch:

  • 9ca90ccd6bfec76e54e2e870bd706fad5abf233c: 8332610: Remove unused nWakeups in ObjectMonitor
  • 92d33501e091bdfaab52886078053b849a5a8f68: 8331920: ubsan: g1CardSetContainers.inline.hpp:266:5: runtime error: index 2 out of bounds for type 'G1CardSetHowl::ContainerPtr [2]' reported
  • 4f1a10f84bcfadef263a0890b6834ccd3d5bb52f: 8332360: JVM hangs at exit when running on a uniprocessor
  • c3bc23fe48ca1603afe68a6ac4aaa523a1edbb41: 8326306: RISC-V: Re-structure MASM calls and jumps
  • 8a9d77d58de259b6b2bdc2cc9e7bfdc28dcf7165: 8320622: [TEST] Improve coverage of compiler/loopopts/superword/TestMulAddS2I.java on different platforms
  • 3d511ff63e59f542ae20c722bfef1c867cd1da0e: 8329748: Change default value of AssertWXAtThreadSync to true
  • 67f03f2a4f5ac12748ffbf5c04f248a60869e180: 8332533: RISC-V: Enable vector variable shift instructions for machines with RVV
  • 5f804b2ec12627b593353ceeab881187b0bb5cd6: 8329825: Clarify the value type for java.net.SocketOptions.SO_LINGER

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@magicus, @jatin-bhateja, @ascarpino, @sviswa7) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

openjdk[bot] avatar Apr 02 '24 15:04 openjdk[bot]

@vpaprotsk The following labels will be automatically applied to this pull request:

  • build
  • core-libs
  • graal
  • hotspot
  • security

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Apr 02 '24 15:04 openjdk[bot]

@magicus The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 3 (with at least 1 Reviewer, 2 Authors).

openjdk[bot] avatar Apr 03 '24 09:04 openjdk[bot]

@ascarpino Hi Tony, this is the ECC P256 PR we talked about last year, would appreciate your feedback.

vpaprotsk avatar Apr 05 '24 17:04 vpaprotsk

In ECOperations.java, if I understand this correctly, it is to replace the existing PointMultiplier with montgomery-based PointMuliplier. But when I look at the code, I see both are still options. If I read this correctly, it checks for the old IntegerFieldModuloP, then looks for the new IntegerMontgomeryFieldModuloP. It appears to use the new one always. Why doesn't it just replace the old implementation entry in the fields Map? Is there a reason to keep it around?

ascarpino avatar Apr 10 '24 17:04 ascarpino

In ECOperations.java, if I understand this correctly, it is to replace the existing PointMultiplier with montgomery-based PointMuliplier. But when I look at the code, I see both are still options. If I read this correctly, it checks for the old IntegerFieldModuloP, then looks for the new IntegerMontgomeryFieldModuloP. It appears to use the new one always. Why doesn't it just replace the old implementation entry in the fields Map? Is there a reason to keep it around?

Hmm, thats a good point I haven't fully considered; i.e. (if I read correctly) "for CurveDB.P_256 remove the fallback path to non-montgomery entirely".. that might also help in cleaning a few things up in the construction. Maybe even get rid of this nested ECOperations inside ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the ECC stack clearer is positive!

One functional reason that might justify keeping it as-is, is fuzz-testing; with the fallback available, I am able to write the included Fuzz tests and have them check the values against the existing implementation. While I also included a few KAT tests using openssl-generated values, the fuzz tests check millions of values and it does add a lot more certainty about correctness of this code.

Can it be removed? For the operations that do not involve multiplication (i.e. setSum(*)), montgomery is expensive. I think I did go through the uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and existing IntegerPolynomialP256 is no longer used (I should verify that again) and only P256OrderField remains non-montgomery. So removing references to IntegerPolynomialP256 in ECOperations should be possible and cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also worth some thought..

I tend to like ECOperationsFuzzTest.java and would prefer to keep it, but it could also be chucked up as part of 'scaffolding' and removed in name of code quality?

Thanks @ascarpino

PS: Perhaps there is some middle ground, remove the ECOperations montgomeryOps nesting, and construct (somehow?? singleton makes most things inaccessible..) the reference ECOperations in the fuzz test instead.. not sure how yet, but perhaps worth a further thought..

vpaprotsk avatar Apr 10 '24 18:04 vpaprotsk

Few early comments.

Please update the copyright year of all the modified files.

You can even consider splitting this into two patches, Java side changes in one and x86 optimized intrinsic in next one.

Thanks Jatin, will fix!

vpaprotsk avatar Apr 10 '24 23:04 vpaprotsk

In ECOperations.java, if I understand this correctly, it is to replace the existing PointMultiplier with montgomery-based PointMuliplier. But when I look at the code, I see both are still options. If I read this correctly, it checks for the old IntegerFieldModuloP, then looks for the new IntegerMontgomeryFieldModuloP. It appears to use the new one always. Why doesn't it just replace the old implementation entry in the fields Map? Is there a reason to keep it around?

Hmm, thats a good point I haven't fully considered; i.e. (if I read correctly) "for CurveDB.P_256 remove the fallback path to non-montgomery entirely".. that might also help in cleaning a few things up in the construction. Maybe even get rid of this nested ECOperations inside ECOperations.. Perhaps nesting isnt a big deal, but all attempts to make the ECC stack clearer is positive!

One functional reason that might justify keeping it as-is, is fuzz-testing; with the fallback available, I am able to write the included Fuzz tests and have them check the values against the existing implementation. While I also included a few KAT tests using openssl-generated values, the fuzz tests check millions of values and it does add a lot more certainty about correctness of this code.

I hadn't looked at your fuzz test until you mentioned it. I see you are using reflection to change the values. Is that what you mean by "fallback"? I'm assuming there is no to access the older implementation without reflection.

Can it be removed? For the operations that do not involve multiplication (i.e. setSum(*)), montgomery is expensive. I think I did go through the uses of this code some time back (i.e. ECDHE, ECDSA and KeyGeneration) and existing IntegerPolynomialP256 is no longer used (I should verify that again) and only P256OrderField remains non-montgomery. So removing references to IntegerPolynomialP256 in ECOperations should be possible and cleaner. Removing IntegerPolynomialP256 from MontgomeryIntegerPolynomialP256 is harder (fromMontgomery() uses IntegerPolynomialP256) but perhaps also worth some thought..

I tend to like ECOperationsFuzzTest.java and would prefer to keep it, but it could also be chucked up as part of 'scaffolding' and removed in name of code quality?

I wouldn't rip out the old implementation. I have been wondering if we should make the older implementation available, maybe by security property. I was looking at the static Maps at the top of ECOperations, forParameters, and the constructors where it checks if the montgomeryOps was null or set. It would be nice if we could have one set of fields Maps by putting the montgomery entry into the fields to replace it. I think that should work because IntegerMontgomeryFieldModuloP extends IntegerFieldModuloP. instanceof or other montgomeryOps checks would still need to exist because not all the fields support mongomery, and the older implementation would still be accessible for your fuzz tester. At least that is my theory.

Thanks @ascarpino

PS: Perhaps there is some middle ground, remove the ECOperations montgomeryOps nesting, and construct (somehow?? singleton makes most things inaccessible..) the reference ECOperations in the fuzz test instead.. not sure how yet, but perhaps worth a further thought..

It would be nice to remove the nesting and it would be nice to be a singleton. Maybe some combination of what I mentioned above chance can help that. I haven't fully thought this out either.

ascarpino avatar Apr 11 '24 17:04 ascarpino

Few early comments.

Please update the copyright year of all the modified files.

You can even consider splitting this into two patches, Java side changes in one and x86 optimized intrinsic in next one.

Fixed all copyright years

git diff da8a095a19c90e7ee2b45fab9b533a1092887023 | lsdiff -p1 | while read line; do echo $line =========================; grep Copyright $line | grep -v 2024; done | less

Re splitting.. probably too late for that now.. (did consider it initially.. got hard to manage two changes while developing. And easier to justify the change when the entire patch is presented? but yes, far more code to review.. )

vpaprotsk avatar Apr 15 '24 22:04 vpaprotsk

@ascarpino Fixed as suggested... actually.. that was waaay easier then I thought it would be

(I saw singleton and assumed private constructor.. nope, ECOperations() is public, no reflection required!! Ended up with cleaner implementation and cleaner tests! Thanks!)

Sidenote: Unlike the hashtable you suggested I use, I removed the other hashtable (multiplier hashtable).. the multiplier hashtable would trigger classloading of multipliers when ECOperations itself is loaded. By moving it into if statements, loading of multipliers is delayed until function call (i.e. loading of multipliers causes the table computation). That let me remove one layer of class nesting that existed before.

vpaprotsk avatar Apr 15 '24 22:04 vpaprotsk

The intrinsics and the C2 changes look good to me.

sviswa7 avatar May 17 '24 21:05 sviswa7

Thanks Sandhya!

Now that I have @ascarpino approval as well, I plan to integrate next Tuesday.

vpaprotsk avatar May 17 '24 22:05 vpaprotsk

I'll send this through our testing and will report back once it passed.

TobiHartmann avatar May 21 '24 07:05 TobiHartmann

I'm getting some conflicts when trying to apply this to master. Could you please merge the PR?

TobiHartmann avatar May 21 '24 07:05 TobiHartmann

Hi @TobiHartmann , merged with no issues for me. Could you please run the tests again? (I think Tony did run them, but can't hurt verifying again). Thanks!

vpaprotsk avatar May 21 '24 17:05 vpaprotsk

Thanks! I submitted testing and will report back once it passed.

TobiHartmann avatar May 22 '24 05:05 TobiHartmann

All tests passed.

TobiHartmann avatar May 22 '24 14:05 TobiHartmann

Thanks Tobi!

/integrate

vpaprotsk avatar May 22 '24 14:05 vpaprotsk

@vpaprotsk Your change (at version b1a3300440581c5ed42c9bac7cfb9a6ee373c6d7) is now ready to be sponsored by a Committer.

openjdk[bot] avatar May 22 '24 14:05 openjdk[bot]

/sponsor

sviswa7 avatar May 22 '24 16:05 sviswa7

Going to push as commit afed7d0b0593864e5595840a6b645c210ff28c7c. Since your change was applied there have been 8 commits pushed to the master branch:

  • 9ca90ccd6bfec76e54e2e870bd706fad5abf233c: 8332610: Remove unused nWakeups in ObjectMonitor
  • 92d33501e091bdfaab52886078053b849a5a8f68: 8331920: ubsan: g1CardSetContainers.inline.hpp:266:5: runtime error: index 2 out of bounds for type 'G1CardSetHowl::ContainerPtr [2]' reported
  • 4f1a10f84bcfadef263a0890b6834ccd3d5bb52f: 8332360: JVM hangs at exit when running on a uniprocessor
  • c3bc23fe48ca1603afe68a6ac4aaa523a1edbb41: 8326306: RISC-V: Re-structure MASM calls and jumps
  • 8a9d77d58de259b6b2bdc2cc9e7bfdc28dcf7165: 8320622: [TEST] Improve coverage of compiler/loopopts/superword/TestMulAddS2I.java on different platforms
  • 3d511ff63e59f542ae20c722bfef1c867cd1da0e: 8329748: Change default value of AssertWXAtThreadSync to true
  • 67f03f2a4f5ac12748ffbf5c04f248a60869e180: 8332533: RISC-V: Enable vector variable shift instructions for machines with RVV
  • 5f804b2ec12627b593353ceeab881187b0bb5cd6: 8329825: Clarify the value type for java.net.SocketOptions.SO_LINGER

Your commit was automatically rebased without conflicts.

openjdk[bot] avatar May 22 '24 16:05 openjdk[bot]

@sviswa7 @vpaprotsk Pushed as commit afed7d0b0593864e5595840a6b645c210ff28c7c.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk[bot] avatar May 22 '24 16:05 openjdk[bot]

Unfortunately, this caused a performance regression, see JDK-8333583. @vpaprotsk, please have a look.

TobiHartmann avatar Jun 05 '24 11:06 TobiHartmann