curve25519-donna icon indicating copy to clipboard operation
curve25519-donna copied to clipboard

Add an optimized curve25519_base_donna

Open nmathewson opened this issue 11 years ago • 4 comments

In many protocols, it doesn't hurt security for a timing attacker to know whether we're computing SK_9 ==> PUBKEY or whether we're computing SK_PUBKEY ==> SHARED_SECRET. For example, when you're doing Diffie-Hellman, everybody will do one scalarmult of the basepoint (9), and one scalarmult of the other party's public key.

For this kind of application, we can improve handshake speed by having an optimized curve25519_base_donna function (name taken from nacl's "crypto_scalarmult_base") that computes SK*BP.

Here I've only done the most obvious version of the optimization by making an alternate fmonty_bp that does a scalar multiplication by 9 in place of fmul(.,.,qmqp). On my laptop, the 64-bit curve25519_base_donna is 11.5% faster than the regular 64-bit curve25519_donna, and the 32-bit curve25519_base_donna is faster than its counterpart by 4.7%.

nmathewson avatar Dec 23 '12 19:12 nmathewson

I'm on vacation at the moment and I'm afraid it'll be a file of weeks till I'm properly back to normal and reviewing.

However, fixed base optimisations are typically much faster than this. If the time taken to do a key gen is a problem then I can probably do better, although with more code.

Are you thinking of using curve 25519 for ECDHE where key gen is common? On Dec 23, 2012 9:38 AM, "Nick Mathewson" [email protected] wrote:

In many protocols, it doesn't hurt security for a timing attacker to know whether we're computing SK_9 ==> PUBKEY or whether we're computing SK_PUBKEY ==> SHARED_SECRET. For example, when you're doing Diffie-Hellman, everybody will do one scalarmult of the basepoint (9), and one scalarmult of the other party's public key.

For this kind of application, we can improve handshake speed by having an optimized curve25519_base_donna function (name taken from nacl's "crypto_scalarmult_base") that computes SK*BP.

Here I've only done the most obvious version of the optimization by making an alternate fmonty_bp that does a scalar multiplication by 9 in place of fmul(.,.,qmqp). On my laptop, the 64-bit curve25519_base_donna is 11.5% faster than the regular 64-bit curve25519_donna, and the 32-bit curve25519_base_donna is faster than

its counterpart by 4.7%.

You can merge this Pull Request by running:

git pull https://github.com/nmathewson/curve25519-donna bp_opt

Or view, comment on, or merge it at:

https://github.com/agl/curve25519-donna/pull/23 Commit Summary

  • Add an optimized curve25519_base_donna

File Changes

  • M .gitignore (2)
  • M Makefile (14)
  • M curve25519-donna-c64.c (114)
  • M curve25519-donna.c (130)
  • M speed-curve25519.c (9)
  • A test-curve25519-bp.c (32)
  • M test-curve25519.c (2)

Patch Links

  • https://github.com/agl/curve25519-donna/pull/23.patch

  • https://github.com/agl/curve25519-donna/pull/23.diff

    — Reply to this email directly or view it on GitHubhttps://github.com/agl/curve25519-donna/pull/23.

agl avatar Dec 30 '12 06:12 agl

I've got a Tor branch where I'm using it for Ian Goldberg et al's "ntor" protocol, as described in their paper and specified in Tor proposal 216.

The server side of the protocol is performance-critical for me; it involves 3 curve25519 operations, one of which is an operation on the base point.

If you think it's something I could figure out, and you point me at something readable about how to do fixed base optimizations right, I might be able to take a crack at it.

The other conceivable optimization to do for that protocol would be to implement something like Shamir's simultaneous multiple exponentiation -- but that one isn't at all an obvious fit for how curve25519 wants to be implemented.

Alternatively, there's a protocol for a faster protocol out there, but it's less reviewed than ntor, and it would require point addition as well as simultaneous multiple curve25519.

nmathewson avatar Dec 30 '12 12:12 nmathewson

On Dec 30, 2012 2:54 AM, "Nick Mathewson" [email protected] wrote:

If you think it's something I could figure out, and you point me at something readable about how to do fixed base optimizations right, I might be able to take a crack at it.

Firstly, the biggest improvement is probably to grab djb's code from SUPERCOP and use that when building for amd64 / x86 / arm. (Since I wrote donna he's added amd64 code and it's faster than donna, the ARM code is certainly much better.)

But, as for adding fixed-base optimisations to donna, I suggest that we find Dan in a week and pin him down. (I think we're all at End of RSA / Real World Crypto.)

I can't find plain addition formula for Montgomery curves and I don't think there are any, so there are two obvious paths: work in Short Weierstrass form with Jacobian coordinates or work in twisted Edwards form with extended coordinates. I think the latter will be cleaner and I think both can be done without extending the field. Indeed, the conversion to SW seems to be just to addition of A/3 to the x-coordinate, and ed25519 uses a twisted Edward's version of curve25519 so the isomorphism must exist.

From there, one can precompute a table of values tapping two or four bit positions and so do a fixed based scalar mult with 64 adds and 64 doubles (with four taps). With a second table that's out of phase, it can be 64 adds and 32 doubles as the cost of more memory bandwidth.

Reading the table in constant time is a bit of a pain. Typically I've always read the whole table and used bit masks to select the desired value. However, if you're willing to have some platform specific code in there then knowledge of the cache line length can speed that up.

Cheers

AGL

agl avatar Dec 30 '12 14:12 agl

I'll see about snarfing the SUPERCOP implementations; I was kind of hoping that the next version of nacl would come out sometime soon, including those implementations, so I wouldn't have to worry about it and could just tell people "link against nacl." The schedule for that is a fine thing for us to ask Dan about when we're co-located.

WRT better implementations for implementing the ntor/ace handshakes: this is definitely a place to step back and talk about what the correct long-term solution really is. But even with the current naive implementation, ntor+curve25519-donna is much faster and much more secure than what Tor's been doing up till now, so I'm fine merging it and then optimizing it or moving to ace later.

nmathewson avatar Dec 30 '12 16:12 nmathewson