curve25519-dalek
Implement ABGLSV-Pornin multiplication
Adds a backend for computing δ(aA + bB - C) in variable time, where:

- `B` is the Ed25519 basepoint;
- `δ` is a value invertible mod ℓ, which is selected internally to the function.
This corresponds to the signature verification optimisation presented in Antipa et al 2005. It uses Algorithm 4 from Pornin 2020 to find a suitable short vector, and then windowed non-adjacent form Straus for the resulting multiscalar multiplication.
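For intuition, a rough sketch of the rewrite (my paraphrase and notation, not text from either paper or from the PR):

```latex
\text{To check } [a]A + [b]B = C \text{ with full-size (}\sim 252\text{-bit) scalars,}
\text{find a short vector } (c_0, c_1) \text{ with}
    c_1 a \equiv c_0 \pmod{\ell}, \qquad |c_0|, |c_1| \approx \sqrt{\ell} \approx 2^{126},
\text{then take } \delta = c_1 \text{ and check instead}
    [c_0]A + [(\delta b \bmod \ell)]B - [\delta]C = \mathcal{O}.
```

The A and C terms now have half-length scalars, roughly halving the doublings; the remaining full-length scalar multiplies only the fixed basepoint B, where precomputed tables keep it cheap.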
References:
- Antipa et al 2005: http://cacr.uwaterloo.ca/techreports/2005/cacr2005-28.pdf
- Pornin 2020: https://eprint.iacr.org/2020/454
This is really cool! A few comments / questions based on a quick read of the source code:
- I'm a little uncertain about the type signature of `vartime_triple_scalar_mul_basepoint`. Specifically, it returns an `EdwardsPoint` with the value of `[δa]A + [δb]B - [δ]C`, but since `δ` is selected internally to the function and isn't returned, this value isn't useful to the caller. All that they can usefully do is check whether the result is the identity (maybe after multiplication by the cofactor).

  So, perhaps the API would be simpler if, rather than being conceptualized as a new triple-base scalar multiplication, it was conceptualized as a more efficient implementation of the check:

  ```rust
  /// Checks whether \\([a]A + [b]B = C\\) in variable time.
  pub fn vartime_check_double_scalar_mul_basepoint(
      a: &Scalar,
      A: &EdwardsPoint,
      b: &Scalar,
      C: &EdwardsPoint,
  ) -> bool {
      /* ... */
  }
  ```
  This can use a triple-base function like the one you already have internally, but the advantage is that in the exposed API surface of the crate, we only commit to the functionality, and not the method. The `EdwardsPoint` version of the API can do a cofactor multiplication on the result of the internal function, but the `RistrettoPoint` version doesn't have to.

- If we conceptualize it that way, an obvious second question is: what is the purpose of `vartime_double_scalar_mul_basepoint`? Well, it was exposed for the sole purpose of providing the functionality of `vartime_check_double_scalar_mul_basepoint`, but because it didn't describe the problem at a high-enough level (it gives an intermediate value that the user is supposed to compare), we can't transparently drop in a better implementation. (Avoiding exactly this kind of problem is the motivation for the change suggested above.)

  However, a code search of all of GitHub reveals that with one exception, every single user of this function could be using the faster check function instead, so I think we got the API wrong, and if this code is merged, I would like to deprecate and/or remove the non-checked version. The exception is a Bulletproofs implementation that would be faster if it used the generic multiscalar multiplication implementation. This would require a new major version, but because the breaking change is locally scoped and in the direction of existing code, I don't think this is a big deal, provided someone (e.g., me) is willing to do the work of submitting patches to all downstream crates.
- If `vartime_double_scalar_mul_basepoint` isn't there (or won't be in the future), a third question is: what parts of its implementation are carried over to this one, which parts should be, and which parts can be dropped from the codebase? One thing that sticks out as a carryover is the width-8 NAF lookup tables. These were added to the previous implementation as an easy way to get a slight speedup without much algorithmic work, at the cost of additional memory usage. This cost isn't just a problem for the binary size, but also because larger tables are more expensive to access at runtime, which is why they don't work well for very large multiscalar multiplications. Now that the algorithm is different, is there still a significant benefit to using the width-8 instead of width-5 tables? Can we use this change as an opportunity to save memory? It would be good to have some empirical numbers, but because they're microbenchmarks, it's hard to judge the right decision based on those numbers alone (e.g., filling the entire cache with lookup tables is great for microbenchmarks, but real applications do other work with other data that those tables evict).

- The algorithm requires an implementation of some big-integer arithmetic. Because this is only used once to prepare the inputs to the multiscalar function, it seems like its performance is less critical than other arithmetic, so I wonder whether it would make sense to have a single implementation using `u128`s and rely on the compiler to lower it to machine code. This may be slightly less efficient, but it has a huge saving in code complexity and maintainability. So I don't think it's necessary to implement two versions of the lattice reduction code for different architectures. It would also be good to fit the types a little better to the problem than just `BigInt`, which I'm guessing is what you meant by refining the type to `ShrinkingInt`?

  I'm not exactly sure how the code would end up being factored across the different backends, but my guess would be that it would look something like:

  - backend-agnostic code to do the lattice reduction, maybe in the `scalar` subtree;
  - an implementation of the triple-base mul in the `backend::serial` subtree;
  - an implementation of the triple-base mul in the `backend::vector` subtree.

  Does that seem right?
> So, perhaps the API would be simpler if, rather than being conceptualized as a new triple-base scalar multiplication, it was conceptualized as a more efficient implementation of the check:
>
> The `EdwardsPoint` version of the API can do a cofactor multiplication on the result of the internal function, but the `RistrettoPoint` version doesn't have to.

I did consider this API, but wasn't sure whether there were any cases where we would want to not use cofactor multiplication on the result for `EdwardsPoint`. If we're happy making that an internal consideration (or maybe a boolean flag or parallel API), then yes, this is definitely simpler.
> I think we got the API wrong, and if this code is merged, I would like to deprecate and/or remove the non-checked version.
Yep, this is also what I'd like. I kept the prior API initially so I had something to benchmark against 😄
> Now that the algorithm is different, is there still a significant benefit to using the width-8 instead of width-5 tables? Can we use this change as an opportunity to save memory?

It looks like in Curve9767, @pornin uses width-4 for runtime-calculated tables, and width-5 for pre-computed tables. I don't know if he has relevant benchmarks, but it's another data point towards dropping width-8. I think it would make sense to examine this in a subsequent PR, separate from this change.
> The algorithm requires an implementation of some big-integer arithmetic. Because this is only used once to prepare the inputs to the multiscalar function, it seems like its performance is less critical than other arithmetic, so I wonder whether it would make sense to have a single implementation using `u128`s and rely on the compiler to lower it to machine code.

The input preparation is performance-critical, in that the pre-Pornin algorithms were slow enough that the reduction in doublings could not offset it (which led to the Ed25519 paper dismissing ABGLSV and using double-base scalar mult instead). That said, a `u128`-based impl may still be performant enough to retain the overall benefit on platforms that would normally use the `u32` backend.
I had originally started writing `BigInt64` using `u128`s, but was not sure whether that was acceptable inside the `u64` backend. I'll rework it as a backend-independent implementation instead.
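As a concrete illustration of the backend-independent style under discussion (my sketch, not the PR's `BigInt` code): schoolbook multiplication over `u64` limbs can use `u128` as the double-wide accumulator and leave the lowering to the compiler.

```rust
/// Schoolbook multiplication of two little-endian u64-limb integers.
/// The 64x64 -> 128-bit products go through u128, so the same code works on
/// any target; on 64-bit machines the compiler lowers it to widening mul
/// and add-with-carry instructions.
fn mul_limbs(a: &[u64], b: &[u64]) -> Vec<u64> {
    let mut out = vec![0u64; a.len() + b.len()];
    for (i, &ai) in a.iter().enumerate() {
        let mut carry = 0u128;
        for (j, &bj) in b.iter().enumerate() {
            // product + existing limb + carry never overflows u128:
            // (2^64-1)^2 + 2*(2^64-1) < 2^128
            let acc = (ai as u128) * (bj as u128) + (out[i + j] as u128) + carry;
            out[i + j] = acc as u64;
            carry = acc >> 64;
        }
        out[i + b.len()] = carry as u64;
    }
    out
}

fn main() {
    // (2^64 - 1)^2 = 2^128 - 2^65 + 1, i.e. limbs [1, 2^64 - 2]
    assert_eq!(mul_limbs(&[u64::MAX], &[u64::MAX]), vec![1, u64::MAX - 1]);
    println!("ok");
}
```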
It would also be good to fit the types a little better to the problem than just
BigInt
, which I'm guessing is what you meant by refining the type toShrinkingInt
?
Yes, this is what I mean. We want to leverage the fact that bit lengths are strictly decreasing, to avoid operating on higher limbs that are guaranteed not to contain the MSB.
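To make the idea concrete, here is a hypothetical illustration (the name `ShrinkingInt` comes from the discussion above, but this struct and its methods are my invention, not the PR's code): track the active limb count, touch only active limbs, and shrink the count as high limbs zero out.

```rust
/// Illustration of the "shrinking" idea: reduction only makes values shorter,
/// so we track the active limb count and skip limbs above it.
struct ShrinkingInt {
    limbs: [u64; 4], // little-endian
    len: usize,      // active limbs; limbs[len..] are known to be zero
}

impl ShrinkingInt {
    fn new(limbs: [u64; 4]) -> Self {
        let mut len = 4;
        while len > 0 && limbs[len - 1] == 0 { len -= 1; }
        ShrinkingInt { limbs, len }
    }

    /// Subtract `other` (assumed <= self), touching only active limbs,
    /// then shrink `len` past any newly-zeroed high limbs.
    fn sub_assign(&mut self, other: &ShrinkingInt) {
        let mut borrow = 0u64;
        for i in 0..self.len {
            let (d1, b1) = self.limbs[i].overflowing_sub(other.limbs[i]);
            let (d2, b2) = d1.overflowing_sub(borrow);
            self.limbs[i] = d2;
            borrow = (b1 | b2) as u64;
        }
        while self.len > 0 && self.limbs[self.len - 1] == 0 { self.len -= 1; }
    }

    fn bit_len(&self) -> u32 {
        if self.len == 0 { 0 } else {
            64 * (self.len as u32 - 1) + (64 - self.limbs[self.len - 1].leading_zeros())
        }
    }
}

fn main() {
    let mut a = ShrinkingInt::new([5, 0, 0, 1]); // 2^192 + 5
    let b = ShrinkingInt::new([2, 0, 0, 1]);     // 2^192 + 2
    a.sub_assign(&b);                            // difference is 3
    assert_eq!(a.len, 1);                        // top limbs dropped out
    assert_eq!(a.bit_len(), 2);
    println!("ok");
}
```

Later operations then loop over `0..self.len` rather than all four limbs, which is where the savings inside `abglsv_pornin::mul` would come from.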
I've reworked the PR following @hdevalence's comments, and added an AVX2 backend. `abglsv_pornin::mul` is passing the fixed test case for both serial and AVX2, but is occasionally failing the random-values tests (often enough that the tests now fail consistently, due to trying 100 sets of random values).
About window sizes: there are several parameters in play, not all of which apply to the present case; notably, my default implementations strive to work on very small systems, and that means using very little RAM. For Curve9767, each point in a window uses 80 bytes (affine coordinates, each coordinate is a polynomial of degree less than 19, coefficients on 16 bits each, two dummy slots for alignment); if the windows collectively contain 16 points (for instance), then that's 1280 bytes of stack space, and for the very low-end of microcontrollers, that's too much (I must leave a few hundred bytes for the temporaries used in field element operations, and the calling application may also have needs). ROM/Flash size is also a constraint (though usually less severe), again encouraging using relatively small windows.
With a window of n bits, 2^(n-1) points must be stored (e.g. for a 4-bit window, this stores points P, 2P,... 8P, from which we can also dynamically obtain -P, -2P,... -8P). If using wNAF, we only need the odd multiples of these points (i.e. P, 3P, 5P and 7P for a 4-bit window), lowering the storage cost to 2^(n-2) points. In the signature verification, I have two dynamic windows to store: computing uA+vB+wC, with B being the generator but A and C dynamically obtained, I need one window for A and another for C. Therefore, if I want to use only 8 points (640 stack bytes), then I must stick to 4-bit windows. Static windows are in ROM, and there's more space there, but there's exponential growth; each 5-bit window is 1280 bytes, and there are two of them, so 2560 bytes of ROM for these.
In the x86/AVX2 implementation, for signature verification, I use 5-bit dynamic windows, and 7-bit static windows; for generic point multiplication (non-NAF, thus with also the even multiples), I have both static and dynamic 5-bit windows (four static windows for the base point). The static windows add up to 10240 bytes, which I think is a reasonable figure for big x86, since there will typically be about 32 kB of L1 cache: again, we must remember that the caller also has data in cache, and if we use up all the L1 cache for the signature verification, this may look good on benchmarks, but in practical situations this will induce cache misses down the line. We should therefore strive to use only a minority of the total L1 cache.
Note that Ed25519 points are somewhat larger than Curve9767 points: `AffineNielsPoint` is three field elements (so at least 96 bytes, possibly more depending on internal representation), while `ProjectiveNielsPoint` is four field elements (at least 128 bytes). Dynamic windows will use the latter.
About CPU cost: this is a matter of trade-offs. In wNAF, with n-bit windows, building the window will require 2^(n-2)-1 point additions, and will induce on average one point addition every n+1 bits. With 127-bit multipliers, this means that 4-bit windows need 28.4 point additions on average (for each window, not counting the 126 doublings), while 5-bit windows need about 28.2. With Curve9767, the latter is better (if you have the RAM) for another reason which is not applicable to Ed25519: long sequences of point doublings are slightly more efficient, and longer windows increase the average length of runs of doublings. This benefit does not apply to Ed25519. Thus, for dynamic windows and Ed25519, I'd say that 4-bit and 5-bit wNAF windows should be about equivalent (5-bit windows would be better if using 252-bit multipliers).
With static windows, there is no CPU cost in building windows, and larger windows are better, but there are diminishing returns. Going from 7-bit to 8-bit windows would save less than two point additions, possibly not worth the effort unless you are aiming at breaking the record in a microbenchmark context which will be meaningless in real situations.
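The arithmetic above is easy to reproduce. A small sketch (my code, using the cost model stated above: 2^(n-2)-1 additions to build a dynamic window, plus one addition per n+1 bits on average, and a dynamic table of 2^(n-2) odd multiples at 128 bytes per `ProjectiveNielsPoint`):

```rust
/// Average point additions for one width-n wNAF window over a `bits`-bit
/// multiplier: window build cost plus the main-loop addition rate.
fn wnaf_cost(n: u32, bits: f64) -> f64 {
    ((1u32 << (n - 2)) - 1) as f64 + bits / (n as f64 + 1.0)
}

/// Bytes for one dynamic wNAF table of odd multiples, at `point_size` bytes
/// per stored point (128 for ProjectiveNielsPoint per the discussion above).
fn dyn_table_bytes(n: u32, point_size: usize) -> usize {
    (1usize << (n - 2)) * point_size
}

fn main() {
    // 127-bit multipliers: the half-size scalars after lattice reduction.
    assert!((wnaf_cost(4, 127.0) - 28.4).abs() < 0.05); // 3 + 127/5
    assert!((wnaf_cost(5, 127.0) - 28.2).abs() < 0.05); // 7 + 127/6

    // Memory is where the widths diverge sharply.
    assert_eq!(dyn_table_bytes(5, 128), 1024); //  8 points
    assert_eq!(dyn_table_bytes(8, 128), 8192); // 64 points per table

    println!("ok");
}
```

This is consistent with the suggestion that width-8 tables buy little over width-5 while costing eight times the memory per table.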
Force-pushed to fix the serial and vector Straus impls, which were not correctly checking for the first non-zero `d_1` bit. The tests now pass.
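For readers unfamiliar with the digit stream the Straus loop consumes, here is a textbook width-w NAF recoding (my sketch, not the crate's implementation): each digit is zero or odd in (-2^(w-1), 2^(w-1)), so runs of zero digits let the loop do doublings only.

```rust
/// Width-w NAF recoding of a non-negative integer, least significant
/// digit first. Invariant: sum(d_i * 2^i) == k, every non-zero digit is
/// odd, and |d_i| < 2^(w-1).
fn wnaf(mut k: i64, w: u32) -> Vec<i64> {
    let modulus = 1i64 << w;    // 2^w
    let half = 1i64 << (w - 1); // 2^(w-1)
    let mut digits = Vec::new();
    while k != 0 {
        if k & 1 == 1 {
            let mut d = k % modulus; // k mod 2^w, in [0, 2^w)
            if d >= half {
                d -= modulus;        // centre into (-2^(w-1), 2^(w-1))
            }
            digits.push(d);
            k -= d;                  // k is now divisible by 2^w... at least by 2
        } else {
            digits.push(0);
        }
        k >>= 1;
    }
    digits
}

fn main() {
    let k = 1122334455i64; // arbitrary example value
    for w in 2..=8 {
        let digits = wnaf(k, w);
        // Reconstruct: k == sum(d_i * 2^i)
        let recon: i64 = digits.iter().enumerate().map(|(i, &d)| d << i).sum();
        assert_eq!(recon, k);
        // Non-zero digits are odd and within the window bound.
        assert!(digits.iter().all(|&d| d == 0 || (d & 1 == 1 && d.abs() < (1 << (w - 1)))));
    }
    println!("ok");
}
```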
This PR was previously based on release 2.0.0. I've rebased it onto `release/4.0` as that's where development work is currently directed, but I can also rebase instead onto `release/3.2` if there is a desire to get this out for existing users (as the API introduced by this PR is a pure addition).

Assuming this PR is merged, I plan to make a separate PR to `release/4.0` removing `vartime_double_scalar_mul_basepoint` and naming this API as the replacement per the above discussion (as the API introduced by this PR is around 13% faster).
Fixed the simple bugs, but CI is still failing because between 2.0.0 and `release/4.0` a bunch more backends were added to CI, and now I need to generate `[2^128]B` tables for them all.
Moving the todo list out of the top post:
- [x] Rework external APIs to check an equality instead of exposing the output of `abglsv_pornin::mul`.
- [x] Add RistrettoElement API.
- [x] Add AVX2 backend.
- [ ] Replace u64 `BigInt` with a u128-based implementation.
- [ ] Convert `BigInt` into `ShrinkingInt` to take advantage of the strictly-decreasing bit lengths within `abglsv_pornin::mul`.
The last two items are not blockers for this PR.
Moving the old rough benchmarks (Xeon E-2176M CPU @ 2.70GHz, no attempt at disabling hyperthreading) out of the top post:
```text
Variable-time aA+bB, A variable, B fixed
        time:   [42.769 us 42.924 us 43.100 us]
        change: [-0.8402% +0.5668% +1.9383%] (p = 0.44 > 0.05)
        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

Variable-time δ(aA+bB-C), A&C variable, B fixed
        time:   [37.051 us 37.328 us 37.603 us]
        change: [-0.4643% +0.2679% +0.9958%] (p = 0.48 > 0.05)
        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```
And some more rough benchmarks:
AMD Ryzen 9 5950X

Serial: 17.2% decrease

```text
Variable-time aA+bB, A variable, B fixed
        time:   [34.671 µs 34.687 µs 34.704 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Variable-time 8(aA+bB)=8C, A&C variable, B fixed
        time:   [28.711 µs 28.734 µs 28.765 µs]
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
```

AVX2: 7.7% decrease (change percentages are relative to the serial run above)

```text
Variable-time aA+bB, A variable, B fixed
        time:   [22.732 µs 22.749 µs 22.767 µs]
        change: [-34.507% -34.458% -34.409%] (p = 0.00 < 0.05)
        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Variable-time 8(aA+bB)=8C, A&C variable, B fixed
        time:   [20.992 µs 21.004 µs 21.016 µs]
        change: [-27.054% -26.997% -26.939%] (p = 0.00 < 0.05)
        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
```

MacBook Air M1

Serial: 14% decrease

```text
Variable-time aA+bB, A variable, B fixed
        time:   [29.679 µs 29.708 µs 29.738 µs]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

Variable-time 8(aA+bB)=8C, A&C variable, B fixed
        time:   [25.526 µs 25.551 µs 25.581 µs]
Found 17 outliers among 100 measurements (17.00%)
  6 (6.00%) high mild
  11 (11.00%) high severe
```
I'm going to rebase this onto `main` shortly, and if I have time I'll address the other two (non-blocking) issues raised above.
Rebased on `main` and fixed the resulting merge conflicts.
Force-pushed to fix post-rebase bugs and get CI passing.
Force-pushed to add changelog entries and fix documentation.
Force-pushed to move the new generated serial tables into separate submodules, add cfg-flagged tests that generate them, and add a CI job that verifies them. If this works, I'll attempt to replicate this for the vector tables.
Force-pushed to fix the Fiat backends, and adjust the new CI check to fail if the table generators do nothing (as they generate output that is incorrectly formatted, and thus detectable).
Force-pushed to implement a similar kind of generator approach for the AVX2 vector table. It doesn't currently work because the `Debug` impl for `u32x8` doesn't print out values that are suitable for use in its constructor; additional post-processing will be required.
Force-pushed to fix the AVX2 table generator. The generated constant is concretely different from before (I presume something changed about the wNAF implementation in the intervening four years), but tests pass before and after the change (and I checked that mutating either version of the constant causes a test to fail).
Force-pushed to implement a generator for the IFMA vector table, based on the working AVX2 generator. It should work, but I don't have the hardware to run it, and so the IFMA constants remain invalid. Someone with compatible hardware needs to run the following commands on this branch:
```sh
$ RUSTFLAGS="--cfg curve25519_dalek_generate_tables" cargo +nightly test --all-features table_generators
$ cargo fmt
```
and then provide the resulting diff to the IFMA table.