
Implement ABGLSV-Pornin multiplication

Open str4d opened this issue 4 years ago • 19 comments

Adds a backend for computing δ(aA + bB - C) in variable time, where:

  • B is the Ed25519 basepoint;
  • δ is a value invertible mod ℓ (the prime order of the basepoint), which is selected internally by the function.

This corresponds to the signature verification optimisation presented in Antipa et al 2005. It uses Algorithm 4 from Pornin 2020 to find a suitable short vector, and then windowed non-adjacent form Straus for the resulting multiscalar multiplication.
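
For concreteness, cofactored Ed25519 verification is an instance of this form: with k = H(R ‖ A ‖ M), checking [8][s]B = [8]R + [8][k]A amounts to checking that 8·([-k]A + [s]B - R) is the identity, i.e. a = -k, b = s, C = R. A rough sketch using the crate's existing public API (not this PR's backend; the input names are illustrative):

    use curve25519_dalek::{edwards::EdwardsPoint, scalar::Scalar, traits::IsIdentity};

    // Sketch only: cofactored Ed25519 verification written in the aA + bB - C
    // shape, with a = -k, b = s, C = R. The backend in this PR computes the
    // same kind of expression (scaled by an internal δ) more efficiently.
    #[allow(non_snake_case)]
    fn verify_sketch(A: &EdwardsPoint, R: &EdwardsPoint, s: &Scalar, k: &Scalar) -> bool {
        // [-k]A + [s]B via the existing double-base multiplication ...
        let lhs = EdwardsPoint::vartime_double_scalar_mul_basepoint(&-k, A, s);
        // ... minus C = R, cofactor-cleared, and compared against the identity.
        (lhs - R).mul_by_cofactor().is_identity()
    }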

References:

  • Antipa et al 2005: http://cacr.uwaterloo.ca/techreports/2005/cacr2005-28.pdf
  • Pornin 2020: https://eprint.iacr.org/2020/454

str4d avatar May 04 '20 14:05 str4d

This is really cool! A few comments / questions based on a quick read of the source code:

  1. I'm a little uncertain about the type signature of vartime_triple_scalar_mul_basepoint. Specifically, it returns an EdwardsPoint with the value of [δa]A + [δb]B - [δ]C, but since δ is selected internally to the function and isn't returned, this value isn't useful to the caller. All that they can usefully do is check whether the result is the identity (maybe after multiplication by the cofactor).

    So, perhaps the API would be simpler if, rather than being conceptualized as a new triple-base scalar multiplication, it was conceptualized as a more efficient implementation of the check:

    /// Checks whether \\([a]A + [b]B = C\\) in variable time.
    pub fn vartime_check_double_scalar_mul_basepoint(
            a: &Scalar,
            A: &EdwardsPoint,
            b: &Scalar,
            C: &EdwardsPoint,
        ) -> bool { /* ... */ }
    

    This can use a triple-base function like the one you already have internally, but the advantage is that in the exposed API surface of the crate, we only commit to the functionality, and not the method. The EdwardsPoint version of the API can do a cofactor multiplication on the result of the internal function, but the RistrettoPoint version doesn't have to.

  2. If we conceptualize it that way, an obvious second question is: what is the purpose of vartime_double_scalar_mul_basepoint? Well, it was exposed for the sole purpose of providing the functionality of vartime_check_double_scalar_mul_basepoint, but because it didn't describe the problem at a high-enough level (it gives an intermediate value that the user is supposed to compare), we can't transparently drop in a better implementation. (Avoiding exactly this kind of problem is the motivation for the change suggested above).

    However, a code search of all of GitHub reveals that with one exception, every single user of this function could be using the faster check function instead, so I think we got the API wrong, and if this code is merged, I would like to deprecate and/or remove the non-checked version. The exception is a Bulletproofs implementation that would be faster if it used the generic multiscalar multiplication implementation. This would require a new major version, but because the breaking change is locally scoped and in the direction of existing code, I don't think this is a big deal provided someone (e.g., me) is willing to do the work of submitting patches to all downstream crates.

  3. If vartime_double_scalar_mul_basepoint isn't there (or won't be in the future), a third question is: what parts of its implementation are carried over to this one, which parts should be, and which parts can be dropped from the codebase? One thing that sticks out as a carryover is the width-8 NAF lookup tables. These were added to the previous implementation as an easy way to get a slight speedup without much algorithmic work, at the cost of additional memory usage. That cost isn't just binary size: larger tables are also more expensive to access at runtime, which is why they don't work well for very large multiscalar multiplications. Now that the algorithm is different, is there still a significant benefit to using the width-8 instead of width-5 tables? Can we use this change as an opportunity to save memory? It would be good to have some empirical numbers, but because they're microbenchmarks it's hard to judge the right decision from those numbers alone (e.g., filling the entire cache with lookup tables is great for microbenchmarks, but real applications do other work with other data that those tables evict).

  4. The algorithm requires an implementation of some big-integer arithmetic. Because this is only used once to prepare the inputs to the multiscalar function, it seems like its performance is less critical than other arithmetic, so I wonder whether it would make sense to have a single implementation using u128s and rely on the compiler to lower it to machine code. This may be slightly less efficient, but it has a huge saving in code complexity and maintainability. So I don't think it's necessary to implement two versions of the lattice reduction code for different architectures. It would also be good to fit the types a little better to the problem than just BigInt, which I'm guessing is what you meant by refining the type to ShrinkingInt?

    I'm not exactly sure how the code would end up being factored across the different backends, but my guess would be that it would look something like:

    • backend-agnostic code to do the lattice reduction, maybe in the scalar subtree;
    • an implementation of the triple-base mul in the backend::serial subtree;
    • an implementation of the triple-base mul in the backend::vector subtree.

    Does that seem right?

hdevalence avatar May 04 '20 16:05 hdevalence

So, perhaps the API would be simpler if, rather than being conceptualized as a new triple-base scalar multiplication, it was conceptualized as a more efficient implementation of the check: The EdwardsPoint version of the API can do a cofactor multiplication on the result of the internal function, but the RistrettoPoint version doesn't have to.

I did consider this API, but wasn't sure whether there were any cases where we would want to not use cofactor multiplication on the result for EdwardsPoint. If we're happy making that an internal consideration (or maybe a boolean flag or parallel API), then yes this is definitely simpler.
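
To make that concrete, here's roughly the EdwardsPoint flavour I have in mind, sketched against today's public API (the new backend would replace the body; this is not the PR's code):

    use curve25519_dalek::{edwards::EdwardsPoint, scalar::Scalar, traits::IsIdentity};

    /// Checks whether \\([a]A + [b]B = C\\) in variable time (sketch).
    #[allow(non_snake_case)]
    pub fn vartime_check_double_scalar_mul_basepoint(
        a: &Scalar,
        A: &EdwardsPoint,
        b: &Scalar,
        C: &EdwardsPoint,
    ) -> bool {
        // Compute [a]A + [b]B - C, then clear the cofactor before the identity
        // check so that small-torsion components cannot affect the verdict.
        // The RistrettoPoint flavour can skip the cofactor handling.
        let r = EdwardsPoint::vartime_double_scalar_mul_basepoint(a, A, b) - C;
        r.mul_by_cofactor().is_identity()
    }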

I think we got the API wrong, and if this code is merged, I would like to deprecate and/or remove the non-checked version.

Yep, this is also what I'd like. I kept the prior API initially so I had something to benchmark against 😄

Now that the algorithm is different, is there still a significant benefit to using the width-8 instead of width-5 tables? Can we use this change as an opportunity to save memory?

It looks like in Curve9767, @pornin uses width-4 for runtime-calculated tables, and width-5 for pre-computed tables. IDK if he has relevant benchmarks, but it's another data point towards dropping width-8. I think it would make sense to examine this in a subsequent PR, separately from this change.

The algorithm requires an implementation of some big-integer arithmetic. Because this is only used once to prepare the inputs to the multiscalar function, it seems like its performance is less critical than other arithmetic, so I wonder whether it would make sense to have a single implementation using u128s and rely on the compiler to lower it to machine code.

The input preparation is performance-critical, in the sense that the pre-Pornin decomposition algorithms were slow enough that the reduction in doublings could not offset their cost (which led to the Ed25519 paper dismissing ABGLSV and using double-base scalar mult instead). That said, a u128-based impl may still be performant enough to retain the overall benefit on platforms that would normally use the u32 backend.

I had originally started writing BigInt64 using u128s, but was not sure whether that was acceptable inside the u64 backend. I'll rework it as a backend-independent implementation instead.
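
Something along these lines for the limb arithmetic (a rough sketch, not the actual BigInt code; the helper name is made up):

    /// 64x64 -> 128-bit multiply-accumulate with carry, written with u128 so
    /// the compiler can lower it to the best sequence for each target.
    #[inline]
    fn mac(a: u64, b: u64, acc: u64, carry: u64) -> (u64, u64) {
        let wide = (a as u128) * (b as u128) + (acc as u128) + (carry as u128);
        (wide as u64, (wide >> 64) as u64) // (low limb, carry out)
    }

    fn main() {
        // (2^64 - 1)^2 + 1 = (2^64 - 2) * 2^64 + 2
        assert_eq!(mac(u64::MAX, u64::MAX, 1, 0), (2, u64::MAX - 1));
    }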

It would also be good to fit the types a little better to the problem than just BigInt, which I'm guessing is what you meant by refining the type to ShrinkingInt?

Yes, this is what I mean. We want to leverage the fact that the bit lengths are strictly decreasing, to avoid operating on higher limbs that are guaranteed not to contain the MSB.
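
Roughly the shape I have in mind for ShrinkingInt (a sketch only; names and limb count are illustrative):

    /// Big integer whose set of potentially non-zero limbs can only shrink.
    struct ShrinkingInt {
        limbs: [u64; 8],
        len: usize, // limbs[len..] are known to be zero; this never grows
    }

    impl ShrinkingInt {
        /// Tighten the bound by dropping high limbs that have become zero, so
        /// subsequent arithmetic loops can stop at `len` instead of 8.
        fn shrink(&mut self) {
            while self.len > 0 && self.limbs[self.len - 1] == 0 {
                self.len -= 1;
            }
        }

        /// Bit length of the value, assuming `shrink` has been called.
        fn bit_len(&self) -> u32 {
            match self.len {
                0 => 0,
                n => 64 * (n as u32 - 1) + (64 - self.limbs[n - 1].leading_zeros()),
            }
        }
    }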

str4d avatar May 04 '20 23:05 str4d

I've reworked the PR following @hdevalence's comments, and added an AVX2 backend.

abglsv_pornin::mul is passing the fixed test case for both serial and AVX2, but occasionally fails on random inputs (often enough that, with 100 sets of random values per run, the tests now fail consistently).

str4d avatar May 05 '20 07:05 str4d

About window sizes: there are several parameters in play, not all of which apply to the present case; notably, my default implementations strive to work on very small systems, and that means using very little RAM. For Curve9767, each point in a window uses 80 bytes (affine coordinates, each coordinate is a polynomial of degree less than 19, coefficients on 16 bits each, two dummy slots for alignment); if the windows collectively contain 16 points (for instance), then that's 1280 bytes of stack space, and for the very low-end of microcontrollers, that's too much (I must leave a few hundred bytes for the temporaries used in field element operations, and the calling application may also have needs). ROM/Flash size is also a constraint (though usually less severe), again encouraging using relatively small windows.

With a window of n bits, 2^(n-1) points must be stored (e.g. for a 4-bit window, this stores points P, 2P,... 8P, from which we can also dynamically obtain -P, -2P,... -8P). If using wNAF, we only need the odd multiples of these points (i.e. P, 3P, 5P and 7P for a 4-bit window), lowering the storage cost to 2^(n-2) points. In the signature verification, I have two dynamic windows to store: computing uA+vB+wC, with B being the generator but A and C dynamically obtained, I need one window for A and another for C. Therefore, if I want to use only 8 points (640 stack bytes), then I must stick to 4-bit windows. Static windows are in ROM, and there's more space there, but there's exponential growth; each 5-bit window is 1280 bytes, and there are two of them, so 2560 bytes of ROM for these.

In the x86/AVX2 implementation, for signature verification, I use 5-bit dynamic windows, and 7-bit static windows; for generic point multiplication (non-NAF, thus also with the even multiples), I have both static and dynamic 5-bit windows (four static windows for the base point). The static windows add up to 10240 bytes, which I think is a reasonable figure for big x86, since there will typically be about 32 kB of L1 cache: again, we must remember that the caller also has data in cache, and if we use up all the L1 cache for the signature verification, this may look good on benchmarks, but in practical situations it will induce cache misses down the line. We should therefore strive to use only a minority of the total L1 cache.

Note that Ed25519 points are somewhat larger than Curve9767 points: AffineNielsPoint is three field elements (so at least 96 bytes, possibly more depending on internal representation), while ProjectiveNielsPoint is four field elements (at least 128 bytes). Dynamic windows will use the latter.

About CPU cost: this is a matter of trade-offs. In wNAF, with n-bit windows, building the window will require 2^(n-2) - 1 point additions, and will induce on average one point addition every n+1 bits. With 127-bit multipliers, this means that 4-bit windows need 28.4 point additions on average (for each window, not counting the 126 doublings), while 5-bit windows need about 28.2. With Curve9767, the latter is better (if you have the RAM) for another reason which is not applicable to Ed25519: long sequences of point doublings are slightly more efficient, and longer windows increase the average length of runs of doublings. This benefit does not apply to Ed25519. Thus, for dynamic windows and Ed25519, I'd say that 4-bit and 5-bit wNAF windows should be about equivalent (5-bit windows would be better if using 252-bit multipliers).
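
(As a quick standalone check of those figures; the helper name is illustrative:)

    // Average wNAF additions: 2^(w-2) - 1 to build the window, plus roughly
    // one addition per (w+1) bits of the multiplier.
    fn wnaf_adds(w: u32, multiplier_bits: f64) -> f64 {
        ((1u64 << (w - 2)) - 1) as f64 + multiplier_bits / (w as f64 + 1.0)
    }

    fn main() {
        println!("w=4, 127-bit: {:.1}", wnaf_adds(4, 127.0)); // ≈ 28.4
        println!("w=5, 127-bit: {:.1}", wnaf_adds(5, 127.0)); // ≈ 28.2
        println!("w=4, 252-bit: {:.1}", wnaf_adds(4, 252.0)); // ≈ 53.4
        println!("w=5, 252-bit: {:.1}", wnaf_adds(5, 252.0)); // ≈ 49.0, better for full-size multipliers
    }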

With static windows, there is no CPU cost in building windows, and larger windows are better, but there are diminishing returns. Going from 7-bit to 8-bit windows would save less than two point additions, possibly not worth the effort unless you are aiming at breaking the record in a microbenchmark context which will be meaningless in real situations.

pornin avatar May 05 '20 12:05 pornin

Force-pushed to fix the serial and vector Straus impls, which were not correctly checking for the first non-zero d_1 bit. The tests now pass.

str4d avatar May 08 '20 22:05 str4d

This PR was previously based on release 2.0.0. I've rebased it onto release/4.0 as that's where development work is currently directed, but I can also rebase instead onto release/3.2 if there is a desire to get this out for existing users (as the API introduced by this PR is a pure addition).

Assuming this PR is merged, I plan to make a separate PR to release/4.0 removing vartime_double_scalar_mul_basepoint and naming this API as the replacement per the above discussion (as the API introduced by this PR is around 13% faster).

str4d avatar Dec 12 '22 11:12 str4d

Fixed the simple bugs, but CI is still failing because between 2.0.0 and release/4.0 a bunch more backends were added to CI, and now I need to generate [2^128]B tables for them all.

str4d avatar Dec 12 '22 12:12 str4d

Moving the todo list out of the top post:

  • [x] Rework external APIs to check an equality instead of exposing the output of abglsv_pornin::mul.
  • [x] Add RistrettoElement API.
  • [x] Add AVX2 backend.
  • [ ] Replace u64 BigInt with a u128-based implementation.
  • [ ] Convert BigInt into ShrinkingInt to take advantage of the strictly-decreasing bit lengths within abglsv_pornin::mul.

The last two items are not blockers for this PR.

str4d avatar Dec 12 '22 13:12 str4d

Moving the old rough benchmarks (Xeon E-2176M CPU @ 2.70GHz, no attempt at disabling hyperthreading) out of the top post:

Variable-time aA+bB, A variable, B fixed                        
                        time:   [42.769 us 42.924 us 43.100 us]
                        change: [-0.8402% +0.5668% +1.9383%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

Variable-time δ(aA+bB-C), A&C variable, B fixed                         
                        time:   [37.051 us 37.328 us 37.603 us]
                        change: [-0.4643% +0.2679% +0.9958%] (p = 0.48 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

str4d avatar Dec 12 '22 13:12 str4d

And some more rough benchmarks:

AMD Ryzen 9 5950X

Serial: 17.2% decrease

Variable-time aA+bB, A variable, B fixed
                        time:   [34.671 µs 34.687 µs 34.704 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Variable-time 8(aA+bB)=8C, A&C variable, B fixed
                        time:   [28.711 µs 28.734 µs 28.765 µs]
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

AVX2: 7.7% decrease

Change percentages are relative to the serial run above.

Variable-time aA+bB, A variable, B fixed
                        time:   [22.732 µs 22.749 µs 22.767 µs]
                        change: [-34.507% -34.458% -34.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Variable-time 8(aA+bB)=8C, A&C variable, B fixed
                        time:   [20.992 µs 21.004 µs 21.016 µs]
                        change: [-27.054% -26.997% -26.939%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

MacBook Air M1

Serial: 14% decrease

Variable-time aA+bB, A variable, B fixed
                        time:   [29.679 µs 29.708 µs 29.738 µs]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

Variable-time 8(aA+bB)=8C, A&C variable, B fixed
                        time:   [25.526 µs 25.551 µs 25.581 µs]
Found 17 outliers among 100 measurements (17.00%)
  6 (6.00%) high mild
  11 (11.00%) high severe

str4d avatar Dec 12 '22 13:12 str4d

I'm going to rebase this onto main shortly, and if I have time I'll address the other two (non-blocking) issues raised above.

str4d avatar Mar 28 '24 19:03 str4d

Rebased on main and fixed the resulting merge conflicts.

str4d avatar Mar 28 '24 21:03 str4d

Force-pushed to fix post-rebase bugs and get CI passing.

str4d avatar Mar 29 '24 13:03 str4d

Force-pushed to add changelog entries and fix documentation.

str4d avatar Mar 29 '24 14:03 str4d

Force-pushed to move the newly generated serial tables into separate submodules, add cfg-flagged tests that generate them, and add a CI job that verifies them. If this works, I'll attempt to replicate this for the vector tables.

str4d avatar Mar 30 '24 16:03 str4d

Force-pushed to fix the Fiat backends, and to adjust the new CI check so that it fails if the table generators do nothing (the generators emit incorrectly formatted output, so it is detectable whether they actually ran).

str4d avatar Mar 30 '24 20:03 str4d

Force-pushed to implement a similar kind of generator approach for the AVX2 vector table. It doesn't currently work because the Debug impl for u32x8 doesn't print out values that are suitable for use in its constructor; additional post-processing will be required.

str4d avatar Mar 30 '24 21:03 str4d

Force-pushed to fix the AVX2 table generator. The generated constant is concretely different from before (I presume something changed about the wNAF implementation in the intervening four years), but tests pass before and after the change (and I checked that mutating either version of the constant causes a test to fail).

str4d avatar Apr 15 '24 07:04 str4d

Force-pushed to implement a generator for the IFMA vector table, based on the working AVX2 generator. It should work, but I don't have the hardware to run it, and so the IFMA constants remain invalid. Someone with compatible hardware needs to run the following commands on this branch:

$ RUSTFLAGS="--cfg curve25519_dalek_generate_tables" cargo +nightly test --all-features table_generators
$ cargo fmt

and then provide the resulting diff to the IFMA table.

str4d avatar Apr 15 '24 07:04 str4d