secp256k1 Signed-digit multi-comb ecmult

A third iteration of the signed-digit multi-comb ecmult_gen algorithm (earlier attempts: #693, and #546 by Peter Dettman). Short summary:

A new constant-time point multiplication algorithm with precomputation (so only used for multiply with G).
Based on section 3.3 of https://eprint.iacr.org/2012/309 by Mike Hamburg.
Configurable through two parameters: COMB_BLOCKS and COMB_TEETH
- Currently only 3 predefined configurations reachable through ./configure. All three are included in precomputed_ecmult_gen.c and tested in CI.
  - --with-ecmult-gen-kb=2: 2 kB table with COMB_BLOCKS=2 COMB_TEETH=5
  - --with-ecmult-gen-kb=22: 22 kB table with COMB_BLOCKS=11 COMB_TEETH=6
  - --with-ecmult-gen-kb=86: 86 kB table with COMB_BLOCKS=43 COMB_TEETH=6
- Many more configurations can be reached by manually setting the macros. These are not tested.

Compared with the previous PR #693:

Updated to the new static-precomputation-only model (#893).
Just 3 curated configurations reachable through configure.
Removed some optimizations that do not matter (much).
Do blinding through an final correction add rather than an initial start point, which may later permit usage of incomplete addition formulae (#1051).
The recoding of the input scalar to signed bit representation is done slightly differently, which needs fewer special cases.

Dec 29 '21 20:12 sipa

Benchmarks:

AMD Ryzen 5950X, GCC 11.2.0, default compile options

Using ./autogen.sh && ./configure --enable-experimental --enable-module-schnorrsig && make clean && make -j check && SECP256K1_BENCH_ITERS=1000000 ./bench schnorrsig_sign

master:       schnorrsig_sign               ,    18.1       ,    18.1       ,    18.1    
pr1058 kb=2:  schnorrsig_sign               ,    17.9       ,    17.9       ,    17.9 
pr1058 kb=22: schnorrsig_sign               ,    15.1       ,    15.1       ,    15.1    
pr1058 kb=86: schnorrsig_sign               ,    14.4       ,    14.4       ,    14.4

Intel Core I7-7820HQ @ 2.3 GHz, GCC 10.3.0, default compile options

Using ./autogen.sh && ./configure --enable-experimental --enable-module-schnorrsig && make clean && make -j check && SECP256K1_BENCH_ITERS=1000000 ./bench schnorrsig_sign

master:       schnorrsig_sign               ,    38.8       ,    38.9       ,    38.9
pr1058 kb=2:  schnorrsig_sign               ,    39.4       ,    39.4       ,    39.4
pr1058 kb=22: schnorrsig_sign               ,    33.2       ,    33.3       ,    33.3
pr1058 kb=86: schnorrsig_sign               ,    32.4       ,    32.4       ,    32.4

ARM Cortex-A53 @ 1 GHz, GCC 9.3.0, default compile options

Using ./autogen.sh && ./configure --enable-experimental --enable-module-schnorrsig && make clean && make -j check && SECP256K1_BENCH_ITERS=100000 ./bench schnorrsig_sign

master:       schnorrsig_sign               ,   249.0       ,   249.0       ,   249.0
pr1058 kb=2:  schnorrsig_sign               ,   250.0       ,   250.0       ,   250.0 
pr1058 kb=22: schnorrsig_sign               ,   200.0       ,   200.0       ,   200.0
pr1058 kb=86: schnorrsig_sign               ,   192.0       ,   192.0       ,   192.0

Dec 29 '21 23:12 sipa

I occurs to me that we could actually avoid the cost of doing the scalar halving at ecmult_gen time, by instead having precomputed tables with multiples of G/2 instead of G.

Dec 30 '21 14:12 sipa

I occurs to me that we could actually avoid the cost of doing the scalar halving at ecmult_gen time, by instead having precomputed tables with multiples of G/2 instead of G.

The scalar halving also ensures that the low bit (the shifted-away bit) is 0.

Dec 30 '21 15:12 peterdettman

I don't think that matters? Even/odd has no special meaning when working modulo a prime.

The PR currently uses the bits of scalar (input + 2^COMB_BITS - 1 - blind) * 2^-1 to select multiples of G to add together.

My suggestion is that instead it could use the bits of (input + 2^COMB_BITS - 1 - blind) to select multiples of G/2 to add together.

Dec 30 '21 15:12 sipa

It seems I'm wrong, but I'm confused why!

Dec 30 '21 15:12 sipa

I think confusion is creeping in since after the halving the scalar isn't a scalar value anymore; it's in signed-digit form, which can only represent an odd value. In particular the bits of a scalar s in signed-digit form represent the scalar value 2*s+1. I think you should also be careful not to reason modulo the order in signed-digit form.

Edit: Actually, even that might not be quite complete since the high bit is being treated specially here.

Dec 30 '21 15:12 peterdettman

It works, but you need to use the bits of input - blind + (2^COMB_BITS - 1)/2 instead. That's what you get when you substitute 2*(input-blind) for e in the formula in the paper (because now we're trying to compute input*G = 2*(input-blind)*(G/2) + blind*G).

Dec 30 '21 15:12 sipa

Oh I see, using double the input and the blind (of G/2) lets the halving be moved to precomputation. Nice.

Dec 30 '21 15:12 peterdettman

Updated to use the avoid-halving-scalar trick.

Dec 30 '21 17:12 sipa

As an exercise I added a 2 blocks, 4 teeth configuration:

add {2,4} to CONFIGS table in precompute_ecmult_gen.c; nice and declarative.
make precompute_ecmult_gen and run to get new precomputed_ecmult_gen.c which correctly includes new {2,4} option.
modify configure.ac to support new "1" option for ecmult-gen-kb
configure --with-ecmult-gen-kb=1, confirm in libsecp256k1-config.h that COMB_BLOCKS, COMB_TEETH are correct.
make clean, make, tests to confirm it's working

This all worked without isues and was reasonably minimal effort. This particular example also satisfied me that there is no issue with combs covering exactly 256 bits.

Dec 31 '21 07:12 peterdettman

@peterdettman FWIW, the easiest way of achieving the same would be:

Modify configure.ac to support a new option
Run configure with that option
make clean-precomp && make normally

Because precompute_ecmult_gen is automatically built & run when a precomputed_*.c file is missing, and because precompute_ecmult_gen will always include the table for whatever the configuration is.

Dec 31 '21 15:12 sipa

Having some confusion over possible bad interactions between modular reduction and signed-digit form, I wanted to reason things out in reverse.

For point P of order N, scalar s in [0,N), and an L-bit comb (2^L > N), let C(s,P) be the value calculated by our comb, which considers the bits of s, zero-extended to L bits as signed digits. Then:

    C(s,P) == (2.s - (2^L - 1)) * P

Therefore in order to use the comb to calculate k * G, we solve for the scalar t to use:

    k * G == C(t,G) == (2.t - (2^L - 1)) * G 
=>  k == 2.t - (2^L - 1) mod N
=>  t == (k + (2^L - 1))/2 mod N

Can we skip that halving and use G/2 instead?

    C(2t,G/2) == (4.t - (2^L - 1)) * G/2
              == (2.t - (2^L - 1)/2) * G
              != C(t,G) unless 2^L == 1 mod N

So no, but let's back up a step and ask what scalar u to use in the comb to calculate k * G as 2.k * (G/2):

    2.k * (G/2) == C(u,G/2) == (2.u - (2^L - 1)) * G/2
=>  2.k == 2.u - (2^L - 1) mod N
=>  u == k + (2^L - 1)/2 mod N

and since L is constant, the halving is now only needed in the precomputation.

Scalar blinding (using b * G == B):

    k * G == (k - b) * G + B == 2.(k - b) * (G/2) + B == C(k - b + (2^L - 1)/2, G/2) + B

where -b + (2^L - 1)/2 can be precomputed.

Jan 01 '22 12:01 peterdettman

Nice, that's much better explained than my current comments. I'll try to include it.

Jan 01 '22 12:01 sipa

Although the above satisfies me mathematically, I would like to see an explicit test case where the comb_offset (2^L - 1)/2 causes a modular reduction (relative to k-b). e.g. arrange for k-b == N + 1 - comb_offset. I hope that's not too painful, but otherwise random testing seems unlikely to hit such a case.

Jan 01 '22 12:01 peterdettman

@peterdettman Making this change causes instant failure during the tests, at least:

--- a/src/ecmult_gen_impl.h
+++ b/src/ecmult_gen_impl.h
@@ -78,7 +78,7 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
      */
 
     /* Compute the scalar (gn + ctx->scalar_offset). */
-    secp256k1_scalar_add(&tmp, &ctx->scalar_offset, gn);
+    CHECK(!secp256k1_scalar_add(&tmp, &ctx->scalar_offset, gn));
     /* Convert to recoded array. */
     for (i = 0; i < 8; ++i) {
         recoded[i] = secp256k1_scalar_get_bits(&tmp, 32 * i, 32);

Including a test case that results in tmp=0 or tmp=1 or so may be useful though.

Jan 01 '22 16:01 sipa

scalar_offset includes the randomly-distributed blind, so there will be modular reductions. This isn't concerning at all because we can reason about the blind independently of the comb (just input offset + output offset).

However the comb offset is small: 2^(COMB_BITS-256) * (2^256 - N). So I would like a test that involves the comb_offset itself causing a modular reduction. The math checks out of course, but an explicit test can't hurt.

Jan 01 '22 16:01 peterdettman

Actually, I think (2^L - 1)/2 mod N is only small (128 bits) if L == 256.

Jan 01 '22 16:01 peterdettman

For the 43/6/1 configuration it is 0x80000000000000000000000000000001e7f9b4a5f9130fa66044722cc7ae9e1e For the 11/6/4 configuration it is 0x8000000000000000000000000000000987e0873ddd5f4e3fe1563adfe6691698

Jan 01 '22 16:01 sipa

Including a test case that results in tmp=0 or tmp=1 or so may be useful though.

So, these and maybe -1 would be enough. I will check a 256-bit config manually if they are added.

Jan 01 '22 16:01 peterdettman

@peterdettman I've incorporated your derivation in the comments in ecmult_gen_impl.h, and added a test case for recoded={-1,0,1}.

Jan 01 '22 18:01 sipa

I've worked on an additional change that introduces a COMB_RANGE which is normally 256, but in exhaustive test mode corresponds to the number of bits in EXHAUSTIVE_TEST_ORDER. Then COMB_BITS only has to cover COMB_RANGE etc, instead of always being at least 256.

And I'm seeing a suspicious failure in certain configurations with that (even making sure that the precomputed table only contains non-infinity points).

I'll retry this approach to make sure I haven't missed something.

Jan 04 '22 16:01 sipa

I added a commit that permits COMB_BITS < 256 in exhaustive test mode. However, it doesn't work in a lot of configurations, and I don't understand what's causing it.

Here is a list of (blocks teeth spacing) tuples and whether they work (for both order 13 and 199): groups.txt

Jan 04 '22 23:01 sipa

Update: it appears that EXHAUSTIVE_TEST_ORDER < 2**(BLOCKS * TEETH * (SPACING - 1)) perfectly predicts which configurations work.

Jan 05 '22 15:01 sipa

Final update: I was being very dumb, and precompute_ecmult_gen just had spacing = (COMB_RANGE + blocks * teeth) / (block * teeth) hardcoded, leading to an inconsistency between the table and the actual multiplication code.

False alarm.

Jan 05 '22 19:01 sipa

Rebased.

Nov 23 '22 04:11 sipa

I believe this PR is ready for review (the discussion above was just observing some configurations not working due to a fairly stupid bug which was fixed).

Nov 23 '22 13:11 sipa

Rebased after merge of #1178. Added changelog entry.

Dec 29 '22 04:12 sipa

Rebased after #1187 merge.

Jan 11 '23 21:01 sipa

Rebased after merge of #1170 and #1190.

Jan 19 '23 20:01 sipa

Rebased, and added to cmake.

May 24 '23 01:05 sipa

secp256k1
secp256k1 copied to clipboard

Signed-digit multi-comb ecmult_gen algorithm

AMD Ryzen 5950X, GCC 11.2.0, default compile options

Intel Core I7-7820HQ @ 2.3 GHz, GCC 10.3.0, default compile options

ARM Cortex-A53 @ 1 GHz, GCC 9.3.0, default compile options

secp256k1 secp256k1 copied to clipboard

Signed-digit multi-comb ecmult_gen algorithm

AMD Ryzen 5950X, GCC 11.2.0, default compile options

Intel Core I7-7820HQ @ 2.3 GHz, GCC 10.3.0, default compile options

ARM Cortex-A53 @ 1 GHz, GCC 9.3.0, default compile options

secp256k1
secp256k1 copied to clipboard