secp256k1
                                
                                 secp256k1 copied to clipboard
                                
                                    secp256k1 copied to clipboard
                            
                            
                            
                        Signed-digit multi-comb ecmult_gen algorithm
A third iteration of the signed-digit multi-comb ecmult_gen algorithm (earlier attempts: #693, and #546 by Peter Dettman). Short summary:
- A new constant-time point multiplication algorithm with precomputation (so only used for multiply with G).
- Based on section 3.3 of https://eprint.iacr.org/2012/309 by Mike Hamburg.
- Configurable through two parameters: COMB_BLOCKS and COMB_TEETH
- Currently only 3 predefined configurations reachable through ./configure. All three are included in precomputed_ecmult_gen.c and tested in CI.
- --with-ecmult-gen-kb=2: 2 kB table with COMB_BLOCKS=2 COMB_TEETH=5
- --with-ecmult-gen-kb=22: 22 kB table with COMB_BLOCKS=11 COMB_TEETH=6
- --with-ecmult-gen-kb=86: 86 kB table with COMB_BLOCKS=43 COMB_TEETH=6
 
- Many more configurations can be reached by manually setting the macros. These are not tested.
 
- Currently only 3 predefined configurations reachable through ./configure. All three are included in precomputed_ecmult_gen.c and tested in CI.
Compared with the previous PR #693:
- Updated to the new static-precomputation-only model (#893).
- Just 3 curated configurations reachable through configure.
- Removed some optimizations that do not matter (much).
- Do blinding through an final correction add rather than an initial start point, which may later permit usage of incomplete addition formulae (#1051).
- The recoding of the input scalar to signed bit representation is done slightly differently, which needs fewer special cases.
Benchmarks:
AMD Ryzen 5950X, GCC 11.2.0, default compile options
Using ./autogen.sh && ./configure --enable-experimental --enable-module-schnorrsig && make clean && make -j check && SECP256K1_BENCH_ITERS=1000000 ./bench schnorrsig_sign
master:       schnorrsig_sign               ,    18.1       ,    18.1       ,    18.1    
pr1058 kb=2:  schnorrsig_sign               ,    17.9       ,    17.9       ,    17.9 
pr1058 kb=22: schnorrsig_sign               ,    15.1       ,    15.1       ,    15.1    
pr1058 kb=86: schnorrsig_sign               ,    14.4       ,    14.4       ,    14.4 
Intel Core I7-7820HQ @ 2.3 GHz, GCC 10.3.0, default compile options
Using ./autogen.sh && ./configure --enable-experimental --enable-module-schnorrsig && make clean && make -j check && SECP256K1_BENCH_ITERS=1000000 ./bench schnorrsig_sign
master:       schnorrsig_sign               ,    38.8       ,    38.9       ,    38.9
pr1058 kb=2:  schnorrsig_sign               ,    39.4       ,    39.4       ,    39.4
pr1058 kb=22: schnorrsig_sign               ,    33.2       ,    33.3       ,    33.3
pr1058 kb=86: schnorrsig_sign               ,    32.4       ,    32.4       ,    32.4
ARM Cortex-A53 @ 1 GHz, GCC 9.3.0, default compile options
Using ./autogen.sh && ./configure --enable-experimental --enable-module-schnorrsig && make clean && make -j check && SECP256K1_BENCH_ITERS=100000 ./bench schnorrsig_sign
master:       schnorrsig_sign               ,   249.0       ,   249.0       ,   249.0
pr1058 kb=2:  schnorrsig_sign               ,   250.0       ,   250.0       ,   250.0 
pr1058 kb=22: schnorrsig_sign               ,   200.0       ,   200.0       ,   200.0
pr1058 kb=86: schnorrsig_sign               ,   192.0       ,   192.0       ,   192.0
I occurs to me that we could actually avoid the cost of doing the scalar halving at ecmult_gen time, by instead having precomputed tables with multiples of G/2 instead of G.
I occurs to me that we could actually avoid the cost of doing the scalar halving at ecmult_gen time, by instead having precomputed tables with multiples of G/2 instead of G.
The scalar halving also ensures that the low bit (the shifted-away bit) is 0.
I don't think that matters? Even/odd has no special meaning when working modulo a prime.
The PR currently uses the bits of scalar (input + 2^COMB_BITS - 1 - blind) * 2^-1 to select multiples of G to add together.
My suggestion is that instead it could use the bits of (input + 2^COMB_BITS - 1 - blind) to select multiples of G/2 to add together.
It seems I'm wrong, but I'm confused why!
I think confusion is creeping in since after the halving the scalar isn't a scalar value anymore; it's in signed-digit form, which can only represent an odd value. In particular the bits of a scalar s in signed-digit form represent the scalar value 2*s+1. I think you should also be careful not to reason modulo the order in signed-digit form.
Edit: Actually, even that might not be quite complete since the high bit is being treated specially here.
It works, but you need to use the bits of input - blind + (2^COMB_BITS - 1)/2 instead. That's what you get when you substitute 2*(input-blind) for e in the formula in the paper (because now we're trying to compute input*G = 2*(input-blind)*(G/2) + blind*G).
Oh I see, using double the input and the blind (of G/2) lets the halving be moved to precomputation. Nice.
Updated to use the avoid-halving-scalar trick.
As an exercise I added a 2 blocks, 4 teeth configuration:
- add {2,4} to CONFIGS table in precompute_ecmult_gen.c; nice and declarative.
- make precompute_ecmult_gen and run to get new precomputed_ecmult_gen.c which correctly includes new {2,4} option.
- modify configure.ac to support new "1" option for ecmult-gen-kb
- configure --with-ecmult-gen-kb=1, confirm in libsecp256k1-config.h that COMB_BLOCKS, COMB_TEETH are correct.
- make clean, make, tests to confirm it's working
This all worked without isues and was reasonably minimal effort. This particular example also satisfied me that there is no issue with combs covering exactly 256 bits.
@peterdettman FWIW, the easiest way of achieving the same would be:
- Modify configure.ac to support a new option
- Run configure with that option
- make clean-precomp && make normally
Because precompute_ecmult_gen is automatically built & run when a precomputed_*.c file is missing, and because precompute_ecmult_gen will always include the table for whatever the configuration is.
Having some confusion over possible bad interactions between modular reduction and signed-digit form, I wanted to reason things out in reverse.
For point P of order N, scalar s in [0,N), and an L-bit comb (2^L > N), let C(s,P) be the value calculated by our comb, which considers the bits of s, zero-extended to L bits as signed digits. Then:
    C(s,P) == (2.s - (2^L - 1)) * P
Therefore in order to use the comb to calculate k * G, we solve for the scalar t to use:
    k * G == C(t,G) == (2.t - (2^L - 1)) * G 
=>  k == 2.t - (2^L - 1) mod N
=>  t == (k + (2^L - 1))/2 mod N
Can we skip that halving and use G/2 instead?
    C(2t,G/2) == (4.t - (2^L - 1)) * G/2
              == (2.t - (2^L - 1)/2) * G
              != C(t,G) unless 2^L == 1 mod N
So no, but let's back up a step and ask what scalar u to use in the comb to calculate k * G as 2.k * (G/2):
    2.k * (G/2) == C(u,G/2) == (2.u - (2^L - 1)) * G/2
=>  2.k == 2.u - (2^L - 1) mod N
=>  u == k + (2^L - 1)/2 mod N
and since L is constant, the halving is now only needed in the precomputation.
Scalar blinding (using b * G == B):
    k * G == (k - b) * G + B == 2.(k - b) * (G/2) + B == C(k - b + (2^L - 1)/2, G/2) + B
where -b + (2^L - 1)/2 can be precomputed.
Nice, that's much better explained than my current comments. I'll try to include it.
Although the above satisfies me mathematically, I would like to see an explicit test case where the comb_offset (2^L - 1)/2 causes a modular reduction (relative to k-b). e.g. arrange for k-b == N + 1 - comb_offset. I hope that's not too painful, but otherwise random testing seems unlikely to hit such a case.
@peterdettman Making this change causes instant failure during the tests, at least:
--- a/src/ecmult_gen_impl.h
+++ b/src/ecmult_gen_impl.h
@@ -78,7 +78,7 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
      */
 
     /* Compute the scalar (gn + ctx->scalar_offset). */
-    secp256k1_scalar_add(&tmp, &ctx->scalar_offset, gn);
+    CHECK(!secp256k1_scalar_add(&tmp, &ctx->scalar_offset, gn));
     /* Convert to recoded array. */
     for (i = 0; i < 8; ++i) {
         recoded[i] = secp256k1_scalar_get_bits(&tmp, 32 * i, 32);
Including a test case that results in tmp=0 or tmp=1 or so may be useful though.
scalar_offset includes the randomly-distributed blind, so there will be modular reductions. This isn't concerning at all because we can reason about the blind independently of the comb (just input offset + output offset).
However the comb offset is small: 2^(COMB_BITS-256) * (2^256 - N). So I would like a test that involves the comb_offset itself causing a modular reduction. The math checks out of course, but an explicit test can't hurt.
Actually, I think (2^L - 1)/2 mod N is only small (128 bits) if L == 256.
For the 43/6/1 configuration it is 0x80000000000000000000000000000001e7f9b4a5f9130fa66044722cc7ae9e1e For the 11/6/4 configuration it is 0x8000000000000000000000000000000987e0873ddd5f4e3fe1563adfe6691698
Including a test case that results in
tmp=0ortmp=1or so may be useful though.
So, these and maybe -1 would be enough. I will check a 256-bit config manually if they are added.
@peterdettman I've incorporated your derivation in the comments in ecmult_gen_impl.h, and added a test case for recoded={-1,0,1}.
I've worked on an additional change that introduces a COMB_RANGE which is normally 256, but in exhaustive test mode corresponds to the number of bits in EXHAUSTIVE_TEST_ORDER. Then COMB_BITS only has to cover COMB_RANGE etc, instead of always being at least 256.
And I'm seeing a suspicious failure in certain configurations with that (even making sure that the precomputed table only contains non-infinity points).
I'll retry this approach to make sure I haven't missed something.
I added a commit that permits COMB_BITS < 256 in exhaustive test mode. However, it doesn't work in a lot of configurations, and I don't understand what's causing it.
Here is a list of (blocks teeth spacing) tuples and whether they work (for both order 13 and 199): groups.txt
Update: it appears that EXHAUSTIVE_TEST_ORDER < 2**(BLOCKS * TEETH * (SPACING - 1)) perfectly predicts which configurations work.
Final update: I was being very dumb, and precompute_ecmult_gen just had spacing = (COMB_RANGE + blocks * teeth) / (block * teeth) hardcoded, leading to an inconsistency between the table and the actual multiplication code.
False alarm.
Rebased.
I believe this PR is ready for review (the discussion above was just observing some configurations not working due to a fairly stupid bug which was fixed).
Rebased after merge of #1178. Added changelog entry.
Rebased after #1187 merge.
Rebased after merge of #1170 and #1190.
Rebased, and added to cmake.