
Try a non-uniform group law (e.g., for ecmult_gen)?

real-or-random opened this issue 2 years ago • 30 comments

The group law in secp256k1_gej_add_ge is uniform, i.e., it computes P + Q correctly whether P != Q or P = Q. (It still needs to handle infinity, and it has one other special case.)

It needs 7 fe mul and 5 fe sqr; the best non-uniform group law in the EFD needs one fewer sqr (https://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#addition-madd-2007-bl). Maybe it's also faster in practice. We could simply implement it and benchmark it.
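For concreteness, here is the madd-2007-bl mixed addition (Jacobian P1 plus affine P2) sketched with plain Python integers standing in for field elements. This is a sketch, not the library's fe code, and the helper names are mine; note how the formula breaks down when P1 = ±P2 (H = 0 there):

```python
# madd-2007-bl mixed addition from the EFD, over the secp256k1 field.
# Cost: 7M + 4S. NOT valid when P1 = +-P2.
P = 2**256 - 2**32 - 977  # secp256k1 field prime

def madd_2007_bl(X1, Y1, Z1, x2, y2):
    Z1Z1 = Z1 * Z1 % P                            # S
    U2 = x2 * Z1Z1 % P                            # M
    S2 = y2 * Z1 % P * Z1Z1 % P                   # M, M
    H = (U2 - X1) % P                             # H = 0 iff x-coords match
    HH = H * H % P                                # S
    I = 4 * HH % P
    J = H * I % P                                 # M
    r = 2 * (S2 - Y1) % P
    V = X1 * I % P                                # M
    X3 = (r * r - J - 2 * V) % P                  # S
    Y3 = (r * (V - X3) - 2 * Y1 * J) % P          # M, M
    Z3 = ((Z1 + H) ** 2 - Z1Z1 - HH) % P          # S
    return X3, Y3, Z3

def affine_add(p1, p2):
    # Naive affine group law (handles doubling), as an independent reference.
    (x1, y1), (x2, y2) = p1, p2
    if p1 == p2:
        l = 3 * x1 * x1 * pow(2 * y1, -1, P) % P
    else:
        l = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (l * l - x1 - x2) % P
    return x3, (l * (x1 - x3) - y1) % P

G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
```

The second operand is affine (Z2 = 1), matching the ecmult_gen setting where table entries are stored in affine form.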

Depending on the specifics of our ecmult_gen algorithm, it may be the case that this algorithm never adds P + Q with P = Q (for example, the ecmult algorithm within the MuSig-DN circuit has a similar property). Then, we could get away with the non-uniform law in ecmult_gen.

I think we shouldn't really think about this before https://github.com/bitcoin-core/secp256k1/pull/693 has been done.

real-or-random avatar Dec 23 '21 15:12 real-or-random

ISTR trying "a better" group law from EFD, many years ago, which turned out to be slower because of normalization even though it technically required fewer ops.

May be worth revisiting this after recent Dettman normalization bounds improvements though?

apoelstra avatar Dec 23 '21 15:12 apoelstra

ISTR trying "a better" group law from EFD, many years ago, which turned out to be slower because of normalization even though it technically required fewer ops.

Maybe what you're remembering is https://github.com/bitcoin-core/secp256k1/blob/master/src/group_impl.h#L274-L280 ? But this is for gej_double. (I had brought this up in https://github.com/bitcoin-core/secp256k1/pull/1033#issuecomment-999520374).

real-or-random avatar Dec 23 '21 16:12 real-or-random

Doubtless it will be faster. The most obvious use-case to me is for precomputed tables of small multiples.

Back in the long-long-ago I PR'd some CoZ formulae for precomputed tables (https://github.com/bitcoin-core/secp256k1/pull/41 - LOL at "That's the GMP inversion[...]; it will be hard to match it's performance.") and I will be looking to resurrect that as a private implementation for a separated out "precomp_table" (or some such) module.

However @sipa also noted there "I don't like relying on relatively complex reasoning to show that a code path shouldn't be reachable, even if it just contains an assert" and that was in relation to this issue of assuming P != +/-Q.

#1032 is an attempt to begin putting in place some VERIFY metadata for group elements. I think proposals that rely on, shall we say, "non-local reasoning" for correctness would ideally be accompanied by a plan for using VERIFY metadata to validate that reasoning.

One example is "small multiples reasoning". The metadata could hold e.g. an actual ge_storage value of the original point (or some other unique identifier for the overall "operation"), and an int indicating the current multiple (or e.g. 0 if it's not participating in such a scheme). Then any methods operating on these points can have suitable checks.

I think another example of the kind of reasoning people sometimes use is "odd OR even multiple of P, not exceeding half the order", and then non-uniform additions of "odd" points might be used for a ladder as long as there is an intervening double between them, with maybe one or two uniform additions at the end when "half the order" becomes a problem.

That last example is one where you'd need to be very careful in the presence of the endomorphism, but we could probably just say "Beware the endomorphism" in general here. Also notably some kinds of blinding generally don't play well with non-uniform formulae.

peterdettman avatar Dec 24 '21 13:12 peterdettman

@apoelstra I seem to recall you like endomorphism math :), so a holiday puzzle for you in relation to the above:

If I'm enumerating some non-infinity point on secp256k1, 1P, 2P, 3P..., at what point (haha) nP do I need to start worrying that I may run into nP == +/- lambda(mP) where 1 <= m <= n?

peterdettman avatar Dec 24 '21 13:12 peterdettman

I spent some time thinking about using non-uniform addition (which I assume means a rule that cannot handle doubling or cancellation correctly) in combination with signed-digit multicomb, because there seems to be a natural overlap: in every step SDMC adds a nonzero point, a multiple of G whose scalar has new bits set.

Turns out, it is not that simple (I simulated it in Python for a small group), but it is close. I suspect that for almost all SDMC configurations, only the very last point addition can result in doubling or cancelling. I think this can be proven, but doing so generically for any configuration may be nontrivial. Perhaps we'll need some whitelisted precomputed configurations for which this is proven to hold.

A downside of this approach is that it cannot start with a blinded starting point, as obviously nothing can be guaranteed about intermediary results when starting from a fully random point. So if blinding is done on the scalar, it should be counteracted by an additional (uniform) point addition at the very end. That's not actually an extra cost, if the starting point can be made infinity (so the first table lookup is just a fetch rather than an add).
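A toy version of such a simulation, in the additive group Z_n rather than on the curve and with made-up parameters (n = 211, a 2-block/2-teeth/2-spacing comb over 8 signed digits; none of these are the library's values), flags an addition whenever the accumulator is congruent to ± the table entry being added:

```python
# Toy SDMC collision simulation in the additive group Z_n (a stand-in for
# the curve group). All parameters are illustrative, not the library's.
n = 211                                   # small odd "group order"
B, BLOCKS, TEETH, SPACING = 8, 2, 2, 2    # B = BLOCKS * TEETH * SPACING

def recode(s):
    # Signed-digit recoding: s = sum(d_i * 2^i) mod n with d_i in {-1, +1},
    # via t = (s + 2^B - 1) / 2 mod n and d_i = 2*bit_i(t) - 1.
    t = (s + (1 << B) - 1) * pow(2, -1, n) % n
    return [1 if (t >> i) & 1 else -1 for i in range(B)]

def risky_additions(s):
    # Indices of additions where doubling (acc == entry) or cancellation
    # (acc == -entry) would occur; the very first addition is just a fetch.
    d, acc, risky, idx = recode(s), 0, set(), 0
    for j in reversed(range(SPACING)):
        acc = acc * 2 % n
        for b in range(BLOCKS):
            base = b * TEETH * SPACING
            entry = sum(d[base + k * SPACING + j] << (base + k * SPACING)
                        for k in range(TEETH)) % n
            if idx > 0 and (acc == entry or (acc + entry) % n == 0):
                risky.add(idx)
            acc = (acc + entry) % n
            idx += 1
    assert acc == s % n                   # the comb reproduces s correctly
    return risky

dangerous = set().union(*(risky_additions(s) for s in range(n)))
```

For this toy configuration, only the last of the four additions can ever double or cancel, matching the observation that only the final addition(s) are dangerous.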

sipa avatar Dec 24 '21 13:12 sipa

@sipa Generally the signed-digit representation extends past the order (e.g. 260 bits). You could have a rule that only the highest block can cross that threshold, then special-case the additions for the top block?

Edit: 260, not 250!

peterdettman avatar Dec 24 '21 13:12 peterdettman

Hmm, not that simple either probably, since it's top to bottom.

peterdettman avatar Dec 24 '21 13:12 peterdettman

@peterdettman I don't think it's that simple. At every point, your intermediary result can be written as sum(d_j*2^j*G) for some values d_j which are all 1,0,-1. The zero/nonzeroness is statically known, while the sign depends on the actual scalar. A cancellation/doubling occurs when the intermediary value is +- equal to the table entry being added. You can statically reason about every table entry, and trying to match it +- to signs of the existing pattern so far. If that's not possible for every table entry, you know doubling/cancelling cannot occur.

Even if you were to do a reduction mod N on the sign-recoded scalar (guaranteeing only 256 d values), you may get collisions between the existing pattern and what is being added, because the 0s in the pattern count as 1/2 in the signed recoding, and (1/2 mod N) isn't a perfectly nice number.
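Concretely (a quick sketch): since the secp256k1 order n is odd, (1/2 mod n) is (n+1)/2, a full-size scalar rather than anything small or structured:

```python
# secp256k1 group order
n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
half = pow(2, -1, n)        # the "1/2" that each 0 in the pattern contributes
assert half == (n + 1) // 2 # a ~255-bit scalar, not a perfectly nice number
assert 2 * half % n == 1
```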

sipa avatar Dec 24 '21 14:12 sipa

@sipa Recall that if you just shift the SD bits back "up" 1, then they're all now representing 0 or 1 again, and each table entry is handling strictly distinct bits, so problems should only happen once the high bits start getting added in, no?

peterdettman avatar Dec 24 '21 14:12 peterdettman

@peterdettman I'm not convinced, because the pattern of zero/nonzero bits differs between the intermediary value and the table entry being added, and the difference between these patterns translates to less nice patterns when "projected" onto pure 1,-1 values. I think it is indeed the case that you don't hit collisions until you've added almost all bits, but I believe this is mostly due to information theoretical reasons (the table entries are drawn from a small sets, the intermediary values are a small fraction of the total space until the very end, collisions are just very rare).

sipa avatar Dec 24 '21 14:12 sipa

@sipa In particular, if COMB_SPACING is larger than the excess bits of the SD recoding, then each top-level pass in _ecmult_gen (i.e. the loop over the blocks) covers < 256 bit range and could be accumulated individually with non-uniform addition, and then added (uniform) to the overall accumulator, just before the doubling for each pass. Not saying it's ideal, but that should be correct ignoring the blinding etc.

peterdettman avatar Dec 24 '21 14:12 peterdettman

@apoelstra I seem to recall you like endomorphism math :), so a holiday puzzle for you in relation to the above:

If I'm enumerating some non-infinity point on secp256k1, 1P, 2P, 3P..., at what point (haha) nP do I need to start worrying that I may run into nP == +/- lambda(mP) where 1 <= m <= n?

cc @roconnor-blockstream

real-or-random avatar Dec 24 '21 23:12 real-or-random

I conjecture that a1,b1 and a2,b2 from https://github.com/bitcoin-core/secp256k1/blob/423b6d19d373f1224fd671a982584d7e7900bc93/src/scalar_impl.h#L86-L89 are the smallest pairs of values where a1 = -lambda*b1 for some definition of smallest. And that you will get almost identical values for lambda and +/-lambda^2.

My handwavy argument is that if smaller values existed, we would be using them instead.
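The conjectured relation itself is easy to check numerically; a quick sketch with the constants copied from scalar_impl.h (the variable names are mine):

```python
# secp256k1 group order, lambda, and the split constants from scalar_impl.h.
n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
lam = 0x5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72
a1, b1 = 0x3086D221A7D46BCDE86C90E49284EB15, -0xE4437ED6010E88286F547FA90ABFE4C3
a2, b2 = 0x114CA50F7A8E2F3F657C1108D9D44CFD8, 0x3086D221A7D46BCDE86C90E49284EB15

assert pow(lam, 3, n) == 1         # lambda is a nontrivial cube root of 1 mod n
assert (a1 + lam * b1) % n == 0    # a1 = -lambda * b1  (mod n)
assert (a2 + lam * b2) % n == 0    # and likewise for (a2, b2)
```

Note that (a2, b2) = (a1 - b1, a1), so the second relation follows from the first together with lambda^2 + lambda + 1 = 0 (mod n).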

roconnor-blockstream avatar Dec 25 '21 15:12 roconnor-blockstream

@peterdettman I wrote a simulator for the existing code that can determine for every addition in the SDMC algorithm which ones risk triggering doubling/cancellation.

For all the choices on the efficiency frontier from https://github.com/bitcoin-core/secp256k1/pull/693#issuecomment-553733708, only very small ones (1 block, 1-2 teeth) have more than 1 addition that needs to deal with doubling/cancelling. Assuming my code is right, of course. Choices with teeth=1 or spacing=1 seem often bad in this regard (e.g. 32 blocks of 8 teeth has 16 possibly-doubling additions), but excluding those two, no configuration has more than 2 possibly-doubling ones.

sipa avatar Dec 25 '21 22:12 sipa

Okay, sometimes it helps to read the literature.

It needs 7 fe mul and 5 fe sqr, the best non-uniform group law in EFD needs one fewer sqr (hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html#addition-madd-2007-bl).

Renes, Costello and Batina (EUROCRYPT 2016) provide a complete, exception-free addition law with 11M + 0S (Algorithm 8 and page 12). It may still be the case that add-2007-bl is faster but it's not clear.

This algorithm is also highly relevant to #1033.

And it's even more relevant to verification speed. See the discussion on gej_add_var on page 4... We can save 4S, which means we go from 12M+4S to 12M+0S (!!!). @sipa Didn't they contact you back then?
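For reference, here is a sketch of the paper's complete addition law for a = 0 curves (the fully projective variant; Algorithm 8 is the 11M mixed version), in homogeneous coordinates (X:Y:Z) with plain Python integers, cross-checked against a naive affine law. Function names are mine:

```python
# Complete addition for y^2 = x^3 + b with a = 0 (Renes-Costello-Batina 2016),
# homogeneous coordinates; no branches, handles P = Q and P = -Q uniformly.
P = 2**256 - 2**32 - 977   # secp256k1 field prime
B3 = 3 * 7                 # 3*b for secp256k1

def rcb_add(p1, p2):
    X1, Y1, Z1 = p1
    X2, Y2, Z2 = p2
    t0, t1, t2 = X1 * X2 % P, Y1 * Y2 % P, Z1 * Z2 % P
    t3 = ((X1 + Y1) * (X2 + Y2) - t0 - t1) % P    # X1*Y2 + X2*Y1
    t4 = ((Y1 + Z1) * (Y2 + Z2) - t1 - t2) % P    # Y1*Z2 + Y2*Z1
    t5 = ((X1 + Z1) * (X2 + Z2) - t0 - t2) % P    # X1*Z2 + X2*Z1
    u, v = (t1 - B3 * t2) % P, (t1 + B3 * t2) % P
    X3 = (t3 * u - B3 * t4 * t5) % P
    Y3 = (u * v + 3 * B3 * t0 * t5) % P
    Z3 = (v * t4 + 3 * t0 * t3) % P
    return X3, Y3, Z3

G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8, 1)
```

Doubling goes through the exact same code path, and P + (-P) comes out as the point at infinity (Z = 0).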

real-or-random avatar Dec 26 '21 00:12 real-or-random

@real-or-random Woah! No, that's news to me. Just skimmed the paper, and it seems it is using homogeneous coordinates, not Jacobian ones? That may require rewriting all of the group laws...

sipa avatar Dec 26 '21 03:12 sipa

@real-or-random I'm sure I've seen this paper before, and I would have thought it was already discussed (perhaps @gmaxwell brought it to our attention?), but am not sure. Anyway, some notes to curb the enthusiasm a little:

  • Requires a separate homogeneous coordinates group element. Not a problem per se, just add secp256k1_geh and go from there, but will be extra code and probably will need, besides geh-specific formulae, several conversion methods.
  • Relevant to batch verification; single-signature verification (Strauss) uses _gej_add_ge_var. This is also, I think, why the paper didn't make a splash here in 2016: the ecmult_multi stuff is relatively recent.
  • Incurs an expensive doubling formula, which will offset potential gains from faster addition. In some scenarios doubling is only a minority of operations, so it may work out. For a run of, say, 4 or 5 doublings it might be possible to sidestep into Jacobian coordinates and back to mitigate the damage.
  • Formulae use the curve b coordinate; this doesn't play well with our isomorphism (effective-affine) tricks. It's not insurmountable mathematically, but those "mul by small multiple of b" counts would at least become full field multiplications if b' -> b.Z^6 for some arbitrary Z.

Haven't (re-)read in detail yet, so it's quite possible there's still gold there that I haven't noticed. I think it would still be informative to put together a branch where we add secp256k1_geh and try to use it in the ecmult_multi stuff, or wherever seems most promising to people.

peterdettman avatar Dec 26 '21 06:12 peterdettman

Oh indeed, I missed that they use homogeneous coordinates (I just saw the Z coord and assumed jacobian). (I had this feeling that this is too good to be true but I didn't see the issue. :D)

And yes, sure, when it comes to variable-time ecmults, it's relevant only to Pippenger's algorithm. But in this case it could be beneficial if we compute the doublings in jacobian coordinates (a bunch of doublings) and then do the main part in homogeneous coordinates (additions). See also Remark 3 (p. 16) in the paper. (edit: Hmmm, maybe we'd need to keep the buckets in homogeneous coordinates too.... It's not clear right now to me, we will need to try.)

Formulae use the curve b coordinate; this doesn't play well with our isomorphism (effective-affine) tricks. It's not insurmountable mathematically, but those "mul by small multiple of b" counts would at least become full field multiplications if b' -> b.Z^6 for some arbitrary Z.

Right, but we only use those tricks in Strauss's algorithm. But yeah, this is also bad for the exhaustive tests, I guess.

real-or-random avatar Dec 26 '21 10:12 real-or-random

I wrote a simulator for the existing code that can determine for every addition in the SDMC algorithm which ones risk triggering doubling/cancellation.

It sounds promising. I'm not quite clear on how the 32 blocks/8 teeth thing has so many danger points, but my mind is kind of on other stuff at the moment.

Do we see any way that the endomorphism can be useful in signing? I see at least that you could split the scalar and use half the blocks e.g. 2B * 5T * 13S to cover 130 bits (unfortunately this probably needs skew-handling like ecdh). This gains nothing if you need 2 separate sets of tables anyway, but it halves memory usage if you construct the endo-points on-the-fly instead of the second set of tables (in this 2/5/13 config that's an extra 26 field muls). Probably that doesn't gain much given that you can probably just tweak the SDMC config to use less memory at a similar performance cost anyway. I guess if a given config has lots of doublings (big spacing), then half-ish of them could be saved instead. Anything else?

peterdettman avatar Dec 26 '21 12:12 peterdettman

@peterdettman I'm sure I've had discussions about this, but I can't find it on the issue tracker. Yes, I think using the endomorphism may give another trade-off dimension to ecmult_gen. As you say, it won't speed things up in absolute terms, but you could +- halve the table size, split the computation in two, and do lambda-multiplies on table fetches in the second part.
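The trade-off rests on the endomorphism being almost free to apply: lambda*(x, y) = (beta*x, y), so each table fetch in the second half costs one extra field multiplication. A quick sketch checking this, with the constants copied from libsecp256k1's source (the naive group-law helpers are mine):

```python
# lambda*(x, y) == (beta*x, y) on secp256k1. Constants from libsecp256k1.
P = 2**256 - 2**32 - 977
LAM = 0x5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72
BETA = 0x7AE96A2B657C07106E64479EAC3434E99CF0497512F58995C1396C28719501EE
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def add(p1, p2):
    # Naive affine group law; None stands for the point at infinity.
    if p1 is None: return p2
    if p2 is None: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0: return None
    if p1 == p2:
        l = 3 * x1 * x1 * pow(2 * y1, -1, P) % P
    else:
        l = (y2 - y1) * pow(x2 - x1, -1, P) % P
    x3 = (l * l - x1 - x2) % P
    return x3, (l * (x1 - x3) - y1) % P

def mul(k, p):
    # Naive double-and-add scalar multiplication (MSB first).
    r = None
    for bit in bin(k)[2:]:
        r = add(r, r)
        if bit == '1':
            r = add(r, p)
    return r
```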

sipa avatar Dec 26 '21 14:12 sipa

And yes, sure, when it comes to variable-time ecmults, it's relevant only to Pippenger's algorithm. But in this case it could be beneficial if we compute the doublings in jacobian coordinates (a bunch of doublings) and then do the main part in homogeneous coordinates (additions). See also Remark 3 (p. 16) in the paper. (edit: Hmmm, maybe we'd need to keep the buckets in homogeneous coordinates too.... It's not clear right now to me, we will need to try.)

Fwiw, I sketched a modified variant of Pippenger's algorithm that uses homogeneous coordinates in the main loop (without actually implementing the group law): https://github.com/real-or-random/secp256k1/commit/dddbf8129f23b434157446b5a4c2f3d2636e8fc5

I think it's promising...

real-or-random avatar Dec 26 '21 15:12 real-or-random

@real-or-random Looks good, yes. I think the final few lines are a bit off though: buckets[0] is missing a conversion, and also running_sum, or maybe it's as easy to add buckets[0] after running_sum is converted.

peterdettman avatar Dec 26 '21 16:12 peterdettman

https://github.com/sipa/secp256k1/commits/202112_combine_ecmult_gen (=master + #1058 + #1056 + #1033 + #1028 + incomplete addition formula): 12.3 us per schnorrsig (master is 17.9 us).

sipa avatar Dec 31 '21 22:12 sipa

sipa/secp256k1@202112_combine_ecmult_gen (commits) (=master + #1058 + #1056 + #1033 + #1028 + incomplete addition formula): 12.3 us per schnorrsig (master is 17.9 us).

Which law did you try? The one I mentioned in the initial comment needs 7M+4S but the one you implemented in your branch needs 8M+3S.

real-or-random avatar Jan 01 '22 12:01 real-or-random

I just copied our vartime code, without the var parts.

It's easy to plug in something else for testing.

sipa avatar Jan 01 '22 12:01 sipa

The 7M + 4S one requires calculating 2.Z1.H as (Z1+H)^2 - Z1^2 - H^2, which is a well-known trick, but hasn't tended to speed things up for us (especially if we only want Z1.H because of halving). For me S ~ 0.8M, and the alternative formula requires extra additions, sometimes an extra negation, sometimes a normalization, sometimes extra locals or longer lifetimes etc.
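The trick in question is just the identity 2*Z1*H = (Z1+H)^2 - Z1^2 - H^2 (with Z1^2 and H^2 already available in the formulas), trading a mul for a square plus additions. A two-line sanity check:

```python
# Trade 1M for 1S plus additions; Z1^2 and H^2 are free in context.
p = 2**256 - 2**32 - 977
z, h = 0x123456789ABCDEF, 0xFEDCBA987654321   # arbitrary field elements
assert 2 * z * h % p == ((z + h) ** 2 - z * z - h * h) % p
```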

However, even if it only breaks even on 64-bit, it could easily be a speedup on 32-bit, where linear operations are significantly cheaper relative to mul/sqr.

peterdettman avatar Jan 01 '22 13:01 peterdettman

Questions like that may become more complicated with #967 too (as it introduces a new field approach with very different add/mul/normalize tradeoffs, and an independent magnitude-like dimension).

sipa avatar Jan 01 '22 13:01 sipa

The Renes–Costello–Batina formulas seem like a great security improvement, since they are exception-free and remove the if-branching for zero checks, which could improve constant-timeness. Fun fact: 2 years ago they were 20% slower than the classic formulas in js (https://github.com/paulmillr/noble-secp256k1/issues/5), but const-time doesn't matter for js.

paulmillr avatar Jan 11 '22 03:01 paulmillr

WIP simulation code to determine which point additions inside the SDMC algorithm risk triggering doubling (P+P) or cancellation (P + -P): https://gist.github.com/sipa/868c39e29af9e8baf22845c8af2d316d. It would appear that for really all feasible (blocks,teeth,spacing) combinations, either only the very last addition, or the last two additions, risk running into doubling/cancellation.

sipa avatar Nov 30 '22 18:11 sipa

I believe that our current ecmult_const algorithm cannot trigger doubling or cancellation in the first 31 iterations of the ladder (out of 32).

During the algorithm, the scalar $a$ is converted into $x + \lambda y$, where $(x,y) \in [1-2^{128},2^{128}-1]^2$. Those $x$ and $y$ are then written in wNAF-like signed-digit form:

$$x = \sum_{i=0}^{31} x_i 2^{4i} \textrm{, where } x_i \in \{-15,-13,\ldots,15\}$$

And analogously for $y$ (which possibly needs an offset of $1$, or $\lambda$, or both, which will be corrected after the algorithm loop, and isn't relevant here).

The algorithm for computing $a\cdot P$ then proceeds as:

  • Start with result point $R = 0$.
  • For $i = 0 \ldots 31$:
    • $R \leftarrow 2^4 R \quad$ (no-op at $i=0$).
    • $R \leftarrow R + x_{31-i} P$.
    • $R \leftarrow R + y_{31-i} \lambda P$.

It's easy to see that the result of both point additions in the loop above is of the form $(a + \lambda b) P$, where $(a,b) \in [1-2^{4(i+1)},2^{4(i+1)}-1]^2 \setminus \{(0,0)\}$.

Up to $i=30$, no such $(a,b)$ exists for which $a + \lambda b = 0\ (mod\ n)$. That means that neither of the two point additions can result in cancellation. And since the digit set is symmetric, $x_{31-i}$ could equally well have been negated; knowing that no $R + x_{31-i}P = 0$ occurs, we also know that $R + (-x_{31-i}P) = 0$ is impossible, so doubling is impossible as well.

To show that no non-zero $(a,b)$ in $[1-2^{124},2^{124}-1]^2$ with $a + \lambda b = 0\ (mod\ n)$ exists, observe that the solutions $(a,b)$ form a lattice, spanned by $(n,0)$, $(0,n)$, and $(\lambda, -1)$. The minimal vector in that lattice under the $\|\cdot\|_\infty$ norm is $(0x3086d221a7d46bcde86c90e49284eb15, -0xe4437ed6010e88286f547fa90abfe4c3)$, whose norm is larger than $2^{124}-1$.

sipa avatar Dec 03 '22 16:12 sipa