secp256k1 Native jacobi symbol algorithm

Native jacobi symbol algorithm

Open sipa opened this issue 2 years ago • 8 comments

This introduces variants of the vartime divsteps-based GCD algorithm used for modular inverses to compute Jacobi symbols. Changes compared to the normal vartime divsteps:

Only positive matrices are used, guaranteeing that f and g remain positive.
An additional jac variable is updated to track sign changes during matrix computation.
There is (so far) no proof that this algorithm terminates within reasonable amount of time for every input, but experimentally it appears to almost always need less than 900 iterations. To account for that, only a bounded number of iterations is performed (1500), after which failure is returned. The field logic then falls back to using square roots to determining the result.
The algorithm converges to f=g=gcd(f0,g0) rather than g=0. To keep this test simple, the end condition is f=1, which won't be reached if started with g=0. That case is dealt with specially.

This code is currently unused, except for tests. I don't aim for it to be merged until there is a need for it, but this demonstrates its feasibility.

In terms of performance:

field_inverse: min 1.76us / avg 1.76us / max 1.78us
field_inverse_var: min 0.991us / avg 0.993us / max 0.996us
field_jacobi_var: min 1.31us / avg 1.31us / max 1.31us
field_sqrt: min 4.36us / avg 4.37us / max 4.40us

while with the older (f24e122d13db7061b1086ddfd21d3a1c5294213b) libgmp based Jacobi code on the same system:

num_jacobi: min 1.53us / avg 1.54us / max 1.55us

Sep 23 '21 22:09 sipa

@robot-dreams Always nice to see new contributors in the repo. :) Just curious, are you simply helping or are you interested in getting this merged? If the latter, I'm further curious about your use case. Feel free just ignore my intrusive questions.

Nov 09 '21 08:11 real-or-random

@real-or-random Thanks for the reply! I'm simply helping and try to learn about this repo, I don't have a personal use case.

Nov 09 '21 17:11 robot-dreams

@robot-dreams Thanks for your comments!

The failure with asm=arm is expected, as Apple M1 is 64-bit ARM, while the assembly code used here is 32-bit.

Nov 09 '21 22:11 sipa

Made another change: in VERIFY mode, the upper bound on the number of posdivsteps iterations is reduced to ~750 (which is close to the median of how many iterations are needed for random input mod the field size), so that the uncomputable path is actually exercised in the tests.

Nov 09 '21 22:11 sipa

@sipa Thanks for responding to the comments!

Yes, 32-bit makes sense (in retrospect the fact that the file was called field_10x26 instead of field_5x52 should have given that away)
Great idea to reduce the number of iterations for validation!

Nov 10 '21 04:11 robot-dreams

One potential concern is that with this change, there are now 4 slightly different implementations of a similar algorithm (32/64-bit, inv/jacobi). Do you think there's a nice way to consolidate the logic, at least among a given word size? (If there's no nice way or if it's too much trouble, I'm fine with leaving it as is and relying on having good tests.)

Arguably there are 6 implementations (32/64-bit, and const/var/varpos divsteps), and they do share some of the code (in particular the matrix application), but not everything. I don't think there is an obvious way of merging more code without reducing performance, but I also haven't tried very hard.

What was the motivation for using different approaches to clear the low-order bits of g between the 32 and 64 bit implementations (8-bit lookup table vs 4 and 6-bit tricks from Hacker's Delight)?

Just dumb benchmarking. The formula was faster on x86_64 systems we tried, and the table was faster on ARMv7 systems (it's explained in one of the commit messages that introduced it).

I agree that replacing f, g = g, -f with f, g = g, f ensures that all matrix entries are always positive. But aside from actually observing that it converges to f = 1 in practice, do you have any other intuitive justification for this? (The best I can come up with is that since you're constantly dividing g by 2 and swapping the roles of f and g, you're quickly moving towards a solution even if you replace g - f with g + f in the original GCD algorithm.)

I don't, no. I do have a proof that any f>0,g>0 converges to f=g somewhere in a finite number of steps, but the upper bound is something like f*g*log2(f*g), which is obviously useless for practical purposes. The technique explained in https://github.com/sipa/safegcd-bounds completely fails for posdivsteps (doesn't converge at all). With a lot of tricks I got it to converge, but in what appears quadratic (O(log2(f)^2)), and it's only tractable to evaluate for f up to 2^20 or so.

The difficulty is that there are both steps that take f and g further apart (g /= 2) and steps that bring them closer (g = (f+g)/2), and it's hard to infer tight bounds on how many times one can happen in between two executions of the other.

So in short, no; the actual argument why this works is purely empirical; all out of a 2.5 billion random inputs (with the field size modulus) need between 630 and 921 iterations.

Is there already a writeup describing this change? If not, would it make sense to add a brief section to https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md? (If you think so, I'd be happy to give that a shot.)

I think that would be valuable, yes. I haven't done so yet, but was considering it. Feel free to take a shot at it.

Nov 10 '21 16:11 sipa

Included @robot-dreams's update to the writeup.

Nov 12 '21 16:11 sipa

ACK 53b7ce28bef8008787c1aeb202ec17ef3a3b3325

Nov 16 '21 21:11 robot-dreams

I removed a determinant check in the 64-bit jacobi code, as this new algorithm can result in determinant either 2^62 or -2^62, but the int128 code can't deal with that yet.

EDIT: fixed now, added two commits to deal with that.

Dec 10 '22 17:12 sipa

secp256k1 secp256k1 copied to clipboard

Native jacobi symbol algorithm

secp256k1
secp256k1 copied to clipboard