bingcd
bingcd copied to clipboard
gf_mul not normalized
Hello,
gf a,b,c;
a.v0 = 0x29D6DEAB9CB8606C;
a.v1 = 0x788DD408827B63FD;
a.v2 = 0x3CFC744C965683E6;
a.v3 = 0x24E016B1490AA31A;
b.v0 = 4;
b.v1 = 0;
b.v2 = 0;
b.v3 = 0;
gf_mul(&c,&a,&b);
//gf_normalize(&c,&c);
print_gf("INV506",&c);
It seems that gf_mul does not handle the case when the result fit in 256bit but is greater than P. Without gf_normalize() the above code returns: INV506 = 0x93805AC5242A8C68F3F1D132595A0F99E237502209ED8FF4A75B7AAE72E181B0 With gf_normalize(): INV506 = 0x13805AC5242A8C68F3F1D132595A0F99E237502209ED8FF4A75B7AAE72E181C3
Anyway, your last fix is ok, The worst case I found for the moment is: L is len(a) + len(b) at the output of the 465 iterations.
L=40 (a,b)=42949672959,11 =>y = 0x0400000000000000080000000000000010000000000000002000000000000000
L=41 (a,b)=85899345918,11 =>y = 0x0800000000000000100000000000000020000000000000004000000000000000
L=42 (a,b)=171798691836,11 =>y = 0x1000000000000000200000000000000040000000000000008000000000000000
I forget to mention that I don't use the assembly code for gf_mul.
0x93805AC5242A8C68F3F1D132595A0F99E237502209ED8FF4A75B7AAE72E181B0 and 0x13805AC5242A8C68F3F1D132595A0F99E237502209ED8FF4A75B7AAE72E181C3 are actually equal modulo p; they represent the same value (and this is the correct result). In my code, field elements are represented as integers in the 0 to 2**256-1 range, and all 256-bit values are valid. The gf_add()
, gf_mul()
... functions accept such input, and, correspondingly, don't enforce outputs to be in the 0 to p-1 range. This implies that any given field element has several possible in-memory representations; this saves a few cycles here and there. To get a value in the 0 to p-1 range, you have to call gf_normalize()
explicitly (this is done, for instance, before encoding into bytes); this also explains that gf_iszero()
and gf_eq()
are not immediate (they have to account for the multiplicity of representations).
Ok thanks for clarification so that's a normal behavior and the normalization is not included gf_inv(). Concerning the max len of a and b, If I left shift the above value (0x1....) by 2 I reach L=44 , near the max, it is just a quick test. I think it ends in 506 or 505 iterations. I will try to go deeply to see if i can reach more...
If you start with input 2^254, then the algorithm will first halve that value 254 times, and when it reaches 1, a swap will occur, putting in a the value of b which is still equal to 2^255-19 at this point. Since now b = 1, the next 254 iterations will each reduce a by 1 bit (basically subtracting 1 if the value is odd, then dividing by 2, each time). It will thus take 508 iterations to reach a = b = 1. Therefore, the worst case must be at least 508 iterations. If I did not goof up (again) with the proof, no input requires more than 508 iterations, and the bound is optimal.
Yes but I'm trying to see if it is possible to make the line 11 of your algo to fail. I would like to test this because I wrote 10 years ago a divstep62 algo without a (a,b) balancing (i removed the delta variable that balance the progress of a and b , so the swap decision) and it worked rather well with a low overhead, ~9 divstep with delta balancing and ~10 divstep without balancing (in average) for a variable time implementation. I also avoided the final modmul for the normalization by performing a Montgomery reduction to divide the bezout coef at each step by 2^62, they took that in the bitcoin core PR rather than doing matrix/matrix product as described in the safegcd paper. With your method, as the balancing is exact (if the proof is right) the number of divstep should be lower and it should be possible to write a divstep62 but it will require a 2x2 32bit matrix multiplication between the 2 divstep31 or to use more register and operation to handle a full divstep62. I especially interested in finding the fastest possible inversion without constant time constraint. Thanks for your work 👍
Hello,
I implemented a 62bit var time release of your algo. On my I5 (Windows 10, Visual studio 2019), it has similar speed than safegcd but on my old xeon your method is clearly faster. I didn't have much time to investigate for the edge cases checking. I used 64bit msb for a 62 bit shift. It ends (in average) in 6.13 divstep62 against 9.00 for safegcd.
If you want to make a code review, compare implementations and make comments you can check here: https://github.com/JeanLucPons/Kangaroo/blob/e313da10d5fd3380e34c532a554ce987202d23b3/SECPK1/IntMod.cpp#L184 If you want to run it, just clone the repo, make all and run "./kangaroo -check".
Thanks :)