CGBN `cgbn_mont_sqr(r, a, modulus)` returns result

`cgbn_mont_sqr(r, a, modulus)` returns result > modulus

Open sethtroisi opened this issue 3 years ago • 4 comments

I have found a case with N=971 bits where cgbn_mont_sqr(r, a, n) returns (a^2 % n) + n

I made a minimal repro branch here

Would you be willing to assist in debugging? Any ideas what might be wrong?

Aug 18 '21 21:08 sethtroisi

I increased the reproducibility by setting

a = n - bn2mont(1)

With BITS=1024 TPI={8,16,32}, this fails for all 2^n-1, n >= 683 With BITS=512 TPI={8,16}, this fails for all 2^n-1, n>=342 With BITS=256 TPI=8, this fails for all 2^n-1 >= 171 This is probably n = 2/3 BITS

With BITS=512, TPI=32, seems to work on the tested inputs With BITS=256, TPI={16,32}, seems to work on the tested inputs With BITS=128, TPI={8,16,32}, seems to work on the tested inputs

Aug 18 '21 22:08 sethtroisi

This error is present if I use core_mul_xmad.cu or core_mul_imad.cu

Aug 18 '21 22:08 sethtroisi

I looked at testing this with unit_test but I found that experience hard. Is it expected that make takes ~90 seconds for tester? Is there a faster way to write a single unit test that tests cpu(mpz) vs gpu(CGBN)?

Aug 19 '21 00:08 sethtroisi

Hi Seth,

Thanks for reporting. You have found a problem here, but it's more of a documentation issue than a coding issue. The mont_mul and mont_sqr routines use a redundant representation, and only subtract N if there was a carry out -- this is faster because it saves a comparison on each mont_sqr and mont_mul. The mont2bn routine is the final step and ensures that the result is normalized to N. I think the implementation is right, but it should be documented better about what's going on under the covers.

I'll leave this issue open until I update the docs.

Thank you, Niall

Aug 19 '21 09:08 nemmart

CGBN CGBN copied to clipboard

`cgbn_mont_sqr(r, a, modulus)` returns result > modulus

CGBN
CGBN copied to clipboard