crypto-bigint Binary GCD

This PR introduces an implementation of Optimized Binary GCD. Ref: Pornin, Algorithm 2.

Upsides to this technique:

it is up to 27x faster than the gcd algorithm currently implemented in crypto_bigint (see below) (really, it just has a different complexity bound).
does not need UNSAT_LIMBS
it is actually constant time (unlike the current implementation, which is sneakily vartime in the maximum of the bitsizes of the two operands).

Benchmark results

Word = u64

limbs	gcd (vt)	gcd (ct)	new_gcd (ct)
2	10.687 µs	20.619 µs	3.6090 µs
4	29.121 µs	56.433 µs	7.1124 µs
8	99.819 µs	195.02 µs	16.184 µs
16	359.39 µs	710.26 µs	44.294 µs
32	1.6804 ms	3.3097 ms	136.49 µs
64	6.9717 ms	13.028 ms	494.16 µs
128	29.099 ms	57.325 ms	2.3335 ms
256	143.22 ms	244.89 ms	8.7722 ms

Jan 30 '25 15:01 erik-3milabs

@tarcieri To the best of my understanding this is constant-time code, but I will need some assistance verifying that

the compiler really compiles it into something resembling constant-time code that
a CPU truly executes this in constant time (and does not have some tricks up its sleeve that I don't yet know about...)

Jan 30 '25 16:01 erik-3milabs

After noticing the issue with #753, I reran the benchmarks here too. The only thing I noticed was that removing the old vartime benchmark improved the old const-time benchmarks, primarily for Uints with few limbs. The table with the new results is presented below.

Moreover, I ran each benchmark thrice, and observed that

the new_gcd benchmark results are as stable as the old gcd results, and
the confidence interval of the new results are of the same order of magnitude as the old results (adjusted for the shorter running time).

Based on this, I'm quite confident that the bingcd satisfies the constant time requirement at least as well as the current gcd implementation.

limbs	gcd (ct)	new_gcd (ct)	factor
2	6.9073 µs	3.6055 µs	1.9x
4	42.950 µs	7.1356 µs	6.0x
8	63.511 µs	16.244 µs	3.9x
16	633.55 µs	47.897 µs	13.2x
32	3.2049 ms	136.21 µs	23.5x
64	12.787 ms	498.34 µs	25.7x
128	56.710 ms	2.3501 ms	24.1x
256	244.89 ms	8.7418 ms	28.0x

Evidently, this technique predominantly shines for larger Uints. I would not be surprised if, with some tweaks, the bingcd performance for smaller cases could be further improved.

Jan 31 '25 12:01 erik-3milabs

Update:

I refactored the code, improving readability
I introduced the "small" version of the algorithm that achieves slightly faster results for smaller instances (i.e., uints with fewer limbs) than the original "large" version.

Benchmarks below. Note:

The original gcd algorithm is unexpectedly slow for 4 limbs. Any clue what is up with that?
Based on these results, I have bingcd select the "small" variant up to 7 limbs and the "large" beyond that. Anything else I can do to improve this?

limbs	gcd	bingcd	factor	bingcd (small)	bingcd (large)
1	2.6475 µs	363.93 ns	7.3x	342.17 ns	1.1385 µs
2	6.9068 µs	1.3533 µs	5.1x	1.2378 µs	3.2917 µs
3	12.677 µs	2.7281 µs	4.6x	2.5826 µs	5.5247 µs
4	42.896 µs	4.7267 µs	9.1x	4.5286 µs	6.6132 µs
5	27.987 µs	7.1246 µs	3.9x	6.8353 µs	10.100 µs
6	38.466 µs	10.136 µs	3.8x	9.9228 µs	11.977 µs
7	49.654 µs	13.621 µs	3.6x	13.614 µs	14.921 µs
8	63.467 µs	15.749 µs	4.0x	17.757 µs	15.545 µs
--	--	--	--	--	--
16	632.80 µs	44.272 µs	14.3x	76.077 µs	44.593 µs
32	3.1699 ms	137.51 µs	23.1x	447.11 µs	137.29 µs
64	12.827 ms	489.57 µs	26.2x	1.7868 ms	499.83 µs
128	56.750 ms	2.3381 ms	24.3x	8.7670 ms	2.3186 ms
256	243.97 ms	8.7289 ms	27.9x	34.823 ms	8.7274 ms

Feb 03 '25 12:02 erik-3milabs

There is still significant room for improvement for the "large" variant. In each of its (2*Uint::BITS-1)/62 outer loop iterations, it multiplies a and b by a I64 matrix that shrinks their size by at least 62 bits. After iteration X, we thus know that either

a and b are at most $BITS - 62X$ bits long, or
the matrix is the unit matrix.

Assuming the first case, we can thus improve this multiplication by not considering some of the top limbs of a and b. Over the full algorithm, this would equal a ~1.8x reduction for time spent on those multiplications.

Aside from these multiplications, the "large" variant is linear in Uint::BITS. Interpolating from the benchmarks, the fraction of the time spent on these multiplications is roughly estimated as $$L/(L+6.85)$$ for $L$ limbs, which is already equal to ~70% for L=16 (U1024). The benefit would already be ~~30% for U1024, and increasingly more for larger limb sizes.

This would require the development of a "bounded" multiplication feature. Would this project be interested in that?

Spoiler: The upcoming xgcd feature I'm developing would enjoy the benefit ~3x.

Feb 03 '25 12:02 erik-3milabs

@tarcieri @ycscaly, I think this PR has stabilized for review (truly now :see_no_evil:). Looking forward to your reviews!

Feb 03 '25 12:02 erik-3milabs

@erik-3milabs I just set master to v0.7.0-pre in #765

This means you can now make breaking changes, such as removing the existing safegcd implementation and changing trait impls like Gcd and InvMod to use bingcd instead

Feb 14 '25 19:02 tarcieri

@erik-3milabs I just set master to v0.7.0-pre in #765

This means you can now make breaking changes, such as removing the existing safegcd implementation and changing trait impls like Gcd and InvMod to use bingcd instead

@tarcieri While I could still modify the Gcd trait to use this algorithm, this PR does not yet introduce the tools necessary to replace InvMod.

Aside from that, what else would be required to see this PR merged?

Mar 11 '25 14:03 erik-3milabs

Aah, lack of invmod support would definitely be a problem. Is it something you plan on addressing eventually? My understanding is, like safegcd, that invmod is a big part of binary GCD's intended usage.

It seems a little weird to have multiple implementations of GCD algorithms which effectively do the same thing, though to completely replace safegcd in addition to invmod you'd also need support for boxed types (though with const_mut_refs stable it should be a lot easier to share an implementation).

Also seems it needs a rebase due to upstream changes.

Mar 11 '25 15:03 tarcieri

Aah, lack of invmod support would definitely be a problem. Is it something you plan on addressing eventually? My understanding is, like safegcd, that invmod is a big part of binary GCD's intended usage.

You're right. This PR only introduces the gcd algorithm; PR #761 extends the algorithm into xgcd. Stripping some things from the xgcd algorithm gives invmod. Given that I don't need invmod myself, I am not too keen on implementing it :see_no_evil:

It seems a little weird to have multiple implementations of GCD algorithms which effectively do the same thing, though to completely replace safegcd in addition to invmod you'd also need support for boxed types (though with const_mut_refs stable it should be a lot easier to share an implementation).

I agree that having two algorithms is overkill. Let me see about implementing this for Boxed<X> as well.

Also seems it needs a rebase due to upstream changes.

Yeah, you're right. Let me address that right away.

Mar 14 '25 08:03 erik-3milabs

I'm incredibly interested in this PR as someone who:

Needs a constant-time GCD
Has >50% of my runtime spent on safegcd::divsteps alone right now

I actually need an xgcd (which #761 solves), it's just trivial to inefficiently implement xgcd given gcd (by manually calculating ((a / g) % (b / g))**-1 for the first coefficient and solving from there for the second). I'd love to see this merged ASAP accordingly and want to ask if anything is a current blocker I can help with.

Apr 02 '25 04:04 kayabaNerve

Whoops, fat fingered the close button trying to reply.

While this is probably in OK shape as is I would preferably like to see it as the only GCD implementation, rather than having two.

I think it’s fine to retain safegcd for inversions until this can be extended to support inversions as well, but I’d really prefer for there to be one algorithm used for GCD for both boxed and unboxed types.

Apr 02 '25 04:04 tarcieri

Heard, especially if the goal of 0.7.0 is a long-lived release without continued maintenance of safegcd.

modinv should be trivial as just the x coefficient from xgcd, Some if gcd == 1, None otherwise, right?

Thanks for the quick reply and for letting me ensure I have context!

Apr 02 '25 07:04 kayabaNerve

@erik-3milabs do you want to look into making this the primary/only GCD implementation?

Otherwise I can potentially do that as a followup

Apr 07 '25 13:04 tarcieri

Hmm, merging this broke the build due to a few deprecations, but hopefully we can get those fixed up easily

Jun 15 '25 18:06 tarcieri