Binary GCD
This PR introduces an implementation of Optimized Binary GCD. Ref: Pornin, Algorithm 2.
Upsides to this technique:
- it is up to 27x faster than the
gcdalgorithm currently implemented incrypto_bigint(see below) (really, it just has a different complexity bound). - does not need
UNSAT_LIMBS - it is actually constant time (unlike the current implementation, which is sneakily vartime in the maximum of the bitsizes of the two operands).
Benchmark results
Word = u64
| limbs | gcd (vt) | gcd (ct) | new_gcd (ct) |
|---|---|---|---|
| 2 | 10.687 µs | 20.619 µs | 3.6090 µs |
| 4 | 29.121 µs | 56.433 µs | 7.1124 µs |
| 8 | 99.819 µs | 195.02 µs | 16.184 µs |
| 16 | 359.39 µs | 710.26 µs | 44.294 µs |
| 32 | 1.6804 ms | 3.3097 ms | 136.49 µs |
| 64 | 6.9717 ms | 13.028 ms | 494.16 µs |
| 128 | 29.099 ms | 57.325 ms | 2.3335 ms |
| 256 | 143.22 ms | 244.89 ms | 8.7722 ms |
@tarcieri To the best of my understanding this is constant-time code, but I will need some assistance verifying that
- the compiler really compiles it into something resembling constant-time code that
- a CPU truly executes this in constant time (and does not have some tricks up its sleeve that I don't yet know about...)
After noticing the issue with #753, I reran the benchmarks here too. The only thing I noticed was that removing the old vartime benchmark improved the old const-time benchmarks, primarily for Uints with few limbs. The table with the new results is presented below.
Moreover, I ran each benchmark thrice, and observed that
- the
new_gcdbenchmark results are as stable as the old gcd results, and - the confidence interval of the new results are of the same order of magnitude as the old results (adjusted for the shorter running time).
Based on this, I'm quite confident that the bingcd satisfies the constant time requirement at least as well as the current gcd implementation.
| limbs | gcd (ct) | new_gcd (ct) | factor |
|---|---|---|---|
| 2 | 6.9073 µs | 3.6055 µs | 1.9x |
| 4 | 42.950 µs | 7.1356 µs | 6.0x |
| 8 | 63.511 µs | 16.244 µs | 3.9x |
| 16 | 633.55 µs | 47.897 µs | 13.2x |
| 32 | 3.2049 ms | 136.21 µs | 23.5x |
| 64 | 12.787 ms | 498.34 µs | 25.7x |
| 128 | 56.710 ms | 2.3501 ms | 24.1x |
| 256 | 244.89 ms | 8.7418 ms | 28.0x |
Evidently, this technique predominantly shines for larger Uints. I would not be surprised if, with some tweaks, the bingcd performance for smaller cases could be further improved.
Update:
- I refactored the code, improving readability
- I introduced the "small" version of the algorithm that achieves slightly faster results for smaller instances (i.e., uints with fewer limbs) than the original "large" version.
Benchmarks below. Note:
- The original
gcdalgorithm is unexpectedly slow for 4 limbs. Any clue what is up with that? - Based on these results, I have
bingcdselect the "small" variant up to 7 limbs and the "large" beyond that. Anything else I can do to improve this?
| limbs | gcd | bingcd | factor | bingcd (small) | bingcd (large) |
|---|---|---|---|---|---|
| 1 | 2.6475 µs | 363.93 ns | 7.3x | 342.17 ns | 1.1385 µs |
| 2 | 6.9068 µs | 1.3533 µs | 5.1x | 1.2378 µs | 3.2917 µs |
| 3 | 12.677 µs | 2.7281 µs | 4.6x | 2.5826 µs | 5.5247 µs |
| 4 | 42.896 µs | 4.7267 µs | 9.1x | 4.5286 µs | 6.6132 µs |
| 5 | 27.987 µs | 7.1246 µs | 3.9x | 6.8353 µs | 10.100 µs |
| 6 | 38.466 µs | 10.136 µs | 3.8x | 9.9228 µs | 11.977 µs |
| 7 | 49.654 µs | 13.621 µs | 3.6x | 13.614 µs | 14.921 µs |
| 8 | 63.467 µs | 15.749 µs | 4.0x | 17.757 µs | 15.545 µs |
| -- | -- | -- | -- | -- | -- |
| 16 | 632.80 µs | 44.272 µs | 14.3x | 76.077 µs | 44.593 µs |
| 32 | 3.1699 ms | 137.51 µs | 23.1x | 447.11 µs | 137.29 µs |
| 64 | 12.827 ms | 489.57 µs | 26.2x | 1.7868 ms | 499.83 µs |
| 128 | 56.750 ms | 2.3381 ms | 24.3x | 8.7670 ms | 2.3186 ms |
| 256 | 243.97 ms | 8.7289 ms | 27.9x | 34.823 ms | 8.7274 ms |
There is still significant room for improvement for the "large" variant. In each of its (2*Uint::BITS-1)/62 outer loop iterations, it multiplies a and b by a I64 matrix that shrinks their size by at least 62 bits. After iteration X, we thus know that either
-
aandbare at most $BITS - 62X$ bits long, or - the matrix is the unit matrix.
Assuming the first case, we can thus improve this multiplication by not considering some of the top limbs of a and b. Over the full algorithm, this would equal a ~1.8x reduction for time spent on those multiplications.
Aside from these multiplications, the "large" variant is linear in Uint::BITS. Interpolating from the benchmarks, the fraction of the time spent on these multiplications is roughly estimated as $$L/(L+6.85)$$ for $L$ limbs, which is already equal to ~70% for L=16 (U1024). The benefit would already be ~~30% for U1024, and increasingly more for larger limb sizes.
This would require the development of a "bounded" multiplication feature. Would this project be interested in that?
Spoiler: The upcoming xgcd feature I'm developing would enjoy the benefit ~3x.
@tarcieri @ycscaly, I think this PR has stabilized for review (truly now :see_no_evil:). Looking forward to your reviews!
@erik-3milabs I just set master to v0.7.0-pre in #765
This means you can now make breaking changes, such as removing the existing safegcd implementation and changing trait impls like Gcd and InvMod to use bingcd instead
@erik-3milabs I just set
masterto v0.7.0-pre in #765This means you can now make breaking changes, such as removing the existing safegcd implementation and changing trait impls like
GcdandInvModto usebingcdinstead
@tarcieri While I could still modify the Gcd trait to use this algorithm, this PR does not yet introduce the tools necessary to replace InvMod.
Aside from that, what else would be required to see this PR merged?
Aah, lack of invmod support would definitely be a problem. Is it something you plan on addressing eventually? My understanding is, like safegcd, that invmod is a big part of binary GCD's intended usage.
It seems a little weird to have multiple implementations of GCD algorithms which effectively do the same thing, though to completely replace safegcd in addition to invmod you'd also need support for boxed types (though with const_mut_refs stable it should be a lot easier to share an implementation).
Also seems it needs a rebase due to upstream changes.
Aah, lack of invmod support would definitely be a problem. Is it something you plan on addressing eventually? My understanding is, like safegcd, that invmod is a big part of binary GCD's intended usage.
You're right. This PR only introduces the gcd algorithm; PR #761 extends the algorithm into xgcd. Stripping some things from the xgcd algorithm gives invmod. Given that I don't need invmod myself, I am not too keen on implementing it :see_no_evil:
It seems a little weird to have multiple implementations of GCD algorithms which effectively do the same thing, though to completely replace safegcd in addition to invmod you'd also need support for boxed types (though with
const_mut_refsstable it should be a lot easier to share an implementation).
I agree that having two algorithms is overkill. Let me see about implementing this for Boxed<X> as well.
Also seems it needs a rebase due to upstream changes.
Yeah, you're right. Let me address that right away.
I'm incredibly interested in this PR as someone who:
- Needs a constant-time GCD
- Has >50% of my runtime spent on
safegcd::divstepsalone right now
I actually need an xgcd (which #761 solves), it's just trivial to inefficiently implement xgcd given gcd (by manually calculating ((a / g) % (b / g))**-1 for the first coefficient and solving from there for the second). I'd love to see this merged ASAP accordingly and want to ask if anything is a current blocker I can help with.
Whoops, fat fingered the close button trying to reply.
While this is probably in OK shape as is I would preferably like to see it as the only GCD implementation, rather than having two.
I think it’s fine to retain safegcd for inversions until this can be extended to support inversions as well, but I’d really prefer for there to be one algorithm used for GCD for both boxed and unboxed types.
Heard, especially if the goal of 0.7.0 is a long-lived release without continued maintenance of safegcd.
modinv should be trivial as just the x coefficient from xgcd, Some if gcd == 1, None otherwise, right?
Thanks for the quick reply and for letting me ensure I have context!
@erik-3milabs do you want to look into making this the primary/only GCD implementation?
Otherwise I can potentially do that as a followup
Hmm, merging this broke the build due to a few deprecations, but hopefully we can get those fixed up easily