crypto-bigint icon indicating copy to clipboard operation
crypto-bigint copied to clipboard

Binary GCD

Open erik-3milabs opened this issue 11 months ago • 9 comments

This PR introduces an implementation of Optimized Binary GCD. Ref: Pornin, Algorithm 2.

Upsides to this technique:

  • it is up to 27x faster than the gcd algorithm currently implemented in crypto_bigint (see below) (really, it just has a different complexity bound).
  • does not need UNSAT_LIMBS
  • it is actually constant time (unlike the current implementation, which is sneakily vartime in the maximum of the bitsizes of the two operands).

Benchmark results

Word = u64

limbs gcd (vt) gcd (ct) new_gcd (ct)
2 10.687 µs 20.619 µs 3.6090 µs
4 29.121 µs 56.433 µs 7.1124 µs
8 99.819 µs 195.02 µs 16.184 µs
16 359.39 µs 710.26 µs 44.294 µs
32 1.6804 ms 3.3097 ms 136.49 µs
64 6.9717 ms 13.028 ms 494.16 µs
128 29.099 ms 57.325 ms 2.3335 ms
256 143.22 ms 244.89 ms 8.7722 ms

erik-3milabs avatar Jan 30 '25 15:01 erik-3milabs

@tarcieri To the best of my understanding this is constant-time code, but I will need some assistance verifying that

  • the compiler really compiles it into something resembling constant-time code that
  • a CPU truly executes this in constant time (and does not have some tricks up its sleeve that I don't yet know about...)

erik-3milabs avatar Jan 30 '25 16:01 erik-3milabs

After noticing the issue with #753, I reran the benchmarks here too. The only thing I noticed was that removing the old vartime benchmark improved the old const-time benchmarks, primarily for Uints with few limbs. The table with the new results is presented below.

Moreover, I ran each benchmark thrice, and observed that

  • the new_gcd benchmark results are as stable as the old gcd results, and
  • the confidence interval of the new results are of the same order of magnitude as the old results (adjusted for the shorter running time).

Based on this, I'm quite confident that the bingcd satisfies the constant time requirement at least as well as the current gcd implementation.

limbs gcd (ct) new_gcd (ct) factor
2 6.9073 µs 3.6055 µs 1.9x
4 42.950 µs 7.1356 µs 6.0x
8 63.511 µs 16.244 µs 3.9x
16 633.55 µs 47.897 µs 13.2x
32 3.2049 ms 136.21 µs 23.5x
64 12.787 ms 498.34 µs 25.7x
128 56.710 ms 2.3501 ms 24.1x
256 244.89 ms 8.7418 ms 28.0x

Evidently, this technique predominantly shines for larger Uints. I would not be surprised if, with some tweaks, the bingcd performance for smaller cases could be further improved.

erik-3milabs avatar Jan 31 '25 12:01 erik-3milabs

Update:

  • I refactored the code, improving readability
  • I introduced the "small" version of the algorithm that achieves slightly faster results for smaller instances (i.e., uints with fewer limbs) than the original "large" version.

Benchmarks below. Note:

  • The original gcd algorithm is unexpectedly slow for 4 limbs. Any clue what is up with that?
  • Based on these results, I have bingcd select the "small" variant up to 7 limbs and the "large" beyond that. Anything else I can do to improve this?
limbs gcd bingcd factor bingcd (small) bingcd (large)
1 2.6475 µs 363.93 ns 7.3x 342.17 ns 1.1385 µs
2 6.9068 µs 1.3533 µs 5.1x 1.2378 µs 3.2917 µs
3 12.677 µs 2.7281 µs 4.6x 2.5826 µs 5.5247 µs
4 42.896 µs 4.7267 µs 9.1x 4.5286 µs 6.6132 µs
5 27.987 µs 7.1246 µs 3.9x 6.8353 µs 10.100 µs
6 38.466 µs 10.136 µs 3.8x 9.9228 µs 11.977 µs
7 49.654 µs 13.621 µs 3.6x 13.614 µs 14.921 µs
8 63.467 µs 15.749 µs 4.0x 17.757 µs 15.545 µs
-- -- -- -- -- --
16 632.80 µs 44.272 µs 14.3x 76.077 µs 44.593 µs
32 3.1699 ms 137.51 µs 23.1x 447.11 µs 137.29 µs
64 12.827 ms 489.57 µs 26.2x 1.7868 ms 499.83 µs
128 56.750 ms 2.3381 ms 24.3x 8.7670 ms 2.3186 ms
256 243.97 ms 8.7289 ms 27.9x 34.823 ms 8.7274 ms

erik-3milabs avatar Feb 03 '25 12:02 erik-3milabs

There is still significant room for improvement for the "large" variant. In each of its (2*Uint::BITS-1)/62 outer loop iterations, it multiplies a and b by a I64 matrix that shrinks their size by at least 62 bits. After iteration X, we thus know that either

  • a and b are at most $BITS - 62X$ bits long, or
  • the matrix is the unit matrix.

Assuming the first case, we can thus improve this multiplication by not considering some of the top limbs of a and b. Over the full algorithm, this would equal a ~1.8x reduction for time spent on those multiplications.

Aside from these multiplications, the "large" variant is linear in Uint::BITS. Interpolating from the benchmarks, the fraction of the time spent on these multiplications is roughly estimated as $$L/(L+6.85)$$ for $L$ limbs, which is already equal to ~70% for L=16 (U1024). The benefit would already be ~~30% for U1024, and increasingly more for larger limb sizes.

This would require the development of a "bounded" multiplication feature. Would this project be interested in that?

Spoiler: The upcoming xgcd feature I'm developing would enjoy the benefit ~3x.

erik-3milabs avatar Feb 03 '25 12:02 erik-3milabs

@tarcieri @ycscaly, I think this PR has stabilized for review (truly now :see_no_evil:). Looking forward to your reviews!

erik-3milabs avatar Feb 03 '25 12:02 erik-3milabs

@erik-3milabs I just set master to v0.7.0-pre in #765

This means you can now make breaking changes, such as removing the existing safegcd implementation and changing trait impls like Gcd and InvMod to use bingcd instead

tarcieri avatar Feb 14 '25 19:02 tarcieri

@erik-3milabs I just set master to v0.7.0-pre in #765

This means you can now make breaking changes, such as removing the existing safegcd implementation and changing trait impls like Gcd and InvMod to use bingcd instead

@tarcieri While I could still modify the Gcd trait to use this algorithm, this PR does not yet introduce the tools necessary to replace InvMod.

Aside from that, what else would be required to see this PR merged?

erik-3milabs avatar Mar 11 '25 14:03 erik-3milabs

Aah, lack of invmod support would definitely be a problem. Is it something you plan on addressing eventually? My understanding is, like safegcd, that invmod is a big part of binary GCD's intended usage.

It seems a little weird to have multiple implementations of GCD algorithms which effectively do the same thing, though to completely replace safegcd in addition to invmod you'd also need support for boxed types (though with const_mut_refs stable it should be a lot easier to share an implementation).

Also seems it needs a rebase due to upstream changes.

tarcieri avatar Mar 11 '25 15:03 tarcieri

Aah, lack of invmod support would definitely be a problem. Is it something you plan on addressing eventually? My understanding is, like safegcd, that invmod is a big part of binary GCD's intended usage.

You're right. This PR only introduces the gcd algorithm; PR #761 extends the algorithm into xgcd. Stripping some things from the xgcd algorithm gives invmod. Given that I don't need invmod myself, I am not too keen on implementing it :see_no_evil:

It seems a little weird to have multiple implementations of GCD algorithms which effectively do the same thing, though to completely replace safegcd in addition to invmod you'd also need support for boxed types (though with const_mut_refs stable it should be a lot easier to share an implementation).

I agree that having two algorithms is overkill. Let me see about implementing this for Boxed<X> as well.

Also seems it needs a rebase due to upstream changes.

Yeah, you're right. Let me address that right away.

erik-3milabs avatar Mar 14 '25 08:03 erik-3milabs

I'm incredibly interested in this PR as someone who:

  1. Needs a constant-time GCD
  2. Has >50% of my runtime spent on safegcd::divsteps alone right now

I actually need an xgcd (which #761 solves), it's just trivial to inefficiently implement xgcd given gcd (by manually calculating ((a / g) % (b / g))**-1 for the first coefficient and solving from there for the second). I'd love to see this merged ASAP accordingly and want to ask if anything is a current blocker I can help with.

kayabaNerve avatar Apr 02 '25 04:04 kayabaNerve

Whoops, fat fingered the close button trying to reply.

While this is probably in OK shape as is I would preferably like to see it as the only GCD implementation, rather than having two.

I think it’s fine to retain safegcd for inversions until this can be extended to support inversions as well, but I’d really prefer for there to be one algorithm used for GCD for both boxed and unboxed types.

tarcieri avatar Apr 02 '25 04:04 tarcieri

Heard, especially if the goal of 0.7.0 is a long-lived release without continued maintenance of safegcd.

modinv should be trivial as just the x coefficient from xgcd, Some if gcd == 1, None otherwise, right?

Thanks for the quick reply and for letting me ensure I have context!

kayabaNerve avatar Apr 02 '25 07:04 kayabaNerve

@erik-3milabs do you want to look into making this the primary/only GCD implementation?

Otherwise I can potentially do that as a followup

tarcieri avatar Apr 07 '25 13:04 tarcieri

Hmm, merging this broke the build due to a few deprecations, but hopefully we can get those fixed up easily

tarcieri avatar Jun 15 '25 18:06 tarcieri