Improve performance for modular arithmetic
Right now, there are numbers we know we can't factor at compile time: see comments in #217. We're using too many steps, and clang bails. However, those comments also suggest that it should be possible to get away with far far fewer steps. It turns out that we are spending almost all of our steps on our custom implementation of mul_mod. If we can find a much more efficient implementation, we should be able to factor a much wider variety of numbers. Leading candidate options include:
- Implement Montgomery/REDC form
- Do manual "double-wide" multiplication and mod
Whichever route we go, getting a more efficient mul_mod should also make it worthwhile to switch to a "batched" implementation of Pollard's rho, which replaces many GCD calls (slow no matter what) with modular multiplication calls (which can be made fast). The batched implementation is a pretty marginal win if we implement it now (and is in some cases a little bit worse), but is a much bigger win if multiplication is cheap.