Use a much better multiply algorithm which takes advantage of modern CPUs
This multiply algorithm is optimized for multiple platforms, both 32-bit and 64-bit.
Notably, instead of computing the full 128-bit product manually, it does a 64->128-bit long multiply on the low bits, followed by two normal 64-bit multiplies which are added to the high bits.
This takes advantage of native 64-bit arithmetic and, where supported, compiler intrinsics/extensions which do the long multiply for us.
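As a rough sketch of the structure (hedged: `u128`, `mult128`, and `umul64to128` are illustrative names made up for this description, not the actual identifiers in the patch):

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

/* 64 x 64 -> 128-bit long multiply; implementations are sketched below. */
static u128 umul64to128(uint64_t a, uint64_t b);

/* 128 x 128 -> 128-bit multiply. Three multiplies suffice because the
 * a.hi * b.hi partial product lands entirely in bits 128..255, which
 * are truncated away anyway. */
static u128 mult128(u128 a, u128 b)
{
    u128 result = umul64to128(a.lo, b.lo);   /* full 128-bit low product */
    result.hi += a.lo * b.hi + a.hi * b.lo;  /* cross terms, low 64 bits only */
    return result;
}
```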
This will use _umul128 or __uint128_t (except on wasm) if they are available. These expand to the single-operand MULQ instruction on x86_64, or MUL + UMULH on aarch64.
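For illustration, the intrinsic selection could look roughly like this (a sketch under the naming assumptions above, not the patch's exact code):

```c
#include <stdint.h>
#if defined(_MSC_VER) && defined(_M_X64)
#  include <intrin.h>   /* _umul128 */
#endif

typedef struct { uint64_t lo, hi; } u128;

/* Portable fallback, sketched after the next paragraph. */
static u128 umul64to128_generic(uint64_t a, uint64_t b);

static u128 umul64to128(uint64_t a, uint64_t b)
{
#if defined(__SIZEOF_INT128__) && !defined(__wasm__)
    /* GCC/Clang extension: one MULQ on x86_64, MUL + UMULH on aarch64. */
    __uint128_t product = (__uint128_t)a * b;
    u128 r = { (uint64_t)product, (uint64_t)(product >> 64) };
    return r;
#elif defined(_MSC_VER) && defined(_M_X64)
    /* MSVC x64: _umul128 is the one-operand MULQ; high half comes back
     * through the out-pointer, low half is the return value. */
    u128 r;
    r.lo = _umul128(a, b, &r.hi);
    return r;
#else
    return umul64to128_generic(a, b);
#endif
}
```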
Otherwise, the long multiply falls back to a simple yet fast grade-school multiply. It is written so the compiler can emit the powerful UMAAL instruction on ARM, but it doesn't require UMAAL to be fast (really, you only need a long multiply + ADC).
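A portable version of the helper might look like this (again a sketch; the middle-column accumulation is exactly the multiply-accumulate shape that maps onto UMULL/UMLAL/UMAAL):

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

/* Portable 64 x 64 -> 128-bit long multiply on 32-bit halves. */
static u128 umul64to128_generic(uint64_t a, uint64_t b)
{
    uint64_t lo_lo = (uint64_t)(uint32_t)a         * (uint32_t)b;
    uint64_t hi_lo = (uint64_t)(uint32_t)(a >> 32) * (uint32_t)b;
    uint64_t lo_hi = (uint64_t)(uint32_t)a         * (uint32_t)(b >> 32);
    uint64_t hi_hi = (uint64_t)(uint32_t)(a >> 32) * (uint32_t)(b >> 32);

    /* Middle column: one 32x32 product plus two 32-bit addends, which is
     * exactly the shape of UMAAL (Rn*Rm + RdLo + RdHi). The three terms
     * cannot overflow 64 bits. */
    uint64_t cross = hi_lo + (lo_lo >> 32) + (uint32_t)lo_hi;

    u128 r;
    r.lo = (cross << 32) | (uint32_t)lo_lo;
    r.hi = hi_hi + (cross >> 32) + (lo_hi >> 32);
    return r;
}
```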
Because it makes a large difference in performance, I added a check which switches off SSE2 on GCC for x86, and another which attempts to switch to ARM mode on Thumb-1 targets if it is available.
Switching off SSE2 on GCC for x86 prevents a laggy partial vectorization which is detrimental to performance (mainly because, instead of simple register moves, the vectorized code has to shuffle data with PSHUFD, PUNPCKLDQ, and PSLLQ). Clang doesn't need this.
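For illustration, the SSE2 opt-out can be expressed with GCC's per-function target attribute (a sketch; `MULT_TARGET` is an illustrative macro name, and the function body is elided):

```c
#include <stdint.h>
typedef struct { uint64_t lo, hi; } u128;

/* Sketch: on 32-bit x86 GCC, pin the multiply to scalar code so the
 * auto-vectorizer cannot emit the PSHUFD/PUNPCKLDQ/PSLLQ shuffle dance.
 * Clang already generates good scalar code here and is left alone. */
#if defined(__GNUC__) && !defined(__clang__) \
    && defined(__i386__) && defined(__SSE2__)
#  define MULT_TARGET __attribute__((__target__("no-sse2")))
#else
#  define MULT_TARGET
#endif

MULT_TARGET
static u128 mult128(u128 a, u128 b);  /* body as sketched earlier */
```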
Switching from Thumb-1 to ARM mode is crucial to making this algorithm usable.
ARM is well known for its powerful multiplier, with the already-excellent UMULL and UMLAL instructions, and, as of ARMv6, the mighty UMAAL, which allows a long multiply to be done in 4 instructions.
Thumb-1 has access to none of these, offering only a 32-bit -> 32-bit MUL. That would mean calling compiler runtime functions for something a simple mode switch can do in a quarter of the time or less.
This will be turned off if an M-profile core is detected (M-profile has no ARM mode) or if _UINT128_T_FORCE_THUMB is defined.
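A sketch of how that mode switch and its guards can be expressed (hedged: `MULT_TARGET_ARM` is an illustrative macro name; `__ARM_ARCH_PROFILE` is the ACLE macro identifying M-profile cores, and the `target("arm")` attribute requires a GCC version with per-function ISA selection):

```c
#include <stdint.h>
typedef struct { uint64_t lo, hi; } u128;

/* Sketch: compile the multiply in ARM (A32) mode on Thumb-1 targets so
 * UMULL/UMLAL/UMAAL are reachable. Skipped on M-profile cores (which
 * have no ARM mode) and when the user opts out. */
#if defined(__GNUC__) && defined(__thumb__) && !defined(__thumb2__) \
    && (__ARM_ARCH_PROFILE != 'M') && !defined(_UINT128_T_FORCE_THUMB)
#  define MULT_TARGET_ARM __attribute__((__target__("arm")))
#else
#  define MULT_TARGET_ARM
#endif

MULT_TARGET_ARM
static u128 umul64to128_generic(uint64_t a, uint64_t b); /* as sketched above */
```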
See the comments for more details.
MSVC still needs to be tested; I will try to do that soon.