num-bigint Implement number theoretic transform for large integer multiplication

Implement number theoretic transform for large integer multiplication

Open byeongkeunahn opened this issue 1 year ago • 2 comments

This commit implements number theoretic transform (NTT) for large integer multiplication (issue #169).

To simplify implementation the Schönhage–Strassen algorithm was not used. Instead, three distinct 64-bit primes were carefully chosen to enable NTT up to ~10^18 64bit integers, which allows multiplication up to ~5 x 10^17 64bit integers. Depending on the input length either two or three primes are used, with the latter only used when the inputs consist of at least 2^40 64bit integers. The convolution results modulo primes are merged using the Chinese Remainder Theorem.
To reduce padding and the number of cycles for NTT, multiple radices are used (radix-2, radix-3, radix-4, radix-5, radix-6). Radix-8 is desirable but actually slower, presumably due to register spill. Although manual SIMD coding may alleviate this issue, it was not used (i) for maximum portability and (ii) since 64bit SIMD multiply is not widely available even on x86/x64 platforms (AVX512). The prime moduli are carefully chosen to support these radices.
- The NTT length is selected by exhaustively evaluating cost estimates for all allowed lengths within a factor of two.
Single-word (u64) Montgomery reduction is used for fast modular multiplication.
32bit digits are supported by repacking the digits into u64, running the u64 algorithm, and converting back to u32. This results in 32bit builds being about 3-5x slower compared to 64bit builds, which, however, still is an improvement upon the existing algorithms.
Unbalanced multiplication is enabled when the cost estimates are favorable.
Based on experimentation, the following thresholds are chosen:
- For u64 digits, switch to NTT if the shorter integer has at least 512 digits.
- For u32 digits, switch to NTT if the shorter integer has at least 2,048 digits.
The three primes are as follows:
- P1 = 10_237_243_632_176_332_801, Max NTT length = 2^24 * 3^20 * 5^2 = 1_462_463_376_025_190_400
- P2 = 13_649_658_176_235_110_401, Max NTT length = 2^26 * 3^19 * 5^2 = 1_949_951_168_033_587_200
- P3 = 14_259_017_916_245_606_401, Max NTT length = 2^22 * 3^21 * 5^2 = 1_096_847_532_018_892_800
- P1 and P2 are used for the two-prime NTT, whereas the three-prime NTT uses all three.
The following MIT-licensed projects are used as reference:
- http://wwwa.pikara.ne.jp/okojisan/otfft-en/index.html
- https://github.com/Bubbler-4/cg263748

On Ryzen 7 2700X, 64bit, it takes about 15ms for 2.7Mbits x 2.7Mbits and 170ms for 27Mbits x 27Mbits multiplication. This seems comparable to GMP 6.2.1.

benchmark-20230829

Aug 28 '23 14:08 byeongkeunahn

num-bigint num-bigint copied to clipboard

Implement number theoretic transform for large integer multiplication

num-bigint
num-bigint copied to clipboard