flint small prime FFT based on ulong

This PR aims to have an ulong-based version of small prime FFT. This is a draft, comments and suggestions highly welcome (on any aspect: for example I have no idea if n_fft is relevant naming).

For the moment, the features implemented are:

forward FFT, inverse FFT, transposed forward FFT, transposed inverse FFT
restriction on the modulus: it must be 62 bits at most (for performance reasons)
length power of 2 (other lengths: zero padding means non-smooth timings between powers of 2)

Performance: observed on a few different machines, AMD zen 4 and various Intel. This slightly outperforms NTL's versions of the forward and inverse FFTs (acceleration of 0% to 30% depending on lengths). This is between 2 and 4 times slower, often around 3, than the vectorized floating point-based small-prime FFT in fft_small (or than the similar AVX-based version in NTL). This version uses no simd: enabling/disabling automatic vectorization does not change performance, and a straightforward "manual" vectorization should not bring much. The reason being that every few operations there is a full 64 bit multiplication (umul_ppmm) happening. (Still, I made some experiments that suggest avx could help, maybe substantially on AMD processors which have a very fast vpmullq, but I leave this aside for later.)

Planned:

more thorough testing files (for the transposed variants, which are only tested indirectly at the moment)
cleaning things here and there, add documentation
add mechanism to avoid too memory-consuming precomputation when a root of unity of very large order is available (maybe, in a first version, simply forbid transforms of length more than 2**25 or so?).

Planned, but likely not within this PR:

truncated FFT variants, for smooth performance when length varies from one power of 2 to the next
versions with strides, useful e.g. for polynomial matrices stored as a list of matrix coefficients (e.g., might help for the half-GCD algorithm)

Nov 09 '24 10:11 vneiger

Looks great so far!

Thoughts on support (smooth) non-power-of-two-sizes as an alternative or complement to truncation?

Do you plan to add threading? One of the weaknesses of fft_small for huge convolutions is that the FFTs are single-threaded. Also, huge FFTs should be bandwidth-constrained rather than arithmetic-constrained, especially if you have a lot of cores. Considering that you get nearly 25% more bits per coefficient with 62-bit modulus instead of 50-bit, n_fft should perform quite well asymptotically.

Nov 12 '24 10:11 fredrik-johansson

Thoughts on support (smooth) non-power-of-two-sizes as an alternative or complement to truncation?

To be sure about "(smooth) non-power-of-two-sizes". Do you mean, e.g. on a very specific case: if the size is just below $3^k$, use radix-3 FFT, with a root of unity of order $3^k$? (and more generally, only slightly overshoot the actual size by some size that factors into small primes?)

Do you plan to add threading? One of the weaknesses of fft_small for huge convolutions is that the FFTs are single-threaded. Also, huge FFTs should be bandwidth-constrained rather than arithmetic-constrained, especially if you have a lot of cores. Considering that you get nearly 25% more bits per coefficient with 62-bit modulus instead of 50-bit, n_fft should perform quite well asymptotically.

Ok, thanks for the insight. I was not sure whether threading was an important goal for small-prime FFTs, that is good to know. This was not really in my plans for the near future, because I have other things I would like to make progress on (notably nmod_poly_mat), and because I do not have much hands-on experience with threading so it is not something I could do very quick (or this could lead to poor choices/design, poor usability, poor performance, etc.). But if someone with better threading skills wants to help, I will gladly collaborate.

Nov 13 '24 08:11 vneiger