Improve twiddle access for FFT
We implemented Bowers et al. "Improved Twiddle Access for Fast Fourier Transforms" https://doi.org/10.1109/TSP.2009.2035984
In our experiments this is 10%+ faster than the existing DIT implementation. The algorithm turns out to be surprisingly simple, because it's just like DIT, but with a DIF butterfly! 🤯
The key is that, by rearranging the computation, the twiddle is constant within the innermost loop, so you have fewer memory accesses. Because it was simple enough, I've also manually unrolled the $w^0$ case to save one multiplication. A sketch of the loop structure follows.
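To make that concrete, here is a minimal sketch of the Bowers et al. network (bit-reversed input, natural-order output), written over a generic field-like type. This is an illustration, not the PR's actual diff; the `reverse_bits` helper, the `roots` layout (`roots[t]` holding $w^t$ for $t < n/2$, i.e. the single "row"), and the per-block twiddle indexing are one consistent choice made for this sketch.

```rust
use core::ops::{Add, Mul, Sub};

/// Reverse the low `bits` bits of `x` (helper assumed for this sketch).
fn reverse_bits(x: usize, bits: usize) -> usize {
    if bits == 0 {
        0
    } else {
        x.reverse_bits() >> (usize::BITS as usize - bits)
    }
}

/// FFT in DIT order but with a DIF butterfly, after Bowers et al.
/// `values` must already be in bit-reversed order; output is natural order.
/// `roots[t]` is assumed to hold w^t for t in 0..n/2 (a single row).
fn fft_bowers<F>(values: &mut [F], roots: &[F])
where
    F: Add<Output = F> + Sub<Output = F> + Mul<Output = F> + Copy,
{
    let n = values.len();
    debug_assert!(n.is_power_of_two());
    let lg_n = n.trailing_zeros() as usize;
    for s in 1..=lg_n {
        let m = 1usize << s; // block size at this stage
        let half = m >> 1;
        for b in 0..(n / m) {
            // One twiddle per block, so `w` is constant in the inner loop.
            let exp = reverse_bits(b, lg_n - s) << (s - 1);
            let w = roots[exp];
            let base = b * m;
            for j in base..base + half {
                // DIF-style butterfly: (u, v) -> (u + v, (u - v) * w).
                // When exp == 0, w == 1 and the multiply can be skipped;
                // that is the w^0 unrolling mentioned above.
                let (u, v) = (values[j], values[j + half]);
                values[j] = u + v;
                values[j + half] = (u - v) * w;
            }
        }
    }
}
```

Note that `w` is loop-invariant in the `j` loop, which is exactly the improved twiddle access described above, and that only one row of roots is ever indexed.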
Two things we didn't do (it's probably safer if you look at them later):
- The `roots_table` is no longer really a table; we only need one row, so you could slightly simplify the API.
- I'm not sure I fully understood the trick with $r>0$ and the sparse matrix. I left the existing test in place, so this new implementation is correct; however, I removed that code. IMO with this new algorithm you no longer get that improvement, but I might be wrong.
In summary:
- less memory for `root_table` (only one row)
- fewer memory accesses, thanks to the constant twiddle
- pretty much the same algorithm & code complexity (just a different butterfly)
- removed the special case for $r>0$, but unrolled the $w^0$ case (trying to balance simplicity vs. extreme perf)
- in our experiments, this is 10%+ faster in pretty much all the cases we tested
Interesting, thanks for the contribution! Probably @unzvfu or @nbgl would be best to review.
It looks like `fft_root_table` may have a couple of issues with the `n=1` edge case; could you try (temporarily) changing the `fft_and_ifft` test to use `let degree = 1usize;`? We should probably have a separate test for `n=1`, but I can add that later.
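For what it's worth, here is a minimal standalone illustration of why `n = 1` is a worthwhile edge case; this is not the plonky2 `fft_and_ifft` test, just a sanity check against the sketch in the PR description, where `lg_n = 0` means the stage loop must run zero times and the roots row is empty.

```rust
#[test]
fn fft_n_equals_1() {
    // The DFT of a single point is the point itself: the stage loop must
    // not execute, and the empty roots slice must never be indexed.
    let mut values = [3.0f64];
    let roots: [f64; 0] = [];
    fft_bowers(&mut values, &roots);
    assert_eq!(values, [3.0]);
}
```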
That’s a really neat trick! Looks good at first glance. I’ll need to refamiliarize myself with this (dense) code to do a thorough review, so please be patient :)
@nbgl btw, we looked at your Python code for the cache-oblivious FFT. We believe we did something similar in hardware; we call it multi-layer. The idea is to reduce the number of times you cycle through all the data: while a block of data is loaded, you perform multiple NTT layers on it. In hardware we implemented 3- and 6-stage versions, which lead to exactly 3x and 6x performance (a rough software sketch follows).
In software we didn't really do any analysis, but it's definitely interesting. In our experience this is completely orthogonal to the twiddle trick, so you can benefit from both. I can't share the paper publicly yet because it's under review, but if you reach out via email I'm definitely happy to send it over, if you're looking into this stuff.
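As a rough software analogue of the multi-layer idea (an assumption on my part, since the hardware design isn't public): in the sketch from the PR description, stages `1..=k` only pair elements inside aligned blocks of `2^k` entries, so those `k` stages can all be applied to one cache-resident chunk before moving to the next, instead of sweeping the whole array once per stage. This reuses `reverse_bits` from that sketch.

```rust
/// Run the first `k` stages of the Bowers-style loop chunk by chunk, so
/// each chunk streams through the cache once instead of `k` times.
fn fft_bowers_first_k_stages_blocked<F>(values: &mut [F], roots: &[F], k: usize)
where
    F: core::ops::Add<Output = F>
        + core::ops::Sub<Output = F>
        + core::ops::Mul<Output = F>
        + Copy,
{
    let n = values.len();
    let lg_n = n.trailing_zeros() as usize;
    debug_assert!(k <= lg_n);
    let chunk_len = 1usize << k;
    for (c, chunk) in values.chunks_exact_mut(chunk_len).enumerate() {
        for s in 1..=k {
            let m = 1usize << s;
            let half = m >> 1;
            for local_b in 0..(chunk_len / m) {
                // Recover the global block index so the twiddle indexing
                // matches the unblocked version exactly.
                let b = c * (chunk_len / m) + local_b;
                let w = roots[reverse_bits(b, lg_n - s) << (s - 1)];
                let base = local_b * m;
                for j in base..base + half {
                    let (u, v) = (chunk[j], chunk[j + half]);
                    chunk[j] = u + v;
                    chunk[j + half] = (u - v) * w;
                }
            }
        }
    }
}
```

The later stages pair elements across chunks, which is where the cache-oblivious recursion (or a transpose) earns its keep.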
I had a look at these changes (sorry it took a while) and they look good, except for two things:
- As you mentioned, `FftRootTable` should be simplified; it doesn’t make sense for it to be a vector anymore, as it’s only going to have one element.
- The code crashes on AVX2/AVX-512, because `PackedField`s are not handled quite correctly.

I’ve made those changes on a separate branch (see `jacqui/bowers`); feel free to look over them, merge them into your branch, and then we can merge this PR.
I’d also like to keep the `r` parameter. It’s an optimization rather than a correctness requirement, but it still speeds up a bunch of cases. I’m okay with making this a TODO for now, and I can fix it later.
@0x0ece The cache-oblivious FFT algorithm is not actually my invention; see [1]. The benefit is that you get asymptotically better cache complexity, and of course memory becomes the main bottleneck for big enough inputs.
[1] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. 2012. Cache-Oblivious Algorithms. ACM Trans. Algorithms 8, 1, Article 4 (January 2012), 22 pages. https://doi.org/10.1145/2071379.2071383
Oh, I should mention: vectorization only happens on x86 with AVX2 or AVX-512 when compiled with `-Ctarget-cpu=native`. That’s how I usually test `PackedField` code.
Thank you so much @nbgl, I merged your commits. Sorry, I hadn't realized about `-Ctarget-cpu=native`; I'll make sure to use it in the future, thanks for the hint!
As for the cache, I guess what I wanted to say is that, following the literature approach, there's a risk in the transpose. It's similar to this code here: you can write to RAM at shuffled locations, or you can write in order and then call `reverse_index_bits_in_place`. The latter is slower (about 100 ms at $2^{24}$) because you're writing twice; both options are sketched below.
What we developed is essentially a streaming version of the transpose. Again, we did it for an FPGA, but it'd be nice to see whether it also works in software.
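To illustrate the two options in a hedged sketch (everything except `reverse_index_bits_in_place` is a name invented for this example): scattering to bit-reversed positions writes each element to RAM once, while writing in order and permuting afterwards touches all the data a second time.

```rust
/// Option 1: scatter results directly into bit-reversed positions.
/// Each element is written exactly once, at a shuffled location.
fn write_scattered<F: Copy>(src: &[F], dst: &mut [F]) {
    debug_assert!(src.len().is_power_of_two() && src.len() >= 2);
    debug_assert_eq!(src.len(), dst.len());
    let bits = src.len().trailing_zeros() as usize;
    for (i, &x) in src.iter().enumerate() {
        dst[i.reverse_bits() >> (usize::BITS as usize - bits)] = x;
    }
}

// Option 2: write in order, then permute, paying a second pass over RAM:
//     dst.copy_from_slice(src);
//     reverse_index_bits_in_place(dst);
```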
Hi @nbgl, just checking in to see if you need anything else for this PR.
Hello!
This PR has been open for a long time; please review and update it as follows:
- If this PR is no longer relevant, close it.
- If this PR is relevant but not ready, mark it as Draft and leave a comment describing what else needs to be done.
- Fix any issues and get it merged/reviewed by 10/27/2023.

If there's no update by 10/27/2023, this PR will be closed.
Hey @0x0ece, apologies for never getting back to you!
I think we could still be interested in merging this. Could you rebase on the latest `main`, so that we can proceed with another round of review?
Also, I believe the comment from @nbgl about keeping the `r` parameter hasn't been addressed; was there any particular rationale against it?