plonky2 Cache-friendly tranpose

Our transpose function is taking about 6-7% of our proving time. Not a huge bottleneck, so it's probably not worth investing too much effort in this. But it seems like there may be straightforward ways to improve it, like the ones mentioned in this SO question.

Jul 20 '21 16:07 dlubarov

Tentatively assigning to @unzvfu but it's low priority, and could also be something Jakub could do.

Jul 20 '21 16:07 dlubarov

Oh, we could also use parallelism if it helps. We're not currently doing anything else during the transpose calls.

Jul 21 '21 23:07 dlubarov

I’m closing this issue because this transpose is not currently a bottleneck.

Sep 13 '21 18:09 nbgl

cc @Recmo who was thinking about helping with this, using a matrix type to ensure continuity and then using a cache-friendly algorithm. Please self-assign if you end up working on it.

In case it's useful, @nbgl had a matrix type in e9be861fe250dbfc64b8e208ddc83abdd0531d69 (was never merged). He mentioned it's some of the first Rust code he wrote, though it seems reasonable to me.

I think @unzvfu also had some ideas for ASM to do the smaller transposes in a CPU-efficient way, but I guess that can be added as an extra optimization afterward.

Jan 11 '22 18:01 dlubarov

I have a cache oblivious transpose here: https://github.com/recmo/OpenZKP/blob/master/algebra/primefield/src/fft/transpose.rs

Transposition will become the bottleneck in FFT once you go beyond CPU cache sizes and have solved all the other bottlenecks. In particular you need contiguously allocated memory (i.e. a nice Matrix class) and more sophisticated strategies for very large FFTs like Radix-Sqrt: https://github.com/recmo/OpenZKP/blob/master/algebra/primefield/src/fft/radix_sqrt.rs

I'd love to implement the FFT/transpose stuff once we have contiguous allocation.

I don't think ASM optimizing the small transposes will have a big impact. As I remember things are mostly memory-fetch constrained. In my impl. adding aggressive prefetching was more impactful than tuning the small cases.

Once the Radix-Sqrt recursion is small enough to fit in L1 or L2 cache you want to use simpler algorithms again that don't involve transposition. I expect only one or two RadixSqrt recursion to be enough to reduce it to multiple smaller problems.

Apr 07 '22 20:04 recmo

plonky2 plonky2 copied to clipboard

Cache-friendly tranpose

plonky2
plonky2 copied to clipboard