algebra Make FFTs nlog(d)

FFTs in the library are all nlog(n). However they could be nlog(d), which is an efficiency improvement when doing polynomial multiplication.

This can be taken advantage of in two possible ways:

Split up the FFT into n/d independent FFTs operating over the same polynomial, but evaluating on distinct cosets. Both of these sub-FFTs are dlog(d). At the end concatenate these two vectors.
Skip the first few rounds of butterflying / arithmetic in the current FFT algorithm, and just copy the polynomial coefficients into the correct positions. This is implemented in libiop.

This probably doesn't matter for the current pairing based SNARKs as they are dominated by MSM time.

Jan 14 '20 23:01 ValarDragon

I have been intending to refactor the FFT infrastructure a bit, but I don't have too much experience with this; any help would be greatly appreciated!

Other optimizations are also appreciated!

Jan 20 '20 16:01 Pratyush

What is d? What is the progress on this issue?

Apr 03 '20 06:04 jon-chuang

d is the degree of the polynomial being evaluated, and N is the size of the domain

Apr 03 '20 06:04 Pratyush

@ValarDragon 30-70 split is not really "dominated by MSM". A 10% improvement in FFT would translate to 3% overall improvement.

What I don't understand is what these cosets/domains are. I understand it when there are multiple polynomials. However, why isn't the domain just d? Don't we always mod out by the same (X^d-1)?

Apr 03 '20 06:04 jon-chuang

The FFT/MSM split depends on your circuit size and protocol choice. Agreed that improving it is important, especially at large circuit sizes.

There are multiple polynomials. RS-IOP/AHP based SNARKs commit to many polynomials, and typically do pointwise multiplications on them, hence requiring the larger domain size.

Because your protocols typically divide by vanishing polynomials, you often can't evaluate your polynomials on the systematic domain, and have to evaluate them on cosets.

Apr 03 '20 09:04 ValarDragon

This would help any time we do a polynomial multiplication, since the domain is necessarily equal to the sum of the degrees of the involved polynomial, and is hence usually at least double their degree.

Sep 29 '22 00:09 Pratyush

@andrewmilson this change becomes possible with your work in #487; if you would like to take a stab at implementing the first approach @ValarDragon mentioned, that would be fantastic!

Basically, each evaluation domain of size n can be decomposed into n/d copies of itself, multiplied by a correction factor. Eg, for n = 4, the domain D_4 equals [1, ω_4, ω_4^2, ω_4^3]. For n = 2, the domain D_2 equals [1, ω_2]. Since ω_2^2 = 1 = ω_4^4, we have that ω_2 = ω_4^2. Hence, we can decompose D_4 into D_2 and a coset domain of D_2 with offset ω_4.

Sep 29 '22 00:09 Pratyush

Thanks @Pratyush. Yeah keen to take a stab!

Sep 29 '22 01:09 andrewmilson

Awesome! Let me know if you have any questions about the approach; you can ask here or on Discord.

On Sep 28, 2022, at 6:01 PM, Andrew Milson @.***> wrote:

Thanks @Pratyush https://github.com/Pratyush. Yeah keen to take a stab!

— Reply to this email directly, view it on GitHub https://github.com/arkworks-rs/algebra/issues/83#issuecomment-1261619865, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAYSJ6VLKV2XS7TV5SLCEETWATS5ZANCNFSM4T4D4YUQ. You are receiving this because you were mentioned.

Sep 29 '22 01:09 Pratyush

@Pratyush unless I'm missing something I think it might be easier going straight to the second approach @ValarDragon mentioned.

With the first approach It's easy to obtain the values for the smaller FFTs evaluated on the coset domains but transposing the values so they are in order is not so easy. I haven't spent too much time looking into it but I don't think it's trivial to do without allocating extra memory.

Sep 29 '22 12:09 andrewmilson

You're right that the shuffles to put things in place are somewhat non-trivial, and might cause bad non-locality. It would be good to check what the situation would be like for the other approach.

Re: extra memory, we will need to allocate extra memory either way, since the output (evaluations) is larger than the input (polynomial)

Sep 29 '22 20:09 Pratyush

With the first approach I mean additional memory on top of resizing the coefficients to the size of the domain.

The second approach has better locality and no need to allocate more memory than necessary. I put together a quick draft PR #493 just for ideas. I know you mentioned above you are "intending to refactor the FFT infrastructure". I'd be keen to hear what your thoughts are here. I'm more than happy to take the implementation further, just let me know.

Sep 30 '22 02:09 andrewmilson

algebra algebra copied to clipboard

Make FFTs nlog(d)

algebra
algebra copied to clipboard