plonky2
plonky2 copied to clipboard
Faster multiplication by generator
Was skimming the code, and couldn't find any sped up routine for multiplying by the primitive root for the goldilocks field.
I believe this is an operation that should be getting done on O(N) points for an N-element proof, due to taking a LDE of the input onto a coset. (Or having the codeword be a coset, though my recollection is thats actually slower)
The goldilocks field's smallest generator is 7
, so should have a faster method of computing it via additions. If theres a faster mul_by_2
routine on this field, then it could be (roughly):
def mul_by_primitive_root(x):
x2 = x.mul2()
x3 = x2
x3 += x
x6 = x3.mul2()
x7 = x6
x7 += x
return x7
Unknown how many times in the code you actually multiply by the primitive root, or how much of a speedup factor this would be in the small field, but thought I'd mention it in case it helps. (In Arkworks for BLS12-381, I at some point thought this would be more than 2x as fast as the general multiplication. But thats a much larger field, and because multiplying by two was twice as fast as normal addition)
I haven't benchmarked it in the context of plonky2, so my comment is to be taken with a pinch of salt, but with this field size, the ratio between addition and multiplication is much smaller than with large fields (like on BLS12-381 for instance where it is around 4x/5x if I recall correctly?). In our curve implementation, the ratio is around 1.4, so I doubt such approach would yield any advantage, but maybe my assumptions are wrong with how arithmetic is performed here, especially with assembly instructions..
EDIT: To complete, in our approach, multiplying by the generator (or any u32
value) is done through a specific mul_by_u32()
method which mimics the reduction of regular field multiplication, but ignores the highest limb (which is 0 here), effectively being faster than regular multiplication.
Yeah seems like a good idea to special case 7. As Robin said it might be best to still use multiplication, but the reduction should be somewhat faster.
I think a 96-bit reduction would have the same cost as a 67-bit reduction (cc @nbgl, correct me if wrong), so we could implement mul_by_u32
and it might have other uses.
We used to do something like that for our (now deleted) CrandallField
, and there's still a from_noncanonical_u96
method in the code, though it's not implemented yet for GoldilocksField
.
I don't think multiplication by a generator is a significant cost for us so we might not immediately notice a speedup, but it could add up if we also used mul_by_u32
when evaluating certain constraints etc.