plonky2 icon indicating copy to clipboard operation
plonky2 copied to clipboard

Faster multiplication by generator

Open ValarDragon opened this issue 2 years ago • 2 comments

Was skimming the code, and couldn't find any sped up routine for multiplying by the primitive root for the goldilocks field.

I believe this is an operation that should be getting done on O(N) points for an N-element proof, due to taking a LDE of the input onto a coset. (Or having the codeword be a coset, though my recollection is thats actually slower)

The goldilocks field's smallest generator is 7, so should have a faster method of computing it via additions. If theres a faster mul_by_2 routine on this field, then it could be (roughly):

def mul_by_primitive_root(x):
  x2 = x.mul2()
  x3 = x2
  x3 += x
  x6 = x3.mul2()
  x7 = x6
  x7 += x
  return x7

Unknown how many times in the code you actually multiply by the primitive root, or how much of a speedup factor this would be in the small field, but thought I'd mention it in case it helps. (In Arkworks for BLS12-381, I at some point thought this would be more than 2x as fast as the general multiplication. But thats a much larger field, and because multiplying by two was twice as fast as normal addition)

ValarDragon avatar Jan 27 '23 08:01 ValarDragon

I haven't benchmarked it in the context of plonky2, so my comment is to be taken with a pinch of salt, but with this field size, the ratio between addition and multiplication is much smaller than with large fields (like on BLS12-381 for instance where it is around 4x/5x if I recall correctly?). In our curve implementation, the ratio is around 1.4, so I doubt such approach would yield any advantage, but maybe my assumptions are wrong with how arithmetic is performed here, especially with assembly instructions..

EDIT: To complete, in our approach, multiplying by the generator (or any u32 value) is done through a specific mul_by_u32() method which mimics the reduction of regular field multiplication, but ignores the highest limb (which is 0 here), effectively being faster than regular multiplication.

Nashtare avatar Jan 27 '23 13:01 Nashtare

Yeah seems like a good idea to special case 7. As Robin said it might be best to still use multiplication, but the reduction should be somewhat faster.

I think a 96-bit reduction would have the same cost as a 67-bit reduction (cc @nbgl, correct me if wrong), so we could implement mul_by_u32 and it might have other uses.

We used to do something like that for our (now deleted) CrandallField, and there's still a from_noncanonical_u96 method in the code, though it's not implemented yet for GoldilocksField.

I don't think multiplication by a generator is a significant cost for us so we might not immediately notice a speedup, but it could add up if we also used mul_by_u32 when evaluating certain constraints etc.

dlubarov avatar Jan 27 '23 20:01 dlubarov