double-and-add algorithm
The projective-coordinate addition operator is overloaded to compute scalar multiplication with a naive double-and-add algorithm. It would be worth specializing this operation by switching to Montgomery multiplication (CIOS) for more efficient modular multiplications.
Here's an efficient implementation: https://github.com/MinaProtocol/gpu-groth16-prover-3x/blob/master/multiexp/arith.cu#L289. It uses cooperative groups and thread ranks in CUDA. For your codebase, it would simply be a matter of swapping the 'fixnum' calls for equivalent PTX instructions implemented in 'ptx.cuh'.
Noted! Thanks Tal. We will improve the double-and-add. Does the Mina code implement the state of the art? BTW, feel free to open a PR.
cc: @DmytroTym, @mickeyasa
Sure! It's a stripped-down version of cuda-fixnum (an extended-precision modular arithmetic library) built by Hamish from Polygon Zero. It's an older repository, but it's performant: https://github.com/data61/cuda-fixnum.
I feel like two different things are at play here: modular multiplication, which is what Montgomery multiplication does, and single-scalar multiplication, which is what double-and-add does. Admittedly, we have room to improve in both. For modular multiplication, we want to see whether our scheme based on Barrett reduction can be competitive with Montgomery CIOS; if not, we'll switch to CIOS. We only have basic Barrett implemented for now. As for double-and-add, you're right that it's suboptimal; we're planning to move to more efficient algorithms in the near future.
To comment on this: we haven't implemented a more efficient single-scalar multiplication than double-and-add yet, but we don't really have any use cases for it, so it's still low priority.