algebra
algebra copied to clipboard
Merge celo/zexe w/ arkworks
Features:
- Batch Affine ops
- Batch and ordinary w-NAF
- GLV framework. GLV precomputation script.
- GLV impl for both batch affine and projective mul
- Batch bucket addition tree
- Batched MSM (1.4-1.7x bump)
- Batch subgroup verification based on addition to random buckets
- 12-limb .s asm impl for montgomery mul and constant-time add/sub ~~2D test matrix feature-gating with axes (curve, test type)~~ ~~Refactor tests to use common template~~ ~~Code timing instrumentation with fine grained control of functions and callers~~
Combined speedup (with w-NAF) for:
- Batched mul (with GLV): 2.3-2.7x.
- Non-batch projective mul (with GLV): 2.1x.
- Without GLV: 1.65-1.8x and 1.4x resp.
Speedup for subgroup verification:
- Over old impl 25-80x.
- Over new batch mul: 10-30x.
Spec sheet (also included in code comments): https://hackmd.io/@celo/S10UeWUBD
I am not willing to break this PR into smaller PRs as that was already done in Zexe, and now I have had to repeat work to merge again to arkworks, so it's too much work, no thank you.
TODOs:
- Add glv generation script.
- Sister PR for curves which implement glv, cuda, test feature gating, tests for new features etc.
- Feature gate for BigInt768 asm (.s rather than ff-asm).
Before we can merge this PR, please make sure that all the following items have been checked off. If any of the checklist items are not applicable, please leave them but write a little note why.
- [ ] Targeted PR against correct branch (main)
- [X] Linked to Github issue with discussion and accepted design OR have an explanation in the PR that describes this work.
- [X] Wrote unit tests
- [X] Updated relevant documentation in the code
- [ ] Added a relevant changelog entry to the
Pendingsection inCHANGELOG.md - [ ] Re-reviewed
Files changedin the Github PR explorer
Thanks for the PR and adapting it to arkworks!
Quick question: is the GPU scalar multiplication used to implement any operations (fixed/variable base MSMs/batch subgroup checks/batch muls)? I couldn't figure it out at a quick glance.
Also, I'd really appreciate it if you could you could point out the parts that need the closest review. (Since this is a large PR, if you could suggest a "review plan" so that I can figure out how to navigate the PR, that would also be great.)
Thanks for working on adapting these great changes to arkworks and upstreaming them!
I'd like to ideally split this PR into many smaller PR's for easier review, would you like me to take a stab at doing this, or would you like to?
To start with, I'd like to move the ark-serialize changes into its own PR, since that should be modular Similary, a separate PR for the bw6 ASM
Ah sorry, didn't see the comment regarding not being willing to split up this PR. Is it fine if Pratyush and I go through and split up this into smaller PRs, and review/merge the smaller ones accordingly?
Hi thanks all for the comments. I apologise that I have been feeling rather fatigued wrt to the PR due to the tediousness and also its size.
With regards to areas about which I have some doubts,
- the CUDA code could be improved, maybe some thinking about how we would like to organise the code/expose interfaces/conditionally compile CUDA support. Currently, there is a lot of intermixing of locations and dependencies, and no clean separation of duties. One thing I think could make the difference is to have a CUDAKernel trait that is only a bound for relevant parameters etc when conditionally compiled with feature = "cuda". This is very satisfactory to me
To be continued.
Thanks for the PR and adapting it to arkworks!
Quick question: is the GPU scalar multiplication used to implement any operations (fixed/variable base MSMs/batch subgroup checks/batch muls)? I couldn't figure it out at a quick glance.
So, this is specifically for batch multiplication. One thing I would like to improve is to create a more general framework for incorporating GPU kernels of all sorts, (including variablebaseMSM etc), which require instantiation at the parameter level due to needing to compile for specific curves without generics.
In the future, perhaps something so ugly will not have to be relied upon, e.g. if something like https://github.com/EmbarkStudios/rust-gpu/issues/387 comes to be.
Also, I'd really appreciate it if you could you could point out the parts that need the closest review. (Since this is a large PR, if you could suggest a "review plan" so that I can figure out how to navigate the PR, that would also be great.)
I will do this.
Currently I'm restarting this PR and will break into 4 components:
- batch inversion group arithmetic
- glv framework and scripts
- things using batch inversion-based methods, like batch verification and batch MSM
- cuda batch scalar mul
Number 4 is low priority as the code is spaghetti and also the API is very uncertain.
This PR is considered number 4, and I will stage the PRs so they build atop one another. Since this introduces breaking changes with curves, the corresponding curves changes will also be made.
I've decided I will complete this PR, as sickening as it is. I've already wasted enough time on porting this to arkworks that I decided I should waste even more time to at least get it done.
I'm very sickened that instead of having to write code and get things done, I have to waste time doing stupid refactoring and changing things everywhere. What a useless waste of time.
I'm thinking of migrating the GPU segments of the code towards rust-gpu, partly because
- I don't feel this segment is particularly urgent
- The workarounds to use accel and also the modifications requiring maintaining a segment of it are not worthwhile.
- accel = spaghetti code
- rust-gpu is obviously the future ;P
So there you have it. The timeline for rust-gpu to become mature enough coincides with my imagined timeline for the development of a robust gpu component to arkworks.
The major targets are compute intensive workloads like proving (MSM), batch scalar muls (currently implemented), batch verifications (on curve checks, signatures). Currently, such functionality centers around small number of functions.
Other potential things are FFTs.
Something else I would like to look into is how to combine something like rayon with rust-gpu.