Benchmark and accelerate `bn254` precompiles in `revm`
Background: `revm` uses the `substrate-bn` crate to implement the bn254 pairing precompile in the EVM. We recently added a `sys_bigint` syscall that performs uint256 `mulmod` and can be used to accelerate bn254 computations.
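For reference, a minimal sketch of how a guest program can call the syscall for a 256-bit modular multiplication; the `sys_bigint` signature and operand layout (eight little-endian u32 limbs, `op = 0` for mulmod) should be double-checked against the current `sp1-zkvm` crate:

```rust
// Sketch only: assumes sp1_zkvm exposes `sys_bigint` with this shape and that
// operands are 256-bit little-endian limb arrays, with op = 0 meaning mulmod.
use sp1_zkvm::syscalls::sys_bigint;

/// Computes (x * y) % modulus over 256-bit unsigned integers via the syscall.
fn mulmod_256(x: &[u32; 8], y: &[u32; 8], modulus: &[u32; 8]) -> [u32; 8] {
    let mut result = [0u32; 8];
    unsafe {
        sys_bigint(
            &mut result as *mut [u32; 8],
            0, // op = 0 selects multiplication; the result is reduced mod `modulus`
            x as *const [u32; 8],
            y as *const [u32; 8],
            modulus as *const [u32; 8],
        );
    }
    result
}
```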
- [ ] Profile `revm` bn254 precompile (`add`, `mul`, `pairing`) zkVM performance by writing an example program (can use one of their test vectors) that exercises the various bn254 precompiles and looking at the cycle count (a minimal sketch follows this list).
- [ ] Patch `substrate-bn` (https://github.com/paritytech/bn) to use our `sys_bigint` precompile (or other existing bn254 precompiles; we have ones for `add` and `double` here) to accelerate its performance.
- [ ] Benchmark the reduced cycle count of the example programs.
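A rough sketch of what the guest side of the profiling example could look like, assuming a standard SP1 guest crate layout with `substrate-bn` (lib name `bn`) as a dependency; the operations mirror the rows in the benchmark tables below:

```rust
// Guest program sketch: exercises bn254 group operations and the pairing from
// the `bn` (substrate-bn) crate so their cycle counts show up in the
// execution report. Uses the stock substrate-bn API (`G1`, `G2`, `Fr`, `pairing`).
#![no_main]
sp1_zkvm::entrypoint!(main);

use bn::{pairing, Fr, Group, G1, G2};

pub fn main() {
    let p = G1::one();
    let q = G2::one();
    let s = Fr::from_str("123456789").expect("valid scalar");

    let _sum = p + p;        // G1 addition
    let _prod = p * s;       // G1 scalar multiplication
    let _gt = pairing(p, q); // Miller loop + final exponentiation
}
```

On the host side the ELF can be run through the `sp1-sdk` executor and the total instruction count read from the execution report; that is the kind of cycle count reported in the tables below.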
Also, some helpful context from the folks at Nebra on whether to accelerate `substrate-bn` or swap in an `arkworks` implementation, which might be easier to accelerate with `sys_bigint`. They already have a branch that adds a `sys_bigint` backend to `arkworks`, which can be seen here.
> ...[substrate-bn] seems to have Montgomery representation hard-coded into it. This was also the case with the halo2 code, which was the other lib I considered. Given the mulmod syscall, which works with plain bigints (non-Montgomery), it seemed like sticking with plain representation would be the best (IIRC it's only muls that benefit from Montgomery form, so it would add unnecessary overhead inside the VM if we kept that form). The arkworks impl has a "backend" component factored out, so it seemed easiest to just provide a plain (non-Montgomery) backend.
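To make the representation point concrete: with a mulmod syscall that works on canonical (non-Montgomery) bigints, a plain-form field multiplication is a single syscall, whereas a Montgomery backend would spend extra in-VM multiplications converting operands into and out of Montgomery form. A purely illustrative sketch (the `Fq` wrapper is hypothetical; `mulmod_256` refers to the helper sketched above):

```rust
// Illustrative only: a plain-representation bn254 base-field element whose
// multiplication maps directly onto one mulmod syscall, with no Montgomery
// conversions. `mulmod_256` is the helper from the earlier sketch.
const BN254_MODULUS: [u32; 8] = [
    // bn254 base-field prime, little-endian u32 limbs
    0xd87cfd47, 0x3c208c16, 0x6871ca8d, 0x97816a91,
    0x8181585d, 0xb85045b6, 0xe131a029, 0x30644e72,
];

#[derive(Clone, Copy)]
struct Fq([u32; 8]); // canonical (non-Montgomery) limbs

impl Fq {
    fn mul(self, other: Fq) -> Fq {
        Fq(mulmod_256(&self.0, &other.0, &BN254_MODULUS))
    }
}
```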
@puma314 Should the revm profiling example code be done here or in the revm codebase?
I got some rough numbers (single sample) generated with https://github.com/m-kus/sp1-bn254-benchmark:

| Operation | # cycles |
|---|---|
| G1 decoding (uncompressed) | 10,194 |
| G1 encoding | 119,182 |
| G1 addition | 30,034 |
| G1 multiplication | 1,735,257 |
| G2 decoding (uncompressed) | 27,958,488 |
| G2 encoding | 162,243 |
| G2 addition | 104,122 |
| G2 multiplication | 31,221,329 |
| Miller loop | 33,175,157 |
| Final exponentiation | 45,323,747 |
Substrate BN is widely used (and will probably continue to be, since it's time/battle-tested), and ZK friendliness alone might not be enough incentive to swap to arkworks, given that it's extra work (I might be wrong, though).
I got similar results when using `substrate-bn`. I tried `arkworks` as-is, and also patched it using the approach mentioned above that relies on the `mulmod` precompile.

Here are the results (table copied from @m-kus for easier comparison):
| Operation | substrate-bn | arkworks | arkworks (patched) |
|---|---|---|---|
| G1 decoding (uncompressed) | 10,194 | n/a | n/a |
| G1 encoding | 119,182 | n/a | n/a |
| G1 addition | 30,034 | 17,134 | 3,964 |
| G1 multiplication | 1,735,257 | 4,494,212 (?) | 918,631 |
| G2 decoding (uncompressed) | 27,958,488 | n/a | n/a |
| G2 encoding | 162,243 | n/a | n/a |
| G2 addition | 104,122 | 49,403 | 17,109 |
| G2 multiplication | 31,221,329 | 13,965,273 | 4,667,998 |
| Miller loop | 33,175,157 | 16,487,911 | 6,391,641 |
| Final exponentiation | 45,323,747 | 16,875,843 | 8,165,645 |
Note: there's a precompile for short Weierstrass point addition, and when using it directly, the cycle count for "G1 addition" drops to ~1k.
I'll post a link to the benchmark repo in the coming days.
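Regarding the note above: the direct path to the point-addition precompile looks roughly like the following. The `syscall_bn254_add` name comes from `sp1-zkvm`, but the exact pointer types and limb layout (affine x||y as little-endian words) vary between versions and should be checked before use:

```rust
// Sketch only: calls the bn254 point-addition precompile directly on two
// affine G1 points, each assumed to be laid out as 16 u32 words (x limbs then
// y limbs, little-endian). The syscall is assumed to write the sum back into
// `p`; adjust the pointer casts to match the signature in your sp1-zkvm version.
use sp1_zkvm::syscalls::syscall_bn254_add;

fn g1_add_in_place(p: &mut [u32; 16], q: &[u32; 16]) {
    unsafe {
        syscall_bn254_add(p.as_mut_ptr(), q.as_ptr());
    }
}
```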
I was able to patch `substrate-bn` and switch from Montgomery form to the plain representation while keeping the same API. The final results are:
| Operation | substrate-bn | substrate-bn-sp1 (patched) |
|---|---|---|
| G1 decoding (uncompressed) | 10,194 | 2,022 |
| G1 encoding | 119,182 | 101,621 |
| G1 addition | 30,034 | 5,301 |
| G1 multiplication | 1,735,257 | 402,823 |
| G2 decoding (uncompressed) | 27,958,488 | 7,798,137 |
| G2 encoding | 162,243 | 117,460 |
| G2 addition | 104,122 | 31,819 |
| G2 multiplication | 31,221,329 | 8,644,149 |
| Miller loop | 33,175,157 | 9,449,627 |
| Final exponentiation | 45,323,747 | 14,574,801 |
| `revm_precompile::bn128::run_add` | 168,171 | 113,580 |
| `revm_precompile::bn128::run_mul` | 1,865,971 | 506,454 |
| `revm_precompile::bn128::run_pair` | 213,099,695 | 63,800,732 |
Patched `substrate-bn` crate: https://github.com/m-kus/substrate-bn-sp1/pull/1
Benchmark sources: https://github.com/m-kus/sp1-bn254-benchmark
We are working on this internally...