Possible circuit proving performance regression in #549
There appears to be a substantial performance regression on the proving benchmark caused by PR #549. I can't identify what code change could have possibly caused it, though, so I can't rule out that it's just my setup. However, it is consistently slower on my machine (M1 Mac) vs the previous commit.
```
Benchmarking sapling: Collecting 10 samples in estimated 10.525 s (10 iterations)
sapling                 time:   [1.0559 s 1.0753 s 1.0945 s]
                        change: [+18.403% +21.203% +23.974%] (p = 0.00 < 0.05)
                        Performance has regressed.
```
Interesting! I was expecting to see a performance improvement due to the bls12_381 speedups. I tested locally (Ryzen 9 5950X), and can reproduce both the expected (though small) speedup from bls12_381 0.6.0 -> 0.6.1, and the performance loss from #549.

bls12_381 speedup (2e72579bed1b25eacb44e14099107846716207c1, bls12_381 0.6.0 vs 0.6.1)
```
sapling-spend           time:   [539.74 ms 545.43 ms 550.58 ms]
                        change: [-5.6546% -3.8188% -1.5558%] (p = 0.00 < 0.05)
                        Performance has improved.
```
Regression after migrating to ff 0.12 (144512b54719762e1d95fe4b2cee0dd7eced5491, fully-updated either side)
```
sapling-spend           time:   [604.80 ms 612.58 ms 620.34 ms]
                        change: [+10.500% +12.312% +14.200%] (p = 0.00 < 0.05)
                        Performance has regressed.
```
My current suspicion is that this is caused by the migration from bitvec 0.22 to bitvec 1.
[Before / After flamegraph screenshots]
The small box above the highlighted Iterator::nth is <u64 as bitvec::store::BitStore>::load_value.
Confirmed:
- If I revert to bitvec 0.22 in bellman, I recover around a quarter of the lost performance.
- If I revert to bitvec 0.22 in ff as well, I recover basically all of the lost performance (the highlighted region in the above flamegraph disappears, and the difference between pre-ff-0.12 and post-ff-0.12-with-reverted-bitvec is within the noise threshold).
Here in bellman::multiexp is where we are being impacted by the performance of bitvec:
```rust
// Sort the bases into buckets
for (exp, density) in exponents.iter().zip(density_map.as_ref().iter()) {
    if density {
        let (exp_is_zero, exp_is_one) = {
            let (first, rest) = exp.split_first().unwrap();
            let rest_unset = rest.not_any();
            (!*first && rest_unset, *first && rest_unset)
        };

        if exp_is_zero {
            bases.skip(1)?;
        } else if exp_is_one {
            if handle_trivial {
                acc.add_assign_from_source(&mut bases)?;
            } else {
                bases.skip(1)?;
            }
        } else {
            let exp = exp
                .into_iter()
                .by_vals()
                .skip(skip as usize)
                .take(c as usize)
                .enumerate()
                .fold(0u64, |acc, (i, b)| acc + ((b as u64) << i));

            if exp != 0 {
                (&mut buckets[(exp - 1) as usize]).add_assign_from_source(&mut bases)?;
            } else {
                bases.skip(1)?;
            }
        }
    }
}
```
The core problem is that this code is invoked in parallel, and each parallel worker is iterating over each exponent multiple times (both to determine whether the exponent is zero or one, and to find the section of the exponent it needs to use).
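For a sense of scale, here is a minimal sketch of that repeated work (a sketch only: the plain &[bool] representation, the is_zero_or_one helper, and the window width of 11 are illustrative stand-ins, not bellman code). The zero-or-one check alone is a full scan of the exponent, and every parallel task repeats it:

```rust
// Sketch of the repeated classification work. `&[bool]` stands in for the
// bitvec-backed exponent representation; `is_zero_or_one` is a hypothetical
// helper, not a bellman API.
fn is_zero_or_one(bits: &[bool]) -> (bool, bool) {
    let (first, rest) = bits.split_first().unwrap();
    let rest_unset = rest.iter().all(|&b| !b); // full scan of the tail
    (!*first && rest_unset, *first && rest_unset)
}

fn main() {
    let bits = vec![false; 255]; // one 255-bit exponent
    let c = 11; // illustrative window width for the bucket pass
    let windows = (bits.len() + c - 1) / c; // one parallel task per window

    // Every task repeats the same scan before extracting its own window,
    // so a single exponent is classified `windows` times instead of once.
    let mut scans = 0;
    for _ in 0..windows {
        let (_is_zero, _is_one) = is_zero_or_one(&bits);
        scans += 1;
    }
    assert_eq!(scans, 24); // ceil(255 / 11)
    println!("{scans} classifications of one exponent");
}
```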
Of the regression that I see in my benchmarks, ~~I'm able to recover around half of it~~ by refactoring the multiexp API to precompute the zero-or-one check, so that it is shared across all parallel workers. EDIT: Nope, there was a bug in my edit; this only saves around 2% of the 12%.
Okay, with additional rewriting to cache the exponent chunks (so we iterate over each exponent's bits once instead of in every parallel worker), I've managed to halve the performance regression; a rough sketch of the idea follows the numbers below.
Without rewrite (same env as regression benchmark above)
```
sapling-spend           time:   [603.11 ms 611.38 ms 619.14 ms]
                        change: [+9.5364% +11.547% +13.687%] (p = 0.00 < 0.05)
                        Performance has regressed.
```
With rewrite
```
sapling-spend           time:   [572.68 ms 579.70 ms 587.35 ms]
                        change: [+4.2425% +6.1782% +8.1863%] (p = 0.00 < 0.05)
                        Performance has regressed.
```
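The shape of that rewrite is roughly the following sketch (not the actual bellman change: ExpKind, precompute, and the plain &[bool] representation are hypothetical stand-ins for the bitvec-backed exponents). Each exponent is classified and split into its c-bit windows in a single pass, before the parallel workers start, so the workers only index into the cached result:

```rust
/// Hypothetical precomputed form of one exponent: built once, then shared
/// read-only by every parallel worker.
#[derive(Clone, Debug, PartialEq)]
enum ExpKind {
    Zero,
    One,
    /// One value per c-bit window, little-endian bit order.
    Chunks(Vec<u64>),
}

fn precompute(bits: &[bool], c: usize) -> ExpKind {
    if bits.iter().all(|&b| !b) {
        return ExpKind::Zero;
    }
    if bits[0] && bits[1..].iter().all(|&b| !b) {
        return ExpKind::One;
    }
    // Single pass over the bits: fold each c-bit window into a u64.
    ExpKind::Chunks(
        bits.chunks(c)
            .map(|w| {
                w.iter()
                    .enumerate()
                    .fold(0u64, |acc, (i, &b)| acc + ((b as u64) << i))
            })
            .collect(),
    )
}

fn main() {
    // 6 = 0b110 in little-endian bits, split into 2-bit windows -> [2, 1]
    let six = [false, true, true, false];
    assert_eq!(precompute(&six, 2), ExpKind::Chunks(vec![2, 1]));
    assert_eq!(precompute(&[false; 4], 2), ExpKind::Zero);
    assert_eq!(precompute(&[true, false, false], 2), ExpKind::One);
}
```

Because the precomputed values are immutable, they can be handed to all the parallel workers without any of them re-walking the bits.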
Opened https://github.com/zkcrypto/bellman/pull/83 with the multiexp refactor.
And #551 now includes the change. @joebebel could you test that PR and see what kind of improvement you get?
sapling-spend-prove time: [914.88 ms 924.71 ms 933.64 ms], so that's what, about 14% faster?
If there is a performance regression in bitvec 1, I assume it can also be fixed in addition to refactoring multiexp?
Possibly, but something else appears to also be going on here. I tried manually reverting to bitvec 0.22.3 via patching, and while that does gain a bit of performance back, I still see around a 4% regression relative to before #549. So there's something else going on here that I don't yet understand.