
Possible circuit proving performance regression in #549

Open · joebebel opened this issue 2 years ago · 9 comments

There appears to be a substantial performance regression in the proving benchmark caused by PR #549. I can't identify what code change could plausibly have caused it, though, so I can't rule out that it's just my setup. However, the benchmark is consistently slower on my machine (M1 Mac) than at the previous commit.

Benchmarking sapling: Collecting 10 samples in estimated 10.525 s (10 iterations)
sapling                 time:   [1.0559 s 1.0753 s 1.0945 s]
                        change: [+18.403% +21.203% +23.974%] (p = 0.00 < 0.05)
                        Performance has regressed.

joebebel · May 06 '22 22:05

Interesting! I was expecting to see a performance improvement due to the bls12_381 speedups. I tested locally (Ryzen 9 5950X), and can reproduce both the expected (though small) speedup from bls12_381 0.6.0 -> 0.6.1, and the performance loss from #549.

bls12_381 speedup (2e72579bed1b25eacb44e14099107846716207c1, bls12_381 0.6.0 vs 0.6.1)

sapling-spend           time:   [539.74 ms 545.43 ms 550.58 ms]                        
                        change: [-5.6546% -3.8188% -1.5558%] (p = 0.00 < 0.05)
                        Performance has improved.

Regression after migrating to ff 0.12 (144512b54719762e1d95fe4b2cee0dd7eced5491, fully updated on either side)

sapling-spend           time:   [604.80 ms 612.58 ms 620.34 ms]                        
                        change: [+10.500% +12.312% +14.200%] (p = 0.00 < 0.05)
                        Performance has regressed.

str4d · May 06 '22 23:05

My current suspicion is that this is caused by the migration from bitvec 0.22 to bitvec 1.

Before

[flamegraph screenshot: before the bitvec migration]

After

[flamegraph screenshot: after the bitvec migration]

The small box above the highlighted Iterator::nth is <u64 as bitvec::store::BitStore>::load_value.
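To make the suspicion concrete: reading a single bit through a BitSlice goes through the BitStore and bit-ordering machinery (the load_value frame above), whereas a plain slice of limbs needs only a shift and a mask. A minimal illustration of the difference, not code from either crate, assuming a u64 store and the default Lsb0 ordering:

    use bitvec::prelude::*;

    // Reading one bit via bitvec: routed through the ordering layer and
    // <u64 as BitStore>::load_value on every access.
    fn nth_bit_bitvec(bits: &BitSlice<u64, Lsb0>, i: usize) -> bool {
        bits[i]
    }

    // Reading one bit from raw limbs: a shift and a mask.
    fn nth_bit_raw(limbs: &[u64], i: usize) -> bool {
        (limbs[i / 64] >> (i % 64)) & 1 == 1
    }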

str4d · May 06 '22 23:05

Confirmed:

  • If I revert to bitvec 0.22 in bellman, I recover around a quarter of the lost performance.
  • If I revert to bitvec 0.22 in ff as well, I recover basically all of the lost performance (the highlighted region in the above flamegraph disappears, and the difference between pre-ff-0.12 and post-ff-0.12-with-reverted-bitvec is within the noise threshold).

str4d · May 07 '22 00:05

Here in bellman::multiexp is where we are being impacted by the performance of bitvec:

        // Sort the bases into buckets
        for (exp, density) in exponents.iter().zip(density_map.as_ref().iter()) {
            if density {
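                // Fast paths: detect exponents equal to exactly 0 or 1.
                // Note that rest.not_any() scans the rest of the exponent.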
                let (exp_is_zero, exp_is_one) = {
                    let (first, rest) = exp.split_first().unwrap();
                    let rest_unset = rest.not_any();
                    (!*first && rest_unset, *first && rest_unset)
                };

                if exp_is_zero {
                    bases.skip(1)?;
                } else if exp_is_one {
                    if handle_trivial {
                        acc.add_assign_from_source(&mut bases)?;
                    } else {
                        bases.skip(1)?;
                    }
                } else {
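                    // General case: decode the c-bit window starting at bit
                    // `skip`, re-reading the exponent bit by bit.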
                    let exp = exp
                        .into_iter()
                        .by_vals()
                        .skip(skip as usize)
                        .take(c as usize)
                        .enumerate()
                        .fold(0u64, |acc, (i, b)| acc + ((b as u64) << i));

                    if exp != 0 {
                        (&mut buckets[(exp - 1) as usize]).add_assign_from_source(&mut bases)?;
                    } else {
                        bases.skip(1)?;
                    }
                }
            }
        }

The core problem is that this code is invoked in parallel, and each parallel worker is iterating over each exponent multiple times (both to determine whether the exponent is zero or one, and to find the section of the exponent it needs to use).

Of the regression that I see in my benchmarks, ~~I'm able to recover around half of it~~ by refactoring the multiexp API to precompute the zero-or-one check, so that it is shared across all parallel workers. EDIT: Nope, there was a bug in my edit; this only saves around 2% of the 12%.
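For concreteness, the shared check is essentially the zero-or-one classification from the snippet above, hoisted out of the workers; a hedged sketch with made-up names, not the actual bellman change:

    use bitvec::prelude::*;

    // Hypothetical: classify each exponent once, before fanning the work
    // out to the parallel workers, instead of re-checking it per worker.
    #[derive(Clone, Copy)]
    enum ExpKind {
        Zero,
        One,
        Other,
    }

    fn classify(exp: &BitSlice<u64, Lsb0>) -> ExpKind {
        let (first, rest) = exp.split_first().unwrap();
        match (*first, rest.not_any()) {
            (false, true) => ExpKind::Zero,
            (true, true) => ExpKind::One,
            _ => ExpKind::Other,
        }
    }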

str4d · May 07 '22 01:05

Okay, with additional rewriting to cache the exponent chunks (so we iterate over each exponent's bits once instead of in every parallel worker) I've managed to halve the performance regression.
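The shape of the rewrite, roughly (a sketch using a hypothetical helper, not the actual patch): decode each exponent into its c-bit windows once, up front, so the workers index into plain u64s instead of re-walking the BitSlice.

    use bitvec::prelude::*;

    // Hypothetical one-time pass over an exponent's bits: each worker can
    // then look up its window as windows[skip / c] instead of iterating.
    fn precompute_windows(exp: &BitSlice<u64, Lsb0>, c: usize) -> Vec<u64> {
        exp.chunks(c)
            .map(|chunk| {
                chunk
                    .iter()
                    .by_vals()
                    .enumerate()
                    .fold(0u64, |acc, (i, b)| acc + ((b as u64) << i))
            })
            .collect()
    }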

Without rewrite (same env as regression benchmark above)

sapling-spend           time:   [603.11 ms 611.38 ms 619.14 ms]                        
                        change: [+9.5364% +11.547% +13.687%] (p = 0.00 < 0.05)
                        Performance has regressed.

With rewrite

sapling-spend           time:   [572.68 ms 579.70 ms 587.35 ms]                        
                        change: [+4.2425% +6.1782% +8.1863%] (p = 0.00 < 0.05)
                        Performance has regressed.

str4d · May 07 '22 02:05

Opened https://github.com/zkcrypto/bellman/pull/83 with the multiexp refactor.

str4d · May 07 '22 02:05

And #551 now includes the change. @joebebel could you test that PR and see what kind of improvement you get?

str4d · May 07 '22 03:05

sapling-spend-prove time: [914.88 ms 924.71 ms 933.64 ms], so that's what, about 14% faster?

If there is a performance regression in bitvec 1, I assume it can also be fixed in addition to refactoring multiexp?

joebebel · May 07 '22 03:05

Possibly, but something else also appears to be going on here. I tried manually reverting to bitvec 0.22.3 via patching, and while that does gain back a bit of performance, I still see around a 4% regression relative to before #549 that I don't yet understand.

str4d · May 07 '22 16:05