
Possible circuit proving performance regression in #549

Open · joebebel opened this issue 2 years ago · 9 comments

There appears to be a substantial performance regression in the proving benchmark caused by PR #549. I can't identify what code change could plausibly have caused it, though, so I can't rule out that it's just my setup. However, the benchmark is consistently slower on my machine (M1 Mac) than at the previous commit.

Benchmarking sapling: Collecting 10 samples in estimated 10.525 s (10 iterations)
sapling                 time:   [1.0559 s 1.0753 s 1.0945 s]
                        change: [+18.403% +21.203% +23.974%] (p = 0.00 < 0.05)
                        Performance has regressed.

joebebel · May 06 '22 22:05

Interesting! I was expecting to see a performance improvement due to the bls12_381 speedups. I tested locally (Ryzen 9 5950X), and can reproduce both the expected (though small) speedup from bls12_381 0.6.0 -> 0.6.1, and the performance loss from #549.

bls12_381 speedup (2e72579bed1b25eacb44e14099107846716207c1, bls12_381 0.6.0 vs 0.6.1)

sapling-spend           time:   [539.74 ms 545.43 ms 550.58 ms]                        
                        change: [-5.6546% -3.8188% -1.5558%] (p = 0.00 < 0.05)
                        Performance has improved.

Regression after migrating to ff 0.12 (144512b54719762e1d95fe4b2cee0dd7eced5491, fully updated on either side)

sapling-spend           time:   [604.80 ms 612.58 ms 620.34 ms]                        
                        change: [+10.500% +12.312% +14.200%] (p = 0.00 < 0.05)
                        Performance has regressed.

str4d · May 06 '22 23:05

My current suspicion is that this is caused by the migration from bitvec 0.22 to bitvec 1.

Before

[flamegraph screenshot: before the bitvec migration]

After

[flamegraph screenshot: after the bitvec migration]

The small box above the highlighted Iterator::nth is <u64 as bitvec::store::BitStore>::load_value.
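To make the suspicion concrete: reading a single bit through a BitSlice goes through the BitStore and bit-ordering machinery (the load_value frame above), whereas a plain slice of limbs needs only a shift and a mask. A minimal illustration of the difference, not code from either crate, assuming a u64 store and the default Lsb0 ordering:

    use bitvec::prelude::*;

    // Reading one bit via bitvec: routed through the ordering layer and
    // <u64 as BitStore>::load_value on every access.
    fn nth_bit_bitvec(bits: &BitSlice<u64, Lsb0>, i: usize) -> bool {
        bits[i]
    }

    // Reading one bit from raw limbs: a shift and a mask.
    fn nth_bit_raw(limbs: &[u64], i: usize) -> bool {
        (limbs[i / 64] >> (i % 64)) & 1 == 1
    }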

str4d · May 06 '22 23:05

Confirmed:

  • If I revert to bitvec 0.22 in bellman, I recover around a quarter of the lost performance.
  • If I revert to bitvec 0.22 in ff as well, I recover basically all of the lost performance (the highlighted region in the above flamegraph disappears, and the difference between pre-ff-0.12 and post-ff-0.12-with-reverted-bitvec is within the noise threshold).

str4d · May 07 '22 00:05

Here in bellman::multiexp is where we are being impacted by the performance of bitvec:

        // Sort the bases into buckets
        for (exp, density) in exponents.iter().zip(density_map.as_ref().iter()) {
            if density {
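                // Fast paths: detect exponents equal to exactly 0 or 1.
                // Note that rest.not_any() scans the rest of the exponent.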
                let (exp_is_zero, exp_is_one) = {
                    let (first, rest) = exp.split_first().unwrap();
                    let rest_unset = rest.not_any();
                    (!*first && rest_unset, *first && rest_unset)
                };

                if exp_is_zero {
                    bases.skip(1)?;
                } else if exp_is_one {
                    if handle_trivial {
                        acc.add_assign_from_source(&mut bases)?;
                    } else {
                        bases.skip(1)?;
                    }
                } else {
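                    // General case: decode the c-bit window starting at bit
                    // `skip`, re-reading the exponent bit by bit.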
                    let exp = exp
                        .into_iter()
                        .by_vals()
                        .skip(skip as usize)
                        .take(c as usize)
                        .enumerate()
                        .fold(0u64, |acc, (i, b)| acc + ((b as u64) << i));

                    if exp != 0 {
                        (&mut buckets[(exp - 1) as usize]).add_assign_from_source(&mut bases)?;
                    } else {
                        bases.skip(1)?;
                    }
                }
            }
        }

The core problem is that this code is invoked in parallel, and each parallel worker is iterating over each exponent multiple times (both to determine whether the exponent is zero or one, and to find the section of the exponent it needs to use).

Of the regression that I see in my benchmarks, ~~I'm able to recover around half of it~~ by refactoring the multiexp API to precompute the zero-or-one check, so that it is shared across all parallel workers. EDIT: Nope, there was a bug in my edit; this only saves around 2% of the 12%.
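For concreteness, the shared check is essentially the zero-or-one classification from the snippet above, hoisted out of the workers; a hedged sketch with made-up names, not the actual bellman change:

    use bitvec::prelude::*;

    // Hypothetical: classify each exponent once, before fanning the work
    // out to the parallel workers, instead of re-checking it per worker.
    #[derive(Clone, Copy)]
    enum ExpKind {
        Zero,
        One,
        Other,
    }

    fn classify(exp: &BitSlice<u64, Lsb0>) -> ExpKind {
        let (first, rest) = exp.split_first().unwrap();
        match (*first, rest.not_any()) {
            (false, true) => ExpKind::Zero,
            (true, true) => ExpKind::One,
            _ => ExpKind::Other,
        }
    }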

str4d · May 07 '22 01:05

Okay, with additional rewriting to cache the exponent chunks (so we iterate over each exponent's bits once instead of in every parallel worker) I've managed to halve the performance regression.
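The shape of the rewrite, roughly (a sketch using a hypothetical helper, not the actual patch): decode each exponent into its c-bit windows once, up front, so the workers index into plain u64s instead of re-walking the BitSlice.

    use bitvec::prelude::*;

    // Hypothetical one-time pass over an exponent's bits: each worker can
    // then look up its window as windows[skip / c] instead of iterating.
    fn precompute_windows(exp: &BitSlice<u64, Lsb0>, c: usize) -> Vec<u64> {
        exp.chunks(c)
            .map(|chunk| {
                chunk
                    .iter()
                    .by_vals()
                    .enumerate()
                    .fold(0u64, |acc, (i, b)| acc + ((b as u64) << i))
            })
            .collect()
    }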

Without rewrite (same env as regression benchmark above)

sapling-spend           time:   [603.11 ms 611.38 ms 619.14 ms]                        
                        change: [+9.5364% +11.547% +13.687%] (p = 0.00 < 0.05)
                        Performance has regressed.

With rewrite

sapling-spend           time:   [572.68 ms 579.70 ms 587.35 ms]                        
                        change: [+4.2425% +6.1782% +8.1863%] (p = 0.00 < 0.05)
                        Performance has regressed.

str4d · May 07 '22 02:05

Opened https://github.com/zkcrypto/bellman/pull/83 with the multiexp refactor.

str4d · May 07 '22 02:05

And #551 now includes the change. @joebebel could you test that PR and see what kind of improvement you get?

str4d · May 07 '22 03:05

sapling-spend-prove time: [914.88 ms 924.71 ms 933.64 ms], so that's what, about 14% faster?

If there is a performance regression in bitvec 1, I assume it can also be fixed in addition to refactoring multiexp?

joebebel · May 07 '22 03:05

Possibly, but something else also appears to be going on here. I tried manually reverting to bitvec 0.22.3 via patching, and while that does gain back a bit of performance, I still see around a 4% regression relative to before #549 that I don't yet understand.

str4d · May 07 '22 16:05