refactor(bb): optimize partially_evaluate parallelization in sumcheck
## Summary

Optimizes the `partially_evaluate` method in sumcheck with multiple improvements to the parallelization strategy, code clarity, and compile-time optimization, resulting in significant performance gains.
## Changes

### Core Optimizations
- **Wave-based parallelization for in-place updates**: implements exponentially growing waves to avoid read-write conflicts when source == destination (see the sketch after this item):
  - Wave 0: process `pep[0]` alone (reads `poly[0..1]`)
  - Wave 1: process `pep[1]` alone (reads `poly[2..3]`)
  - Wave 2: process `pep[2..3]` in parallel (reads `poly[4..7]`)
  - Wave k: process `pep[2^(k-1) .. 2^k - 1]` in parallel (reads `poly[2^k .. 2^(k+1) - 1]`, which no earlier wave has overwritten)
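A minimal sketch of the wave schedule under stated assumptions: `pep`, `poly`, `challenge`, and the `double` stand-in for the field type are all illustrative, not the actual barretenberg code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative wave schedule for the in-place case: pep aliases poly, and
// writing pep[i] reads poly[2i] and poly[2i+1]. For wave k, i ranges over
// [2^(k-1), 2^k), so the reads land in [2^k, 2^(k+1)), a region no earlier
// wave has overwritten.
void partially_evaluate_in_place(std::vector<double>& poly, double challenge)
{
    double* pep = poly.data(); // destination aliases the source
    const size_t half_size = poly.size() / 2;
    if (half_size == 0) {
        return;
    }
    pep[0] = poly[0] + challenge * (poly[1] - poly[0]); // wave 0
    for (size_t wave_begin = 1; wave_begin < half_size; wave_begin *= 2) {
        const size_t wave_end = std::min(2 * wave_begin, half_size);
        // Entries within a wave are independent of one another, so this
        // inner loop is the part that can be handed to the thread pool.
        for (size_t i = wave_begin; i < wave_end; ++i) {
            pep[i] = poly[2 * i] + challenge * (poly[2 * i + 1] - poly[2 * i]);
        }
    }
}
```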
- **Compile-time parallelization control**: added a `disjoint_src_dst` template parameter for explicit control (see the sketch after this item):
  - The template parameter enables compile-time branch elimination for better optimization
  - First round: passes `true` (source != destination, use simple parallelization)
  - Subsequent rounds: uses the default `false` (source == destination, use wave parallelization)
  - Clear naming: the parameter indicates when the buffers are guaranteed disjoint
  - The documentation includes a warning about data races if used incorrectly
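A hedged sketch of the compile-time control; the real `partially_evaluate` signature differs, and `src`, `dst`, and the `double` element type are illustrative.

```cpp
#include <cstddef>

// Illustrative only: if constexpr eliminates the untaken branch at compile
// time, so each instantiation carries exactly one strategy and no runtime
// check. Callers may only pass disjoint_src_dst = true when src and dst
// truly do not overlap; otherwise the flat loop races with itself.
template <bool disjoint_src_dst = false>
void partially_evaluate(const double* src, double* dst, size_t half_size, double u)
{
    if constexpr (disjoint_src_dst) {
        // First round: src is full_polynomials, dst is
        // partially_evaluated_polynomials, so every entry is independent
        // and a single flat (parallelizable) loop is safe.
        for (size_t i = 0; i < half_size; ++i) {
            dst[i] = src[2 * i] + u * (src[2 * i + 1] - src[2 * i]);
        }
    } else {
        // Subsequent rounds: dst aliases src, so fall back to the wave
        // schedule from the previous sketch.
    }
}
```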
- **Threshold-based parallelization**: only use `parallel_for` when the wave size is >= 32 (see the sketch after this item):
  - Small wave sizes (1, 2, 4, 8, 16) execute sequentially to avoid overhead
  - The threshold of 32 was determined through benchmark testing of 16, 32, 64, and 128
  - The optimal threshold provides the best balance between parallelization benefits and overhead
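A hedged sketch of the threshold gate: the `parallel_for` declaration is an assumption standing in for barretenberg's thread-pool helper, and `run_wave` / `process_entry` are illustrative names.

```cpp
#include <cstddef>
#include <functional>

// Assumed declaration of bb's thread-pool helper (defined in the real code).
void parallel_for(size_t num_iterations, const std::function<void(size_t)>& func);

constexpr size_t PARALLEL_THRESHOLD = 32;

void run_wave(size_t wave_begin, size_t wave_size,
              const std::function<void(size_t)>& process_entry)
{
    if (wave_size >= PARALLEL_THRESHOLD) {
        // Large wave: the per-entry work amortizes the dispatch cost.
        parallel_for(wave_size, [&](size_t j) { process_entry(wave_begin + j); });
    } else {
        // Waves of size 1, 2, 4, 8, 16: run inline, since spawning tasks
        // would cost more than the work itself.
        for (size_t j = 0; j < wave_size; ++j) {
            process_entry(wave_begin + j);
        }
    }
}
```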
- **AVM flavor support**: special-case parallelization by polynomials for AVM (see the sketch after this item):
  - AVM requires a different parallelization strategy due to its unique polynomial structure
  - Uses compile-time `isAvmFlavor<Flavor>` detection
  - Maintains backward compatibility with the original AVM behavior
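A hedged sketch of the flavor dispatch: `isAvmFlavor<Flavor>` is named in this PR, but the stand-in trait, `AvmFlavor` type, and function name below are illustrative.

```cpp
#include <type_traits>

// Stand-in for bb's compile-time trait, true only for the AVM flavor.
struct AvmFlavor {};
template <typename Flavor>
constexpr bool isAvmFlavor = std::is_same_v<Flavor, AvmFlavor>;

template <typename Flavor> void dispatch_partial_evaluation()
{
    if constexpr (isAvmFlavor<Flavor>) {
        // AVM: parallelize across polynomials, one task per polynomial,
        // preserving the original AVM behavior.
    } else {
        // Other flavors: evaluation-index parallelization (flat or waves).
    }
}
```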
- **Code deduplication**: extract the shared computation into a `process_polynomial_range` lambda (see the sketch after this item):
  - Eliminates duplication between the parallel and sequential execution paths
  - Single source of truth for the partial evaluation computation
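A hedged sketch of the deduplication: the `parallel_for_range` declaration is an assumed stand-in for a range-based thread-pool helper, and `evaluate`, `pep`, `poly` are illustrative names.

```cpp
#include <cstddef>
#include <functional>

// Assumed declaration: splits [0, size) into chunks and invokes
// func(begin, end) per chunk on the thread pool.
void parallel_for_range(size_t size, const std::function<void(size_t, size_t)>& func);

void evaluate(double* pep, const double* poly, size_t size, double u, bool go_parallel)
{
    // Single source of truth: both paths run exactly this body.
    auto process_polynomial_range = [&](size_t begin, size_t end) {
        for (size_t i = begin; i < end; ++i) {
            pep[i] = poly[2 * i] + u * (poly[2 * i + 1] - poly[2 * i]);
        }
    };

    if (go_parallel) {
        parallel_for_range(size, process_polynomial_range); // parallel path
    } else {
        process_polynomial_range(0, size);                  // sequential path
    }
}
```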
- **Improved `max_poly_end_index` calculation**: use `std::ranges::max_element` instead of a manual loop for a cleaner, more functional style (see the sketch after this item)
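A minimal sketch of the `std::ranges::max_element` usage; the element type and the empty-input default are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Cleaner than a hand-rolled loop: let the standard library find the
// maximum end index; an empty input yields 0 here.
size_t max_poly_end_index(const std::vector<size_t>& end_indices)
{
    const auto it = std::ranges::max_element(end_indices);
    return it == end_indices.end() ? 0 : *it;
}
```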
### Code Quality

- Pass `ThreadChunk` by const reference instead of by value
- More explicit and maintainable calculations throughout
- Better documentation, with clear warnings about unsafe usage
## Benchmark Results

Performance data from `./scripts/benchmark_remote.sh chonk_bench "BB_BENCH=1 ./chonk_bench --benchmark_filter='Full/5'"`:
### Threshold Optimization Results

Tested thresholds: 16, 32, 64, 128. Optimal: `PARALLEL_THRESHOLD = 32` (first round 154.3ms, subsequent rounds 263.9ms, total 418.3ms).

| Threshold | Total   | vs. 32        |
|-----------|---------|---------------|
| 16        | 421.7ms | 3.4ms slower  |
| 32        | 418.3ms | baseline      |
| 64        | 419.0ms | 0.7ms slower  |
| 128       | 423.8ms | 5.5ms slower  |
### Overall Performance Impact

First round ("rest of sumcheck round 1"):

- Time: 255.1ms → 155.6ms
- Improvement: ~39.0% faster (99.5ms saved)
- Benefits from detecting source != destination and using simple parallelization instead of waves

Subsequent rounds (sumcheck loop):

- Time: 295.7ms → 271.1ms
- Improvement: ~8.3% faster (24.6ms saved)
- The wave-based approach is still required, but the optimized implementation with pre-computed limits and outer parallelization provides moderate gains

Total `partially_evaluate` savings: ~124ms across all rounds

- The first round delivers the majority of the gains (99.5ms) through simplified parallelization
- The main loop shows a consistent but more modest improvement (24.6ms) due to the continued need for wave-based coordination

Overall impact on "rest of sumcheck round 1": 225.4ms → 129.9ms (95.5ms saved, 42.4% faster)

- The `partially_evaluate` optimization is the primary driver of the first-round speedup
## Context

This PR addresses https://github.com/AztecProtocol/barretenberg/issues/1579, which identified that `partially_evaluate` could be optimized with better parallelization strategies. The key insights are:

- The first round can avoid the wave-based approach since it reads from `full_polynomials` and writes to `partially_evaluated_polynomials` (different memory)
- Subsequent rounds need waves to avoid conflicts when updating in place
- Small wave sizes benefit from sequential execution to avoid parallelization overhead
- The AVM flavor requires specialized parallelization by polynomials
- Compile-time template parameters enable better compiler optimization
## Testing

- All sumcheck tests pass (27 passed, 5 skipped)
- Benchmark validation confirms significant performance improvements
- Multiple threshold values were tested to find the optimal setting

Closes #1579
🤖 Generated with Claude Code
Issue number seems wrong? And the diff is huge because it's targeting master?
wow! 700K lines
My bad. I forgot to specify the base branch when I asked Claude to create the PR.
It would be good to get numbers on the claimed optimization. Also, could you confirm there are no regressions for the AVM? I think you should be able to run `HARDWARE_CONCURRENCY=16 yarn-project/scripts/run_test.sh bb-prover/src/avm_proving_tests/avm_bulk.test.ts` and get the (sumcheck) numbers as output before and after.
I did see a couple of ~100ms regressions on AVM, so I left AVM as is.