Benchmarks (umbrella)
Benchmark Organization in 5 Points
- Need for Diverse Benchmarking Infrastructure: We need a comprehensive benchmarking infrastructure that measures various aspects across different benchmarks, with an emphasis on developing shared vocabulary and terminology.
  - Next Steps: Develop and implement a diverse set of benchmarks and agree on common terminology for discussion.
- Static Analysis of Performance: Certain performance characteristics, such as the efficiency of the Lurk reduction algorithm, can be understood 'statically' from code changes and don't require testing on powerful machines.
  - Next Steps: Continue static analysis for algorithmic improvements and integrate findings into tests.
- Real-world Performance Measurement: We need to measure real-world performance of the proving pipeline, including proof time, throughput, hardware requirements, and parameter size.
  - Next Steps: Establish benchmarking infrastructure for real-world performance metrics and periodically validate them.
- Non-regression Testing: Non-regression testing is needed to prevent unexpected performance issues caused by changes in code or dependencies (a minimal benchmark sketch follows this list).
  - Next Steps: Implement non-regression tests that block merging if a negative performance impact is detected. #790
- FoldingConfig and NIVC Considerations: The introduction of NIVC means we must account for performance variations and benchmark different FoldingConfigs.
  - Next Steps: Experimentally determine the optimal FoldingConfig and develop benchmarks to measure the performance of each.
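To make the non-regression point concrete, here is a minimal sketch of a criterion benchmark that could back such a gate. `fibonacci_workload` is a placeholder, not a lurk-rs API; a real benchmark would evaluate (and eventually prove) a representative Lurk program.

```rust
// Minimal non-regression benchmark sketch using criterion.
// `fibonacci_workload` is a placeholder; swap in a representative
// Lurk evaluation/proving workload.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci_workload(n: u64) -> u64 {
    (0..n).fold((0u64, 1u64), |(a, b), _| (b, a.wrapping_add(b))).0
}

fn non_regression(c: &mut Criterion) {
    c.bench_function("placeholder_workload", |b| {
        b.iter(|| fibonacci_workload(black_box(30)))
    });
}

criterion_group!(benches, non_regression);
criterion_main!(benches);
```

CI could then run the benchmark against a saved baseline (criterion's `--save-baseline` / `--baseline` flags) and fail the job when a comparison tool such as `critcmp` reports a regression beyond an agreed threshold.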
Older Brainstorm: What to benchmark?
We need some good benchmarks exposing performance of many parts of the system so we can:
- optimize individual things we know are important (and measure)
- discover how best to use the system as it is, based on observed performance (which usage patterns are good vs. bad)
- track change over time and avoid accidental regressions, etc.
It would be good to have a general end-to-end benchmark that starts with Lurk source code (text), and ends with a successfully verified proof (first milestone: #378). We should then also have a corpus of 'interesting' input programs. This issue is about what the standard benchmark we run on each such program should look like:
We should make sure to measure each phase that happens with every proof (a hedged per-phase timing sketch follows the two lists below):
- creation of the store
- Parsing/reading of the source into the store.
- Hydration of the ScalarCache. Note: how the work splits between this and the previous step depends on the program, since some programs need to hash more eagerly (and some may do so even when they don't need to).
- Evaluation
- circuit synthesis
- witness generation (currently largely conflated with circuit synthesis)
- other per-proof and per-reduction (evaluation/proof) steps like constant ptr and global allocation creation
- Folding (more generally, the incremental per-circuit cost — which is 'folding' when using Nova but is proving of a single step circuit when using, e.g. Groth16).
- Final SNARK (which is 'aggregation' with Groth16/SnarkPack+)
- Verification
- any parameter loading/generation for folding, final SNARK, and verification
- saving of the store [not currently supported] — or at least those parts of it which are ephemeral and can't be regenerated (i.e. private data, new secrets and their relationship to commitments, etc.)
We should also measure the parts that can be factored out:
- public parameter generation
- prover key generation
- verifier key generation
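As a simplified illustration of per-phase measurement, here is a hedged sketch that times each stage with `std::time::Instant`. The `phase_*` functions are placeholders standing in for the real parsing, evaluation, proving, and verification entry points, which this issue does not pin down; the factored-out parts (public parameters, prover/verifier keys) would be timed the same way but reported separately.

```rust
use std::time::{Duration, Instant};

// Placeholder stages; each stands in for the corresponding pipeline phase.
fn phase_parse(source: &str) -> String { source.to_owned() }
fn phase_evaluate(program: &str) -> usize { program.len() }
fn phase_prove(result: usize) -> Vec<u8> { vec![0u8; result] }
fn phase_verify(proof: &[u8]) -> bool { !proof.is_empty() }

// Time a single phase and report its wall-clock duration.
fn time_phase<T>(label: &str, f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    let elapsed = start.elapsed();
    println!("{label}: {elapsed:?}");
    (out, elapsed)
}

fn main() {
    let source = "(+ 1 2)";
    let (program, _parse) = time_phase("parse + store creation", || phase_parse(source));
    let (result, _eval) = time_phase("evaluation", || phase_evaluate(&program));
    let (proof, _prove) = time_phase("synthesis + folding + final SNARK", || phase_prove(result));
    let (_ok, _verify) = time_phase("verification", || phase_verify(&proof));
}
```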
We should measure a range of parameters, though not necessarily every combination of all parameters, as the number grows. These should at least include:
- Number of frames per step circuit (`chunk_frame_count`); see the parameter-sweep sketch after this list.
- Backend
- Final SNARK (currently one option for Nova, but there will be choices in the future)
- Zero-knowledge yes/no (not a current option, but will be)
- [many more in the future] — noted here so we can design for that.
- GPU? [none | CUDA | OpenCL] / how many, what specs?
- CPU: how many cores, how much RAM, etc.
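For the parameter sweep, criterion's benchmark groups make it straightforward to run the same workload across several values of a parameter such as `chunk_frame_count` and compare them side by side. The workload below is a placeholder, not the real prover.

```rust
// Sketch of sweeping one parameter (`chunk_frame_count`) with criterion.
// `placeholder_prove` stands in for proving with that many frames per step circuit.
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn placeholder_prove(chunk_frame_count: usize) -> usize {
    (0..chunk_frame_count).map(|i| i.wrapping_mul(i)).sum()
}

fn sweep_chunk_frame_count(c: &mut Criterion) {
    let mut group = c.benchmark_group("prove");
    for &n in &[1usize, 5, 10, 100] {
        group.bench_with_input(BenchmarkId::new("chunk_frame_count", n), &n, |b, &n| {
            b.iter(|| placeholder_prove(n))
        });
    }
    group.finish();
}

criterion_group!(benches, sweep_chunk_frame_count);
criterion_main!(benches);
```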
Things we should measure for the above:
- time
- memory usage
- CPU usage (total, but also allocation: are we making good use of multiple cores?)
- data size (concrete size of things like cached parameters, proofs, initial store, final store, serialized/disk stores, etc.)
- number of hashes (per arity); a counter sketch follows this list
- more fine-grained hashing (by type, etc.)
- other fine-grained events in the store
- [x] #378
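For hash counting, one low-tech option (if the store doesn't already expose such statistics) is a set of atomic counters bumped from the hashing routines. The sketch below assumes nothing about lurk-rs internals and only shows the counting mechanism; where `record_hash` would be called from is an assumption, not an existing hook.

```rust
// Sketch of per-arity hash counters; the call sites for `record_hash`
// (inside the store's hashing code) are assumed, not existing hooks.
use std::sync::atomic::{AtomicUsize, Ordering};

// One slot per hash arity we care about tracking.
static HASH_COUNTS: [AtomicUsize; 4] = [
    AtomicUsize::new(0),
    AtomicUsize::new(0),
    AtomicUsize::new(0),
    AtomicUsize::new(0),
];

/// Bump the counter for the given arity slot (called from hashing code).
fn record_hash(arity_slot: usize) {
    HASH_COUNTS[arity_slot].fetch_add(1, Ordering::Relaxed);
}

/// Snapshot all counters at the end of a run.
fn hash_report() -> Vec<usize> {
    HASH_COUNTS.iter().map(|c| c.load(Ordering::Relaxed)).collect()
}

fn main() {
    record_hash(0);
    record_hash(0);
    record_hash(2);
    println!("hash counts by arity slot: {:?}", hash_report());
}
```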
Update:
- We have a synthesis benchmark from #415.