zkevm-circuits icon indicating copy to clipboard operation
zkevm-circuits copied to clipboard

About performance characteristics of prover

Open pinkiebell opened this issue 2 years ago • 3 comments

Here some bullet points I mentioned:

  • Most of the time the prover only utilizes a single core. (generating params & proving key)?
  • Once multiple cores are used (halo2) the cores are only busy 1/3 of the time. I can dig deeper into this and profile the process but I guess this happens due:
    1. cache line starving
    2. ALU pipeline is full

Given a 96 core machine the load average is only about ~10 in aggregate. I'm speculating that point 2 is the biggest problem here because proofing on a M1 machine is considerably faster and one of the biggest differences are the much wider pipeline.

However, at the moment the memory consumption is the biggest problem.

pinkiebell avatar Dec 02 '22 09:12 pinkiebell

It's a bit weird that prover which should be kind of cpu bound application therefore cpu ideally should be high utilized. Checked Halo2 repo do not found lock or sync package related. Maybe can try some profiling tool to further narrow down the issue

hero78119 avatar Dec 02 '22 10:12 hero78119

It's a bit weird that prover which should be kind of cpu bound application therefore cpu ideally should be high utilized. Checked Halo2 repo do not found lock or sync package related. Maybe can try some profiling tool to further narrow down the issue

They are highly utilized in short bursts but even tho it uses all cores, most of the cores are only used 1/3 of their available cycles and I think that the ALU pipeline may be the culprit here. That is, the ALU pipeline is full and the core sits idle and waiting for other instructions. However I will dig deeper into that soon 👍

pinkiebell avatar Dec 02 '22 10:12 pinkiebell

Some preliminary data, this was tested on a amd epyc machine (AWS r6a.48xlarge). I still have to figure out all the available performance event counters - most of them are specific to the cpu.

2022-12-05T18:46:17.7620132Z 	Command being timed: "perf stat -o PROVER_PERF.txt -e stalled-cycles-backend -e stalled-cycles-frontend -e instructions -e branch-instructions -e ic_fetch_stall.ic_stall_any -e ic_fetch_stall.ic_stall_back_pressure -e ic_fetch_stall.ic_stall_dq_empty -e sse_avx_stalls -e all_data_cache_accesses -e all_tlbs_flushed -e l1_data_cache_fills_all -e fp_ret_sse_avx_ops.all -e l1_data_cache_fills_all -e l2_cache_accesses_from_dc_misses -e l2_cache_accesses_from_ic_misses -e ic_tag_hit_miss.all_instruction_cache_accesses -e ic_tag_hit_miss.instruction_cache_hit -e ic_tag_hit_miss.instruction_cache_miss -- cargo run --release --bin prover_rpcd"
2022-12-05T18:46:17.7621106Z 	User time (seconds): 121359.60
2022-12-05T18:46:17.7621376Z 	System time (seconds): 9661.61
2022-12-05T18:46:17.7621889Z 	Percent of CPU this job got: 725%
2022-12-05T18:46:17.7622194Z 	Elapsed (wall clock) time (h:mm:ss or m:ss): 5:00:59
2022-12-05T18:46:17.7622461Z 	Average shared text size (kbytes): 0
2022-12-05T18:46:17.7622994Z 	Average unshared data size (kbytes): 0
2022-12-05T18:46:17.7623235Z 	Average stack size (kbytes): 0
2022-12-05T18:46:17.7623472Z 	Average total size (kbytes): 0
2022-12-05T18:46:17.7623726Z 	Maximum resident set size (kbytes): 1034919124
2022-12-05T18:46:17.7623952Z 	Average resident set size (kbytes): 0
2022-12-05T18:46:17.7624234Z 	Major (requiring I/O) page faults: 22
2022-12-05T18:46:17.7624911Z 	Minor (reclaiming a frame) page faults: 1603307850
2022-12-05T18:46:17.7625148Z 	Voluntary context switches: 26380990
2022-12-05T18:46:17.7625398Z 	Involuntary context switches: 2754680
2022-12-05T18:46:17.7625712Z + [ 101 -eq 0 ]
2022-12-05T18:46:17.7625969Z + mkdir errors
2022-12-05T18:46:17.7626622Z 	Swaps: 0
2022-12-05T18:46:17.7626933Z 	File system inputs: 0
2022-12-05T18:46:17.7627243Z 	File system outputs: 120
2022-12-05T18:46:17.7627590Z 	Socket messages sent: 0
2022-12-05T18:46:17.7627965Z 	Socket messages received: 0
2022-12-05T18:46:17.7628270Z 	Signals delivered: 0
2022-12-05T18:46:17.7628600Z 	Page size (bytes): 4096
2022-12-05T18:46:17.7628944Z 	Exit status: 0
2022-12-05T18:46:17.7629255Z # started on Mon Dec  5 13:45:17 2022
2022-12-05T18:46:17.7629435Z 
2022-12-05T18:46:17.7629470Z 
2022-12-05T18:46:17.7629799Z  Performance counter stats for 'cargo run --release --bin prover_rpcd':
2022-12-05T18:46:17.7630085Z 
2022-12-05T18:46:17.7630474Z      5284758515953      stalled-cycles-backend           #    0.00% backend cycles idle      (33.31%)
2022-12-05T18:46:17.7631102Z      2739402716197      stalled-cycles-frontend                                              (33.33%)
2022-12-05T18:46:17.7631531Z    575542245678316      instructions                                                       
2022-12-05T18:46:17.7631920Z                                                   #    0.01  stalled cycles per insn  (33.35%)
2022-12-05T18:46:17.7632495Z     41897380129375      branch-instructions                                                  (33.37%)
2022-12-05T18:46:17.7633171Z    338337575660554      ic_fetch_stall.ic_stall_any                                          (33.38%)
2022-12-05T18:46:17.7633602Z      4335218733455      ic_fetch_stall.ic_stall_back_pressure                                     (33.38%)
2022-12-05T18:46:17.7634018Z      2277872242781      ic_fetch_stall.ic_stall_dq_empty                                     (33.37%)
2022-12-05T18:46:17.7634422Z           63222699      sse_avx_stalls                                                       (33.36%)
2022-12-05T18:46:17.7634808Z    151780850257552      all_data_cache_accesses                                              (33.35%)
2022-12-05T18:46:17.7635274Z           14298839      all_tlbs_flushed                                                     (33.35%)
2022-12-05T18:46:17.7635649Z      4425884141803      l1_data_cache_fills_all                                              (33.34%)
2022-12-05T18:46:17.7636057Z           33615489      fp_ret_sse_avx_ops.all                                               (27.78%)
2022-12-05T18:46:17.7636455Z      4264140628590      l1_data_cache_fills_all                                              (27.77%)
2022-12-05T18:46:17.7636854Z      5140526723476      l2_cache_accesses_from_dc_misses                                     (27.76%)
2022-12-05T18:46:17.7637283Z       202606229756      l2_cache_accesses_from_ic_misses                                     (27.74%)
2022-12-05T18:46:17.7637721Z                  0      ic_tag_hit_miss.all_instruction_cache_accesses                                     (27.73%)
2022-12-05T18:46:17.7638195Z                  0      ic_tag_hit_miss.instruction_cache_hit                                     (27.73%)
2022-12-05T18:46:17.7638623Z                  0      ic_tag_hit_miss.instruction_cache_miss                                     (33.29%)
2022-12-05T18:46:17.7638870Z 
2022-12-05T18:46:17.7639005Z    18059.890712709 seconds time elapsed
2022-12-05T18:46:17.7639199Z 
2022-12-05T18:46:17.7639327Z   121359.604562000 seconds user
2022-12-05T18:46:17.7639598Z     9661.605648000 seconds sys
2022-12-05T18:46:17.7639773Z 

pinkiebell avatar Dec 06 '22 09:12 pinkiebell

Just a heads up, most of the low hanging improvements are now due to RLC in the zkevm-circuits configuration/synth step. After looking at profiling intervals it looks like that is a great bottleneck. The halo2 prover itself looks fine

pinkiebell avatar Feb 15 '23 09:02 pinkiebell