zkevm-circuits
About performance characteristics of the prover
Here are some bullet points I mentioned:
- Most of the time the prover only utilizes a single core (generating params & proving key?); see the timing sketch at the end of this comment.
- Once multiple cores are used (inside halo2), the cores are only busy about 1/3 of the time.
I can dig deeper into this and profile the process, but I guess this happens due to:
- cache-line starvation
- the ALU pipeline being full
Given a 96-core machine, the load average is only about ~10 in aggregate. I'm speculating that the second point is the biggest problem here, because proving on an M1 machine is considerably faster, and one of the biggest differences is its much wider pipeline.
However, at the moment the memory consumption is the biggest problem.
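To confirm which phases actually run single-threaded (the first bullet above), one option is to time each prover stage separately and watch per-core usage while each stage runs. A minimal sketch (the `timed` helper is hypothetical, and the halo2 calls are left as placeholder comments):

```rust
use std::time::Instant;

/// Hypothetical helper: time each prover phase separately so the
/// single-threaded phases stand out when compared against core usage
/// in `htop` while the phase runs.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    eprintln!("{label}: {:.2?}", start.elapsed());
    out
}

fn main() {
    // Placeholders for the real halo2 calls; swap in the actual
    // params setup, keygen_vk/keygen_pk and create_proof invocations.
    timed("param generation", || { /* ParamsKZG::setup(...) */ });
    timed("proving key generation", || { /* keygen_pk(...) */ });
    timed("proof creation", || { /* create_proof(...) */ });
}
```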
It's a bit weird: the prover should be a CPU-bound application, so CPU utilization should ideally be high. I checked the halo2 repo and didn't find anything lock- or sync-related. Maybe some profiling tools can help narrow down the issue further.
They are highly utilized in short bursts, but even though it uses all cores, most of the cores are only used for about 1/3 of their available cycles, and I think the ALU pipeline may be the culprit here. That is, the ALU pipeline is full and the core sits idle, waiting on other instructions. I will dig deeper into that soon 👍
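One cheap experiment on the utilization front: halo2's multicore support dispatches work through rayon, which also honors the `RAYON_NUM_THREADS` environment variable, so capping the thread pool and comparing wall time at different counts should show whether the threads are starving each other. A sketch, assuming the global pool can be initialized before any proving work starts:

```rust
fn main() {
    // Cap rayon's global pool; halo2 parallelizes through rayon, so this
    // bounds the prover's parallelism. Try e.g. 96 / 48 / 24 and compare
    // wall time and per-core utilization.
    rayon::ThreadPoolBuilder::new()
        .num_threads(24)
        .build_global()
        .expect("global rayon pool was already initialized");

    // ...then run the prover as usual (placeholder)...
}
```

If the wall time barely changes as the thread count drops from 96 to 48, that would point at memory bandwidth or pipeline stalls rather than useful parallel work.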
Some preliminary data: this was tested on an AMD EPYC machine (AWS r6a.48xlarge). I still have to figure out all the available performance event counters; most of them are specific to the CPU.
```
Command being timed: "perf stat -o PROVER_PERF.txt -e stalled-cycles-backend -e stalled-cycles-frontend -e instructions -e branch-instructions -e ic_fetch_stall.ic_stall_any -e ic_fetch_stall.ic_stall_back_pressure -e ic_fetch_stall.ic_stall_dq_empty -e sse_avx_stalls -e all_data_cache_accesses -e all_tlbs_flushed -e l1_data_cache_fills_all -e fp_ret_sse_avx_ops.all -e l1_data_cache_fills_all -e l2_cache_accesses_from_dc_misses -e l2_cache_accesses_from_ic_misses -e ic_tag_hit_miss.all_instruction_cache_accesses -e ic_tag_hit_miss.instruction_cache_hit -e ic_tag_hit_miss.instruction_cache_miss -- cargo run --release --bin prover_rpcd"
User time (seconds): 121359.60
System time (seconds): 9661.61
Percent of CPU this job got: 725%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:00:59
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1034919124
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 22
Minor (reclaiming a frame) page faults: 1603307850
Voluntary context switches: 26380990
Involuntary context switches: 2754680
Swaps: 0
File system inputs: 0
File system outputs: 120
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
# started on Mon Dec 5 13:45:17 2022

Performance counter stats for 'cargo run --release --bin prover_rpcd':

    5284758515953    stalled-cycles-backend                    # 0.00% backend cycles idle (33.31%)
    2739402716197    stalled-cycles-frontend                   (33.33%)
  575542245678316    instructions                              # 0.01 stalled cycles per insn (33.35%)
   41897380129375    branch-instructions                       (33.37%)
  338337575660554    ic_fetch_stall.ic_stall_any               (33.38%)
    4335218733455    ic_fetch_stall.ic_stall_back_pressure     (33.38%)
    2277872242781    ic_fetch_stall.ic_stall_dq_empty          (33.37%)
         63222699    sse_avx_stalls                            (33.36%)
  151780850257552    all_data_cache_accesses                   (33.35%)
         14298839    all_tlbs_flushed                          (33.35%)
    4425884141803    l1_data_cache_fills_all                   (33.34%)
         33615489    fp_ret_sse_avx_ops.all                    (27.78%)
    4264140628590    l1_data_cache_fills_all                   (27.77%)
    5140526723476    l2_cache_accesses_from_dc_misses          (27.76%)
     202606229756    l2_cache_accesses_from_ic_misses          (27.74%)
                0    ic_tag_hit_miss.all_instruction_cache_accesses (27.73%)
                0    ic_tag_hit_miss.instruction_cache_hit     (27.73%)
                0    ic_tag_hit_miss.instruction_cache_miss    (33.29%)

  18059.890712709 seconds time elapsed

  121359.604562000 seconds user
    9661.605648000 seconds sys
```
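As a sanity check on those numbers, dividing total CPU time by wall time gives the average number of busy cores:

$$\frac{121359.6\ \text{(user)} + 9661.6\ \text{(sys)}}{18059.9\ \text{(wall)}} \approx 7.25$$

That matches the "Percent of CPU this job got: 725%" line above and is consistent with the ~10 aggregate load mentioned earlier: on a 96-core machine, this is well under 10% average utilization.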
Just a heads up: most of the low-hanging improvements are now in the RLC work done during the zkevm-circuits configuration/synthesis step. Looking at profiling intervals, that appears to be the major bottleneck. The halo2 prover itself looks fine.
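For context on why the RLC step is costly: the random linear combination folds a byte string into a single field element, one field multiplication and addition per byte, and synthesis evaluates it over a lot of data. A minimal sketch of the underlying arithmetic (assuming a recent `ff` crate; this is not the actual zkevm-circuits gadget):

```rust
use ff::PrimeField;

/// Fold `bytes` into one field element with challenge `r`:
///   rlc = b_0 * r^(n-1) + b_1 * r^(n-2) + ... + b_(n-1)
/// evaluated via Horner's rule, so each byte costs one field mul + add.
fn rlc<F: PrimeField>(bytes: &[u8], r: F) -> F {
    bytes
        .iter()
        .fold(F::ZERO, |acc, &b| acc * r + F::from(u64::from(b)))
}
```

Since this runs natively during witness assignment rather than in-circuit, the per-byte field multiplications are pure CPU work, which is consistent with RLC showing up as a synthesis-time hotspot in the profiles.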