snarkOS
[Bug] Low prover CPU utilization
🐛 Bug Report
A single snarkOS prover process averages only about 50-60% CPU utilization.
Steps to Reproduce
- Just run ./run-prover.sh
- Start htop to monitor
Expected Behavior
95-100% utilization
The same problem actually existed in testnet2: https://github.com/AleoHQ/snarkOS/issues/1346
The thing is that the code often uses par_iter for parallelism. But in practice you can't know which coinbase_puzzle_loop thread will use how many par_iter threads at any given moment. So a better strategy for performance is to give each coinbase_puzzle_loop its own ThreadPool with a small number of threads.
From our tests on an AMD EPYC 7502P 32-Core (64 threads):
Default snarkOS code: 50-60% CPU utilization and 50-80 c/s.
With the new strategy: 20 coinbase_puzzle_loop threads, each with its own thread pool of 4 threads, gives an increase up to 145 c/s, so almost double. Utilization is then 95-98%, and it's controllable through the number of coinbase_puzzle_loop threads.
My experiments with the prover indicate that dropping parallelization within each thread pool (i.e. one thread per pool) is faster. This is viable because, without parallelization, every core dedicated to the process can be fully used without waiting at choke points, and each KZG10 is short enough that running it sequentially barely hurts performance.
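To make the "one thread per pool" idea concrete, here is a minimal std-only sketch. It is not snarkOS or snarkVM code: `attempt` is a made-up stand-in for one sequential puzzle attempt, and `run_sequential_loops` is a hypothetical helper name. Each loop runs on its own OS thread with no inner parallelism and stops once a shared shutdown flag is set.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for one sequential puzzle attempt (not the snarkVM API).
fn attempt(nonce: u64) -> u64 {
    // A few rounds of integer mixing so the call does a little CPU work.
    let mut x = nonce.wrapping_add(0x9E37_79B9_7F4A_7C15);
    for _ in 0..1_000 {
        x ^= x >> 30;
        x = x.wrapping_mul(0xBF58_476D_1CE4_E5B9);
    }
    x
}

/// Runs `instances` independent loops, one OS thread each, for `run_for`,
/// and returns the total number of completed attempts.
fn run_sequential_loops(instances: usize, run_for: Duration) -> u64 {
    let shutdown = Arc::new(AtomicBool::new(false));
    let attempts = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..instances)
        .map(|i| {
            let shutdown = Arc::clone(&shutdown);
            let attempts = Arc::clone(&attempts);
            thread::spawn(move || {
                let mut nonce = i as u64;
                // No inner parallelism: each loop keeps one core busy on its own.
                while !shutdown.load(Ordering::Relaxed) {
                    std::hint::black_box(attempt(nonce));
                    nonce = nonce.wrapping_add(instances as u64);
                    attempts.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    // Let the workers run, then signal shutdown and wait for all of them.
    thread::sleep(run_for);
    shutdown.store(true, Ordering::Relaxed);
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    attempts.load(Ordering::Relaxed)
}
```

Calling `run_sequential_loops(n, Duration::from_secs(10))` for a few values of `n` and dividing the result by 10 gives a rough attempts-per-second figure for this toy workload. It says nothing about real c/s, which depends on the actual prover.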
@HarukaMa would you be open to sharing a code snippet or sample PR showing how you are architecting this? We can then base a design around it.
Would be something like this:
// Number of independent coinbase_puzzle_loop instances.
let puzzle_instances = 20;
// Number of threads available to par_iter inside each instance.
let parallel_threads = 4;

// Build one dedicated thread pool per puzzle instance.
let mut thread_pools = Vec::new();
for _ in 0..puzzle_instances {
    thread_pools.push(Arc::new(
        ThreadPoolBuilder::new()
            .stack_size(8 * 1024 * 1024)
            .num_threads(parallel_threads)
            .build()
            .expect("Failed to create thread pool"),
    ));
}

// Run one puzzle loop on each pool.
for i in 0..thread_pools.len() {
    ...
    let task = thread_pools[i].spawn_async(move || {
        let mut rng = thread_rng();
        loop {
            ...
            let result = coinbase_puzzle
                .prove(&epoch_challenge, address, rng.gen(), Some(coinbase_target))
                .ok()
                .and_then(|solution| solution.to_target().ok().map(|solution_target| (solution_target, solution)));
            ...
            // If the Ctrl-C handler registered the signal, stop the prover.
            if self.shutdown.load(Ordering::Relaxed) {
                trace!("Shutting down the coinbase puzzle...");
                break;
            }
            ...
        }
    });
}
Sorry, I don't have ready-to-use code for snarkOS.
So maybe even:
let puzzle_instances = 64;
let parallel_threads = 1;
would be much better, as @HarukaMa said. I haven't tested it yet.
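As a quick sanity check on the two configurations mentioned in this thread, the total number of worker threads can be compared against the 64 hardware threads of the test machine. This is only back-of-the-envelope arithmetic; the function names below are made up for illustration and are not part of snarkOS.

```rust
/// Total worker threads created by `instances` pools of `threads_per_pool` threads.
fn total_workers(instances: usize, threads_per_pool: usize) -> usize {
    instances * threads_per_pool
}

/// Worker threads per hardware thread; values above 1.0 mean oversubscription.
fn load_factor(instances: usize, threads_per_pool: usize, hardware_threads: usize) -> f64 {
    total_workers(instances, threads_per_pool) as f64 / hardware_threads as f64
}
```

With 20 instances of 4 threads, `total_workers` is 80 and `load_factor` on 64 hardware threads is 1.25. With 64 instances of 1 thread, it is 64 and 1.0. Which of the two is actually faster still has to be measured; the only numbers reported above are for the 20 x 4 setup.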
@howardwu Take a look at https://github.com/HarukaMa/aleo-prover
@ljedrz We also use a thread pool approach in snarkVM to make things performant. Do you have recommendations on resource allocation between tokio and the thread pool to ensure snarkOS runs fairly?
Closing as most provers seem to have moved on to dedicated hardware. Please feel free to reopen if this continues to be a concern.