[Bug] Low prover CPU utilization

Open xbeastx opened this issue 2 years ago • 5 comments

🐛 Bug Report

A single snarkOS prover process averages only about 50-60% CPU utilization.

Steps to Reproduce

  • Just run ./run-prover.sh
  • Start htop to monitor

Expected Behavior

95-100% utilization

The same problem was in testnet2 actually: https://github.com/AleoHQ/snarkOS/issues/1346

The code relies heavily on par_iter for parallelism. In practice, though, you cannot know how many par_iter threads each coinbase_puzzle_loop thread will be using at any given moment. A better strategy for performance would be to give each coinbase_puzzle_loop its own ThreadPool with a small number of threads.

From our tests on an AMD EPYC 7502P 32-Core (64 threads):

  • Default snarkOS code: 50-60% CPU utilization and 50-80 c/s.
  • New strategy, with 20 coinbase_puzzle_loop threads, each in its own thread pool of 4 threads: up to 145 c/s, so almost double.

With the new strategy, utilization is 95-98%, and it is controllable through the number of coinbase_puzzle_loop threads.
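As a rough illustration of the thread budget behind those numbers, here is a minimal sketch that computes how many worker threads a given layout creates and how that compares with the hardware thread count. `PoolLayout` and its methods are hypothetical names made up for this example; they are not part of snarkOS or rayon.

```rust
use std::thread;

/// Hypothetical description of a prover layout (not a snarkOS type):
/// how many puzzle loops run, and how many threads each loop's pool gets.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PoolLayout {
    puzzle_instances: usize,
    threads_per_pool: usize,
}

impl PoolLayout {
    /// Total worker threads created across all pools.
    fn total_threads(&self) -> usize {
        self.puzzle_instances * self.threads_per_pool
    }

    /// Ratio of worker threads to hardware threads.
    /// A value above 1.0 means more workers than hardware threads.
    fn oversubscription(&self, hardware_threads: usize) -> f64 {
        self.total_threads() as f64 / hardware_threads as f64
    }
}

fn main() {
    // The layout from the test above: 20 loops, 4 threads per pool.
    let layout = PoolLayout { puzzle_instances: 20, threads_per_pool: 4 };

    // Fall back to 1 if the hardware thread count cannot be queried.
    let hardware_threads = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    println!(
        "{} worker threads on {} hardware threads (ratio {:.2})",
        layout.total_threads(),
        hardware_threads,
        layout.oversubscription(hardware_threads)
    );
}
```

On the 64-thread machine above, 20 pools of 4 threads come to 80 workers, a ratio of 1.25. Whether a ratio slightly above 1.0 is what actually helps here is a guess; it would need to be measured per machine.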

xbeastx avatar Dec 02 '22 21:12 xbeastx

My experiments with the prover indicate that not using parallelization within each thread pool (i.e., one thread per pool) is even faster. This works because, without parallelization, every core dedicated to the process can be fully used without waiting at choke points, and each KZG10 computation is short enough that running it serially barely hurts performance.
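A minimal sketch of the one-thread-per-instance idea, using only the standard library: each worker runs a purely serial loop and checks a shared shutdown flag, so there is no nested parallelism to wait on. `fake_prove` and `run_workers` are hypothetical stand-ins for illustration; the real work would be the coinbase puzzle's prove call, which is not reproduced here.

```rust
use std::hint::black_box;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for one serial proof attempt (NOT the snarkVM API).
/// It only mixes bits so that each call performs some CPU work.
fn fake_prove(nonce: u64) -> u64 {
    let mut x = nonce.wrapping_add(0x9E37_79B9_7F4A_7C15);
    for _ in 0..1_000 {
        x ^= x >> 30;
        x = x.wrapping_mul(0xBF58_476D_1CE4_E5B9);
        x ^= x >> 27;
    }
    x
}

/// Spawns `workers` OS threads. Each runs a serial loop until `shutdown`
/// is set or it has made `max_attempts_per_worker` attempts.
/// Returns the total number of attempts made by all workers.
fn run_workers(workers: usize, max_attempts_per_worker: u64, shutdown: Arc<AtomicBool>) -> u64 {
    let attempts = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|id| {
            let attempts = Arc::clone(&attempts);
            let shutdown = Arc::clone(&shutdown);
            thread::spawn(move || {
                for i in 0..max_attempts_per_worker {
                    // Stop as soon as shutdown has been requested.
                    if shutdown.load(Ordering::Relaxed) {
                        break;
                    }
                    // Give every worker its own nonce range.
                    let nonce = ((id as u64) << 32) | i;
                    black_box(fake_prove(nonce));
                    attempts.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    attempts.load(Ordering::Relaxed)
}

fn main() {
    // One serial worker per hardware thread; fall back to 1 if unknown.
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let shutdown = Arc::new(AtomicBool::new(false));
    let total = run_workers(workers, 100, shutdown);
    println!("{workers} workers made {total} attempts");
}
```

The attempt cap exists only so the sketch terminates; a real prover loop would run until the shutdown flag is set.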

HarukaMa avatar Dec 03 '22 03:12 HarukaMa

@HarukaMa would you be open to sharing a code snippet or sample PR showing how you are architecting this? We can then base a design around it.

howardwu avatar Dec 04 '22 19:12 howardwu

Would be something like this:

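// Number of independent coinbase puzzle loops, each with its own thread pool.
let puzzle_instances = 20;
// Number of worker threads in each pool.
let parallel_threads = 4;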

let mut thread_pools = Vec::new();
for _ in 0..puzzle_instances {
    thread_pools.push(Arc::new(
        ThreadPoolBuilder::new()
            .stack_size(8 * 1024 * 1024)
            .num_threads(parallel_threads)
            .build()
            .expect("Failed to create thread pool"),
    ));
}

for i in 0..thread_pools.len() {
    ...
    let task = thread_pools[i].spawn_async(move || {
        let mut rng = thread_rng();
        loop {
            ...
            let result = coinbase_puzzle
                .prove(&epoch_challenge, address, rng.gen(), Some(coinbase_target))
                .ok()
                .and_then(|solution| solution.to_target().ok().map(|solution_target| (solution_target, solution)));
            ...
            // If the Ctrl-C handler registered the signal, stop the prover.
            if self.shutdown.load(Ordering::Relaxed) {
                trace!("Shutting down the coinbase puzzle...");
                break;
            }
            ...
        }
    });
}

Sorry, I don't have ready-to-use code for snarkOS.

So maybe even:

let puzzle_instances = 64;
let parallel_threads = 1;

would be much better, as @HarukaMa said. I haven't tested it yet.

xbeastx avatar Dec 04 '22 20:12 xbeastx

@howardwu Take a look at https://github.com/HarukaMa/aleo-prover

extSunset avatar Dec 04 '22 22:12 extSunset

@ljedrz We also use a thread pool approach in snarkVM to make things performant. Do you have recommendations on resource allocation between tokio and the thread pool to ensure snarkOS runs fairly?

howardwu avatar Dec 04 '22 23:12 howardwu

Closing as most provers seem to have moved on to dedicated hardware. Please feel free to reopen if this continues to be a concern.

howardwu avatar Oct 09 '23 23:10 howardwu