Multi-thread testing results in error

Open SkymanOne opened this issue 1 year ago • 9 comments

Is there an existing issue?

  • [X] I have searched the existing issues

Experiencing problems? Have you tried our Discord first?

  • [X] This is not a support question.

Description of bug

When running multiple tests in production mode, where each test builds a prover with default_prover() and executes the same guest program on the same or different inputs, the local prover can either panic with:

thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [1109155563, 76776968, 726152720, 1606742656]
 right: [0, 0, 0, 0]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [1937466798, 422672686, 84982592, 642582437]
 right: [0, 0, 0, 0]
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [916425584, 1323635903, 560063291, 61845575]
 right: [0, 0, 0, 0]
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [283844281, 1796260247, 1438206498, 1271669024]
 right: [0, 0, 0, 0]
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [1889568656, 1020291148, 1642680298, 421080635]
 right: [0, 0, 0, 0]
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [414445105, 1970834655, 939333111, 991974128]
 right: [0, 0, 0, 0]
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [420338028, 1713299230, 424368403, 1629257559]
 right: [0, 0, 0, 0]

which points to https://github.com/risc0/risc0/blob/2ba504fddd84376235d335ec4db6b2353d967fc9/risc0/zkp/src/prove/prover.rs#L338

or fail to prove the program at prover.prove(...).unwrap() with "verification indicates proof is invalid".

I suspect this is because Rust runs tests on multiple threads by default, which causes problems with constraint generation: running cargo test -- --test-threads=1 resolves the issue.

Steps to reproduce

It is difficult to provide a deterministic reproducer since the issue depends solely on the CPU runtime. However, I managed to hit a similar error with the starter template by moving all the host prover-call code into the run() function and setting up the tests as:

#[cfg(test)]
mod test {
    use crate::run;

    #[test]
    fn test1() {
        run();
        run();
    }

    #[test]
    fn test2() {
        run();
    }

    #[test]
    fn test3() {
        run();
    }

    #[test]
    fn test4() {
        run();
    }

    #[test]
    fn test5() {
        run();
    }
}

and then running cargo test -r multiple times.

At some point you should get something like:

running 5 tests
test test::test3 ... FAILED
test test::test4 ... ok
test test::test5 ... ok
test test::test2 ... ok
test test::test1 ... ok

failures:

---- test::test3 stdout ----
thread 'test::test3' panicked at host/src/main.rs:36:10:
called `Result::unwrap()` on an `Err` value: verification indicates proof is invalid
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    test::test3

SkymanOne avatar Oct 30 '24 18:10 SkymanOne

My suspicion is that it is related to https://github.com/risc0/risc0/blob/54febb7df36f8406d9393cbc3184920a24e9db21/risc0/zkvm/src/host/server/prove/mod.rs#L241 since when a new Rc reference to a local segment prover is constructed in a separate thread, it causes a data race. Perhaps using Arc would address the issue.

SkymanOne avatar Oct 30 '24 18:10 SkymanOne

Which OS are you using? Assuming you are using the proving server binary and not overriding with the "prove" feature flag? Also assuming you are not running these tests in dev mode, is that correct?

My suspicion is that it is related to

https://github.com/risc0/risc0/blob/54febb7df36f8406d9393cbc3184920a24e9db21/risc0/zkvm/src/host/server/prove/mod.rs#L241

since when a new Rc reference to a local segment prover is constructed in a separate thread, it causes a data race. Perhaps using Arc would address the issue.

I don't think so. That value won't be shared across test threads, and if this were an Arc the behavior would be the same.

Rust blocks even moving the Rc across thread boundaries (example)
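
A minimal sketch of that compile-time guarantee, for illustration only (it is not taken from the risc0 codebase, and the function name sum_on_worker is hypothetical). An Arc can be moved into a spawned thread, while the commented-out Rc version is rejected by the compiler because Rc is not Send:

```rust
use std::sync::Arc;
use std::thread;

/// Sums a shared vector on a worker thread.
/// `Arc<T>` is `Send` when `T: Send + Sync`, so moving the clone
/// into the spawned closure compiles.
fn sum_on_worker(data: Arc<Vec<u32>>) -> u32 {
    let shared = Arc::clone(&data);
    thread::spawn(move || shared.iter().sum::<u32>())
        .join()
        .expect("worker thread panicked")
}

fn main() {
    let data = Arc::new(vec![1, 2, 3]);
    println!("sum = {}", sum_on_worker(data));

    // The Rc equivalent does not compile, because `Rc<T>` is not `Send`:
    //
    // let rc = std::rc::Rc::new(vec![1u32, 2, 3]);
    // thread::spawn(move || rc.iter().sum::<u32>());
    // // error[E0277]: `Rc<Vec<u32>>` cannot be sent between threads safely
}
```

So an Rc that is created and used within a single test thread cannot be the source of a cross-thread data race in safe Rust.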

austinabell avatar Oct 31 '24 00:10 austinabell

Which OS are you using? Assuming you are using the proving server binary and not overriding with the "prove" feature flag? Also assuming you are not running these tests in dev mode, is that correct?

I run macOS on Apple M1. I'm using version 1.1.2 of the risc0 crates. No feature overriding was done. The risc0 crates are used as they were generated by cargo risczero.

Tests are run in production mode, so full proof generation.

SkymanOne avatar Oct 31 '24 09:10 SkymanOne

I don't think so. That value won't be shared across test threads, and if this were an Arc the behavior would be the same.

Maybe it is not necessarily related to the Rc I pointed out. However, it is very likely that some global state is shared across multiple test threads and causes a data race. Running the tests makes different ones fail at random, even in the starter template.

SkymanOne avatar Oct 31 '24 09:10 SkymanOne

Which OS are you using? Assuming you are using the proving server binary and not overriding with the "prove" feature flag? Also assuming you are not running these tests in dev mode, is that correct?

I run macOS on Apple M1. I'm using version 1.1.2 of the risc0 crates. No feature overriding was done. The risc0 crates are used as they were generated by cargo risczero.

Tests are run in production mode, so full proof generation.

Running these in parallel causes the Apple GPU to run out of memory, which results in these errors. We run our proving tests serially to avoid this, and we suggest that you do the same for your workload.
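
Besides cargo test -- --test-threads=1 (mentioned in the issue description), one way to serialize only the proving tests is a process-wide lock. The following is a minimal sketch under stated assumptions: PROVER_LOCK and run_serialized are hypothetical names, run() here is a stand-in for the run() function from the reproducer, and this is not necessarily how the risc0 test suite does it.

```rust
use std::sync::{Mutex, PoisonError};
use std::thread;

// Process-wide lock (hypothetical name); every proving test takes it
// before touching the prover, so only one proof runs at a time.
static PROVER_LOCK: Mutex<()> = Mutex::new(());

/// Runs `f` while holding the global lock.
fn run_serialized<T>(f: impl FnOnce() -> T) -> T {
    // A test that panics while holding the lock poisons it. Recover the
    // guard so one failing test does not make every later test fail too.
    let _guard = PROVER_LOCK
        .lock()
        .unwrap_or_else(PoisonError::into_inner);
    f()
}

// Stand-in for the reproducer's `run()`; the real one would build a
// prover and generate a proof.
fn run() -> u32 {
    42
}

fn main() {
    // Four threads call `run` concurrently, but the lock lets only one
    // of them execute it at any moment.
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| run_serialized(run)))
        .collect();
    for handle in handles {
        assert_eq!(handle.join().expect("thread panicked"), 42);
    }
}
```

In a test module, each #[test] body would call run_serialized(run) instead of run(). Non-proving tests can still run in parallel, since only the wrapped calls contend for the lock.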

SchmErik avatar Nov 05 '24 07:11 SchmErik

I’m facing a similar issue with ECDSA signature verification inside the guest environment. Running this on a Mac with the "metal" feature, without any multithreading, results in a similar error.

To reproduce the issue, see the verifier-risczero directory at https://github.com/dymchenkko/oyster-monorepo/commit/5f2f817ae492a9c465e223c99e4b25706def06ce

dymchenkko avatar Nov 13 '24 09:11 dymchenkko

I observed a similar issue, appearing randomly for exactly the same input, while running a Bento proving cluster:

gpu_prove_agent3-1  | 2025-08-08T10:28:54.327987Z ERROR workflow: Failure during task processing: Prove failed
gpu_prove_agent3-1  |
gpu_prove_agent3-1  | Caused by:
gpu_prove_agent3-1  |     0: Failed to prove segment
gpu_prove_agent3-1  |     1: verification indicates proof is invalid

koxu1996 avatar Aug 08 '25 10:08 koxu1996

It also happens from time to time when proving locally:

Generating proof...

thread 'main' panicked at zkvm/risc0-host/src/main.rs:47:8:
called `Result::unwrap()` on an `Err` value: verify segment

Caused by:
    verification indicates proof is invalid
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Notably, my input uses HashMaps, so program execution is not deterministic.

koxu1996 avatar Sep 28 '25 09:09 koxu1996