o1js icon indicating copy to clipboard operation
o1js copied to clipboard

process hangs when verifying proof read&deserialized from file

Open Yoga2015 opened this issue 1 year ago • 22 comments

Original issue at Discord: https://discord.com/channels/484437221055922177/1204071024191082496

All the code at https://github.com/TokeniZK/zkapp-test/tree/main/src tried with o1js version: 0.15.2, 0.15.4, & 0.16.0

Reproduce steps:

I have a demo zkProgram(https://github.com/TokeniZK/zkapp-test/blob/main/src/zkprogram.ts),

step1: npm run step1 (seen at /src/step1.ts) I exec one method & gen proof, then serialize the proof into file: proof.json . and then read&deserialize it from file and verify it and smoothly get the result: True.

However, step2: (seen at /src/step2.ts) npm run step2 I start another process to read the proof from the file on step1(proof.json), deserialize it and try to verify the proof, But the process keep await at verify(*) line !!

Yoga2015 avatar Feb 05 '24 19:02 Yoga2015

hi, @mitschabaude, could u please suggest me? thx

Yoga2015 avatar Feb 05 '24 20:02 Yoga2015

Using compile({ forceRecompile: true }) fixes it

dfstio avatar Feb 06 '24 06:02 dfstio

thx!

Yoga2015 avatar Feb 06 '24 07:02 Yoga2015

Do I understand this correctly - you have to do compile({ forceRecompile: true }) every time you're loading the proof from a file?

Then it's still a bug, we do want this to work from cached verification keys. I'll reopen this to track it if you don't mind

mitschabaude avatar Feb 06 '24 07:02 mitschabaude

Yes, it is correct. I've created a small repo with my usual settings that work for me in MinaNFT: https://github.com/dfstio/zkapp-test/tree/dfst (branch dfst) based on @Yoga2015 code

@mitschabaude, you can use it for debugging

See also https://discord.com/channels/484437221055922177/1203303039633465344 https://discord.com/channels/484437221055922177/1204071024191082496

dfstio avatar Feb 06 '24 08:02 dfstio

Running in to the same problem with the dynamic verify method. It seems the workaround is to either forceRecompile: true or cache: Cache.None any program at all before attempting to verify proofs. It doesn't have to be the same program you're verifying proofs for.

Single-file repro. Will succeed on first run, hang on second run, and can be fixed by changing false to true on line 4.

import { Bool, Cache, verify, ZkProgram } from 'o1js';

//  v-- Change to `true` to fix the hang, even though NoopProgram != VerifyMeProgram.
if (false) {
  const NoopProgram = ZkProgram({
    name: 'NoopProgram',
    methods: {
      chaff: {
        privateInputs: [],
        async method(b) {}
      }
    }
  });
  await NoopProgram.compile({ cache: Cache.None });
}

const inputs = await (typeof getPreviousProof !== 'undefined' ? getPreviousProof : createProof)();
console.log('verifying...');
console.log('ok?', await verify(inputs.proof, inputs.verificationKey));

async function createProof() {
  const VerifyMeProgram = ZkProgram({
    name: 'NoopProgram',

    methods: {
      proof: {
        privateInputs: [Bool],
        async method(b) {
          b.assertTrue();
        },
      },
    },
  });

  console.log('compiling...');
  const { verificationKey } = await VerifyMeProgram.compile();
  console.log('proving...');
  const proof = (await VerifyMeProgram.proof(Bool(true))).toJSON();
  const data = { verificationKey, proof };
  const fs = await import('fs/promises');
  const url = await import('url');
  await fs.appendFile(url.fileURLToPath(import.meta.url), `async function getPreviousProof() { return ${JSON.stringify(data)}; }`);
  return data;
}

SpaceManiac avatar Jul 25 '24 18:07 SpaceManiac

Thanks for the great analysis @SpaceManiac

Seems that 'verify()' relies on some values computed during compilation, and isn't able to get them from a cache. We need to fix this!

mitschabaude avatar Jul 26 '24 05:07 mitschabaude

@Yoga2015 I cloned your repo zkapp-test and upgraded to o1js 1.6.0 and found that both scripts work

➜  zkapp-test git:(main) ✗ npm run step1

> [email protected] step1
> tsc -p tsconfig.json --module esnext && cross-env TEST_ON_BERKELEY=true node --experimental-specifier-resolution=node ./dist/src/step1.js

ProverX_VK: 4245952421456887355009860533252876175165009247779780553775848711375535231266
ok: true
proofStr: {"publicInput":[],"publicOutput":["0"],"maxProofsVerified":0,"proof":"KChzdGF0ZW1lbn.....yMUFGMTEzNzUpKSkpKSkp"}
ok2: true
➜  zkapp-test git:(main) ✗ npm run step2

> [email protected] step2
> tsc -p tsconfig.json --module esnext && cross-env TEST_ON_BERKELEY=true node --experimental-specifier-resolution=node ./dist/src/step2.js

ProverX_VK: 4245952421456887355009860533252876175165009247779780553775848711375535231266
str2: {"publicInput":[],"publicOutput":["0"],"maxProofsVerified":0,"proof":"KChzdGF0Z....EY4RjUpKSkpKSkp"}
ok2: true

45930 avatar Jul 30 '24 16:07 45930

I still see the hang on o1js 1.6.0. Stack trace for the record:

global.$std::sys::wasm::locks::futex_condvar::Condvar::wait::h28b31f59c39a18e1 (011a8aae:1926773)
global.$rayon_core::latch::LockLatch::wait_and_reset::h078cf456c369aa7d (011a8aae:1849956)
global.$std::thread::local::LocalKey<T>::with::h2ce6fa65e023e5b9 (011a8aae:1789370)
global.$rayon_core::registry::Registry::in_worker::h1abee179cbaab566 (011a8aae:1397531)
global.$caml_pasta_fq_plonk_proof_batch_verify (011a8aae:1843699)
module.exports.caml_pasta_fq_plonk_proof_batch_verify (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/plonk_wasm.cjs:340)
caml_pasta_fq_plonk_proof_batch_verify (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:4577)
<anonymous> (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:169582)
Promise.then (Unknown Source:0)
deferred_run (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:4882)
batch_verify$0 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:169580)
caml_call2 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:6803)
batch_verify (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:159208)
batch_verify$0 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:159212)
caml_call1 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:6801)
_iIy_ (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:349923)
<anonymous> (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:4855)
Promise.then (Unknown Source:0)
deferred_bind (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:4853)
caml_call2 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:6803)
verify_heterogenous (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:349950)
<anonymous> (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:355112)
caml_call1 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:6801)
with_return (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:78372)
verify_promise$0 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:355110)
caml_call2 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:6803)
verify$2 (/PROJECT/node_modules/o1js/dist/node/bindings/compiled/_node_bindings/o1js_node.bc.cjs:431486)
<anonymous> (/PROJECT/node_modules/o1js/dist/node/lib/proof-system/zkprogram.js:318)
withThreadPool (/PROJECT/node_modules/o1js/dist/node/bindings/js/node/node-backend.js:48)

SpaceManiac avatar Jul 30 '24 17:07 SpaceManiac

Running Yoga's original example with forceRecompile: false still hangs as well. They must have updated to using forceRecompile: true after that workaround was discovered.

@SpaceManiac in your snippet getPreviousProof is referenced but not defined. Is there a declaration missing?

45930 avatar Jul 30 '24 18:07 45930

createProof() appends its definition to the file after the first run (perhaps too cute...)

SpaceManiac avatar Jul 30 '24 19:07 SpaceManiac

Whatever this bug is, it seems deeply lodged in the Pickles code. Here's what I found...

There are 2 cases I looked at (mostly all reproducing what @SpaceManiac has already reported).

  1. Verifying a serialized JSON proof with a serialized verification key without compiling anything first
  2. Verifying a serialized JSON proof with a serialized verification key after compiling a random zk program circuit (not the same circuit that we are verifying).

The expected behavior is that these cases both work, but in the first case, the verification call hangs forever and never returns a boolean. The second case works as expected.

o1js

In the example, we call the verify function exported in https://github.com/o1-labs/o1js/blob/0f57e36ce753c594c1c018ae2d4c677868a11864/src/lib/proof-system/zkprogram.ts#L475

In both cases, the proof we pass is in a string, so we hit the same code path. I've logged the proof and verification key in o1js from each case, and they are the exact same.

ocaml

Entering the Pickles layer, the call to verify happens in https://github.com/o1-labs/o1js-bindings/blob/8e99acc94da892eeb090367844f85d22ecf4f93b/ocaml/lib/pickles_bindings.ml#L713

In each case, we get passed the deserialization step and enter the async code:

  Pickles.Side_loaded.verify_promise ~typ [ (vk, statement, proof) ]
  |> Promise.map ~f:(fun x -> Js.bool (Or_error.is_ok x))
  |> Promise_js_helpers.to_js

That is as far as I can trace it for now.

45930 avatar Aug 27 '24 19:08 45930

The expected behavior is that these cases both work, but in the first case, the verification call hangs forever and never returns a boolean. The second case works as expected.

Would this behavior change if you would call initializeBindings() first?

dfstio avatar Aug 28 '24 08:08 dfstio

The expected behavior is that these cases both work, but in the first case, the verification call hangs forever and never returns a boolean. The second case works as expected.

Would this behavior change if you would call initializeBindings() first?

can't be, verify() calls that as well right at the beginning

mitschabaude avatar Aug 28 '24 09:08 mitschabaude

  • Verifying a serialized JSON proof with a serialized verification key without compiling anything first
  • Verifying a serialized JSON proof with a serialized verification key after compiling a random zk program circuit (not the same circuit that we are verifying).

The expected behavior is that these cases both work, but in the first case, the verification call hangs forever and never returns a boolean. The second case works as expected.

this is great progress! so it doesn't depend on any circuit-specific artifact

That is as far as I can trace it for now.

we will probably only resolve this issue by digging into the Ocaml/Rust verification code and finding out what implicit assumptions it makes on objects that exist

And we have a list of objects that are likely contenders: every large object (i.e. subject to caching instead of always computing on the fly) that is shared among different circuits/programs. Could be the SRS?

mitschabaude avatar Aug 28 '24 09:08 mitschabaude

Could be the SRS?

In my example, if I remove the following files from the cache:

wrap-pk-proverx
wrap-pk-proverx.header
wrap-vk-proverx
wrap-vk-proverx.header

verify() runs successfully. If I keep those files in the cache folder, verify() hangs

dfstio avatar Aug 28 '24 11:08 dfstio

If I keep those files in the cache folder, verify() hangs

that is when you compile before verifying right? my explanation for that would be that if those are not present, compile() "fully" recompiles, i.e. same behaviour as forceRecompile: true, and in the process also generates those other objects implicitly needed by verify()

mitschabaude avatar Aug 28 '24 13:08 mitschabaude

that is when you compile before verifying right?

Yes, that's right. Removing any other files does not affect the behavior; the process still hangs.

dfstio avatar Aug 28 '24 15:08 dfstio

we will probably only resolve this issue by digging into the Ocaml/Rust verification code and finding out what implicit assumptions it makes on objects that exist

@mitschabaude it looks like the verification code actually lives in the mina repo. Is that right? For instance, o1js-bindings calls Pickles.Side_loaded.verify_promise, but the implementation of that function is in mina in src/lib/pickles/pickles.ml. So would it make sense to open a new issue in that repo and link back to this one?

45930 avatar Aug 28 '24 15:08 45930

@mitschabaude it looks like the verification code actually lives in the mina repo. Is that right? For instance, o1js-bindings calls Pickles.Side_loaded.verify_promise, but the implementation of that function is in mina in src/lib/pickles/pickles.ml. So would it make sense to open a new issue in that repo and link back to this one?

yes it lives in the mina repo! no opinion on where to host the issue (it could also be that it's somehow caused by o1js behaviour of caching the SRS, which lives in the bindings repo

mitschabaude avatar Aug 28 '24 17:08 mitschabaude

Hey here, is there a workaround for this issue? Hard to tell from the comment thread. Many thanks.

andrewferrone avatar Sep 24 '24 17:09 andrewferrone

@andrewferrone, the best workaround at the moment is to compile an empty zk program, then use a cached proof from disk.

For instance

const NoopProgram = ZkProgram({
  name: 'NoopProgram',
  methods: {
    chaff: {
      privateInputs: [],
      async method(b) {}
    }
  }
});
await NoopProgram.compile({ cache: Cache.None });

45930 avatar Sep 24 '24 18:09 45930

This is not a realistic workaround if you're trying to ship an NPM package that others use. I don't want to include a disclaimer in my NPM package that if downloading the package breaks your build, you have to compile an empty zk program to fix it.

SebastienGllmt avatar Oct 03 '24 16:10 SebastienGllmt

Fair enough. We are working on this and prioritizing the fix. The work around should unblock many users, but I agree it’s not a suitable solution for your use case.

45930 avatar Oct 03 '24 16:10 45930

We're not pre-generating the langrange basis that we need before running the verifier. When we find it missing, the rust code panics, and WASM treats the panic as a halt. The main thread hangs waiting for the dead thread.

Will post a fix soon.

mrmr1993 avatar Oct 07 '24 21:10 mrmr1993

@45930 Fix here: https://github.com/o1-labs/o1js/pull/1857. Can you chase the right reviewers etc. to get it (and the corresponding PRs for the Mina, bindings, proof-systems submodules) merged?

mrmr1993 avatar Oct 08 '24 00:10 mrmr1993

@mrmr1993 I'll do my best :) thanks for pitching in!

45930 avatar Oct 08 '24 13:10 45930

Hey @SebastienGllmt, @andrewferrone, @kadirchan and others. We just released o1js version 1.9.0 which should fix this issue! Please re-open this issue if it is still not working for you :)

45930 avatar Oct 15 '24 20:10 45930