fix: add panic catch on operator calling FFI
Motivation
We have some calls to external Rust functions in the operator's Go code. If any of those functions panics, the panic propagates to the Go application and crashes it completely. We want to catch these panics so that this doesn't happen.
Changes
- Add Go panic handling with recover() on VerifySp1Proof, VerifyRiscZeroReceipt and VerifyMerkleTreeBatch.
- Add a function wrapper with a catch_unwind from the Rust side for verify_sp1_proof_ffi, verify_risc_zero_receipt_ffi and verify_merkle_tree_batch_ffi.
- Modify verify_sp1_proof_ffi, verify_risc_zero_receipt_ffi and verify_merkle_tree_batch_ffi to return i32 instead of bool to inform about errors.
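The wrapper pattern described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `verify_example_ffi` and `inner_verify_example_ffi`, the pointer/length signature, the placeholder check, and the concrete return codes (1, 0, -1) are all assumptions made for the example.

```rust
use std::panic;
use std::slice;

// Hypothetical return codes; the actual values used by the PR are not stated here.
pub const VERIFY_OK: i32 = 1;
pub const VERIFY_INVALID: i32 = 0;
pub const VERIFY_PANICKED: i32 = -1;

// Stand-in for the real verification logic (hypothetical signature).
// It panics on a null pointer to simulate a failure inside the verifier.
fn inner_verify_example_ffi(proof_ptr: *const u8, proof_len: usize) -> bool {
    if proof_ptr.is_null() {
        panic!("null proof pointer");
    }
    // SAFETY: the caller must guarantee that `proof_ptr` points to
    // `proof_len` readable bytes.
    let proof = unsafe { slice::from_raw_parts(proof_ptr, proof_len) };
    // Placeholder check standing in for actual proof verification.
    proof.first() == Some(&1)
}

// Outer function that the Go side would call. A real export would also need
// a no_mangle attribute, omitted here to keep the sketch edition-agnostic.
pub extern "C" fn verify_example_ffi(proof_ptr: *const u8, proof_len: usize) -> i32 {
    // `move` captures the raw pointer by value so the closure is unwind safe.
    let result = panic::catch_unwind(move || inner_verify_example_ffi(proof_ptr, proof_len));
    match result {
        Ok(true) => VERIFY_OK,
        Ok(false) => VERIFY_INVALID,
        // The panic stops here instead of unwinding across the FFI boundary.
        Err(_) => VERIFY_PANICKED,
    }
}
```

Returning an integer rather than a bool gives the caller a third state, so a panic can be told apart from an ordinary failed verification. Note that `catch_unwind` only works when the crate is built with unwinding panics, not with `panic = "abort"`.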
How to test
In order to test properly, we first have to take the staging branch as baseline and add a panic!() in the following FFI Rust functions (one at a time):
- verify_merkle_tree_batch_ffi on merkle_tree/lib/src/lib.rs
- verify_sp1_proof_ffi on sp1/lib/src/lib.rs
- verify_risc_zero_receipt_ffi on risc_zero/lib/src/lib.rs
This way, the Go operator will crash on a call to any of these FFIs, stopping execution immediately.
Note: Explicit instructions on how to run the operator are provided in the last section of this PR. The operator must be rebuilt after modifying the FFI functions.
- To test the call to verify_merkle_tree_batch_ffi, we have to run the whole system and start sending transactions with:
make batcher_send_infinite_sp1
- To test the call to verify_sp1_proof_ffi, we have to run the whole system and start sending transactions with:
make batcher_send_infinite_sp1
- To test the call to verify_risc_zero_receipt_ffi, we have to run the whole system and start sending transactions with:
  - make batcher_send_risc0_burst if you are on staging (you may need to repeat this a few times)
  - make batcher_send_risc0_big_burst if you are on fix/operator-ffi-panic-catch
Once you have tested the crashing behaviour, switch to the fix/operator-ffi-panic-catch branch and repeat the process. This time, the panics have to be added to the inner versions of the FFI functions, which are:
- inner_verify_merkle_tree_batch_ffi on merkle_tree/lib/src/lib.rs
- inner_verify_sp1_proof_ffi on sp1/lib/src/lib.rs
- inner_verify_risc_zero_receipt_ffi on risc_zero/lib/src/lib.rs
These should also be tested individually.
After running the system, you should see errors logged by the operator that provide information about the Rust panics. If the operator hasn't crashed, then everything is working properly.
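For context on where such panic information can come from, here is a hedged sketch of how the Rust side could turn a caught panic into a readable message. How this PR actually surfaces the information to the operator's logs is not shown in this description; the helpers `panic_message` and `run_catching` are hypothetical names introduced only for this example.

```rust
use std::any::Any;
use std::panic;

// Extracts a human-readable message from a panic payload.
// `panic!("literal")` yields a `&'static str` payload, while a formatted
// `panic!("{}", x)` yields a `String`; any other payload type is reported
// generically.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(msg) = payload.downcast_ref::<&str>() {
        (*msg).to_string()
    } else if let Some(msg) = payload.downcast_ref::<String>() {
        msg.clone()
    } else {
        "unknown panic payload".to_string()
    }
}

// Hypothetical helper: runs a verification closure and converts a panic
// into an `Err` carrying the panic message.
fn run_catching<F>(f: F) -> Result<bool, String>
where
    F: FnOnce() -> bool + panic::UnwindSafe,
{
    panic::catch_unwind(f).map_err(|payload| panic_message(&*payload))
}
```

With a helper like this, a `panic!()` injected into one of the inner functions shows up as an error value on the Rust side that can be reported, rather than as a process crash. The default panic hook still prints the panic to stderr as well.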
How to run the operator
- Execute anvil:
make anvil_start_with_block_time
- Launch aggregator:
make aggregator_start
- Launch batcher:
make batcher_start_local
- Launch operator:
make build_operator
make operator_register_and_start CONFIG_FILE=config-files/config-operator-1.yaml
Closes #1174
A review from @Oppen would be nice.
This successfully caught panics for the Merkle tree verification and SP1 verification, but failed when I tested with risc0.
I tested on macOS and Go still panics. Have you rebuilt the FFIs after every change?
I might have tested it wrongly, but would you please check if it works on your machine? If it does, I'll make sure to try again.
Could you check again, please?