aligned_layer

fix: add panic catch on operator calling FFI

Open JulianVentura opened this issue 1 year ago • 2 comments

Motivation

The operator's Go code calls external Rust functions through FFI. If any of those functions panics, the panic propagates to the Go application and crashes it completely. We want to catch those panics so the operator keeps running.

Changes

  • Add Go panic handling with recover() in VerifySp1Proof, VerifyRiscZeroReceipt and VerifyMerkleTreeBatch.
  • Add a wrapper function using catch_unwind on the Rust side for verify_sp1_proof_ffi, verify_risc_zero_receipt_ffi and verify_merkle_tree_batch_ffi.
  • Modify verify_sp1_proof_ffi, verify_risc_zero_receipt_ffi and verify_merkle_tree_batch_ffi to return i32 instead of bool so that errors can be reported to the caller.
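The wrapper pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `verify_example_ffi`, `inner_verify_example`, the pointer/length signature and the return codes (1 valid, 0 invalid, -1 panic) are all assumptions made for the example.

```rust
use std::panic;

// Hypothetical return codes: the PR only states that the FFI functions
// now return i32 instead of bool.
const VALID: i32 = 1;
const INVALID: i32 = 0;
const PANICKED: i32 = -1;

// Stand-in for an inner verifier (hypothetical). It panics on empty input
// to simulate a failure inside the verification code.
fn inner_verify_example(proof: &[u8]) -> bool {
    if proof.is_empty() {
        panic!("empty proof");
    }
    proof[0] == 1
}

// FFI entry point (hypothetical). A real export would also carry a
// no_mangle attribute so the symbol can be linked from Go.
pub extern "C" fn verify_example_ffi(proof_ptr: *const u8, proof_len: u32) -> i32 {
    // catch_unwind stops a panic from unwinding across the FFI boundary.
    let result = panic::catch_unwind(|| {
        if proof_ptr.is_null() {
            return false;
        }
        // Assumes the caller passes a pointer valid for proof_len bytes.
        let proof = unsafe { std::slice::from_raw_parts(proof_ptr, proof_len as usize) };
        inner_verify_example(proof)
    });

    match result {
        Ok(true) => VALID,
        Ok(false) => INVALID,
        Err(_) => PANICKED,
    }
}
```

With this shape, the Go caller can tell an invalid proof apart from a verifier failure by checking the returned integer rather than a boolean.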

How to test

In order to test properly, we first have to take the staging branch as baseline and add a panic!() in the following FFI Rust functions (one at a time):

  • verify_merkle_tree_batch_ffi on merkle_tree/lib/src/lib.rs
  • verify_sp1_proof_ffi on sp1/lib/src/lib.rs
  • verify_risc_zero_receipt_ffi on risc_zero/lib/src/lib.rs

This way, the Go operator will crash on a call to any of these FFIs, stopping execution immediately.

Note: Explicit instructions on how to run the operator are provided in the last section of this PR. The operator must be rebuilt after modifying the FFI functions.

  1. To test the call to verify_merkle_tree_batch_ffi we have to run the whole system and start sending transactions with:
    • make batcher_send_infinite_sp1
  2. To test the call to verify_sp1_proof_ffi we have to run the whole system and start sending transactions with:
    • make batcher_send_infinite_sp1
  3. To test the call to verify_risc_zero_receipt_ffi we have to run the whole system and start sending transactions with:
    • make batcher_send_risc0_burst if you are on staging (you may need to repeat this a few times)
    • make batcher_send_risc0_big_burst if you are on fix/operator-ffi-panic-catch

Once you have tested the crashing behaviour, switch to the fix/operator-ffi-panic-catch branch and repeat the process. This time the panics have to be added to the inner versions of the FFI functions, which are:

  • inner_verify_merkle_tree_batch_ffi on merkle_tree/lib/src/lib.rs
  • inner_verify_sp1_proof_ffi on sp1/lib/src/lib.rs
  • inner_verify_risc_zero_receipt_ffi on risc_zero/lib/src/lib.rs

These should also be tested individually.

After executing, you should see errors logged by the operator with information about the Rust panics. If the operator hasn't crashed, then everything is working properly.
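As an illustration of where such log information can come from, the payload returned by catch_unwind can be turned into a readable message. This is only a sketch under assumptions: `panic_message` and `run_and_log` are hypothetical helpers, and the PR may log panics differently.

```rust
use std::any::Any;
use std::panic;

// Convert a panic payload into a printable message. Payloads produced by
// panic!() are normally a &'static str or a String.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        s.to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "unknown panic payload".to_string()
    }
}

// Hypothetical helper: run a verifier, log any panic to stderr, and map the
// outcome to an i32 (1 valid, 0 invalid, -1 panic; these codes are assumed).
fn run_and_log<F>(name: &str, verifier: F) -> i32
where
    F: FnOnce() -> bool + panic::UnwindSafe,
{
    match panic::catch_unwind(verifier) {
        Ok(valid) => valid as i32,
        Err(payload) => {
            eprintln!("{} panicked: {}", name, panic_message(&*payload));
            -1
        }
    }
}
```

Calling `run_and_log` with a closure that hits an injected `panic!()` returns -1 and writes the panic text to stderr instead of aborting the process.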

How to run the operator

  1. Execute anvil:
    • make anvil_start_with_block_time
  2. Launch aggregator:
    • make aggregator_start
  3. Launch batcher:
    • make batcher_start_local
  4. Launch operator:
    • make build_operator
    • make operator_register_and_start CONFIG_FILE=config-files/config-operator-1.yaml

Closes #1174

JulianVentura avatar Oct 07 '24 22:10 JulianVentura

A review from @Oppen would be nice.

JuArce avatar Oct 09 '24 20:10 JuArce

This successfully caught panics for the Merkle tree verification and SP1 verification, but failed when I tested with risc0.

PatStiles avatar Oct 10 '24 18:10 PatStiles

I tested on macOS and Go still panics. Have you rebuilt the FFIs after every change?

I might have tested it incorrectly, but could you please check whether it works on your machine? If it does, I'll make sure to try again.

Could you check again, please?

JulianVentura avatar Oct 16 '24 19:10 JulianVentura