Blockchain snark failure when producing a block during catchup

Open · ghost-not-in-the-shell opened this issue 1 year ago · 1 comment

We saw this blockchain snark failure in ITN:

"Constraint unsatisfied (unreduced):
Checked.Assert.equal
get_req: File "src/lib/snarky/src/base/merkle_tree.ml", line 447, characters 2-465
File "src/lib/consensus/proof_of_stake.ml", line 717, characters 21-28
get_vrf_evaluation: File "src/lib/consensus/proof_of_stake.ml", line 707, characters 6-1743
check: File "src/lib/consensus/proof_of_stake.ml", line 750, characters 8-1141
update_var: File "src/lib/consensus/proof_of_stake.ml", line 2178, characters 6-7348
next_state_checked: File "src/lib/consensus/proof_of_stake.ml", line 3437, characters 8-822
File "src/lib/blockchain_snark/blockchain_snark_state.ml", line 193, characters 15-22
step: File "src/lib/blockchain_snark/blockchain_snark_state.ml", line 141, characters 0-9581
rule_main
step_main

Constraint:
((basic(Equal(Var 8959)(Var 16837)))(annotation(Checked.Assert.equal)))
Data:
Equal 7928713220883902739236776858773952015557165966707513751241091166217980840633 4827442158274587516628819691165535377305385322414858796520817553439500966335"

Here are the logs from gcloud.

The cause of this blockchain snark error is that block production used curr_epoch_snapshot when it should have used last_epoch_snapshot. The root cause is that the block producer did not wait for catchup to finish before doing the VRF evaluation.

When the VRF evaluation happens, we only have the root transition, and the next_epoch_data.epoch of the root transition is less than k. That makes epoch_is_not_finalized true, and this condition determines that we pick curr_epoch_snapshot instead of last_epoch_snapshot. See https://github.com/MinaProtocol/mina/blob/44af5c2865deb8b0363b8d976cb56f5624969b0c/src/lib/consensus/proof_of_stake.ml#L2660 If the node had finished catchup, this condition would be false.
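
To make the failure mode concrete, here is a toy model of the choice described above. It is only a sketch and not the code at the linked line: the `snapshot` type, the `select_snapshot` function, the `next_epoch_progress` parameter, and the value chosen for `k` are all hypothetical stand-ins, and the comparison mirrors only the description in this issue.

```ocaml
(* Toy model of the snapshot choice described in this issue.
   Every name here is hypothetical; the real logic lives in
   src/lib/consensus/proof_of_stake.ml and is more involved. *)
type snapshot = Curr_epoch_snapshot | Last_epoch_snapshot

(* [next_epoch_progress] stands in for the value that the issue says is
   compared against [k] on the transition the node currently knows about. *)
let select_snapshot ~k ~next_epoch_progress =
  let epoch_is_not_finalized = next_epoch_progress < k in
  if epoch_is_not_finalized then Curr_epoch_snapshot else Last_epoch_snapshot

let () =
  (* [k] is an arbitrary illustrative value, not read from any config. *)
  let k = 10 in
  (* Mid-catchup: only the root transition is known, so the check sees a
     small value and picks the current-epoch snapshot. *)
  assert (select_snapshot ~k ~next_epoch_progress:3 = Curr_epoch_snapshot);
  (* After catchup: the same check on up-to-date state picks the
     last-epoch snapshot. *)
  assert (select_snapshot ~k ~next_epoch_progress:25 = Last_epoch_snapshot)
```

The point of the sketch is that the function itself is consistent; it returns a different snapshot purely because it is fed state from before catchup finished.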

More generally, if catchup has not finished, the snapshot returned by select_epoch_snapshot could be wrong. This commented-out assertion would actually catch this: https://github.com/MinaProtocol/mina/blob/44af5c2865deb8b0363b8d976cb56f5624969b0c/src/lib/consensus/proof_of_stake.ml#L3105C39-L3105C39 But it should not be an assertion; otherwise the node would crash.

My suggestion is to make the block producer aware of at least the frontier length. If we don't want to deal with the status of the node, we should at least have the node do the VRF evaluation only once we have a full frontier (unless it's in the first epoch). Another option is to pass the current frontier length to get_epoch_data_for_vrf and modify select_epoch_snapshot so that it also considers the transition frontier length. This would prevent what we saw here.
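
A rough sketch of the first suggestion, assuming the block producer can see the current frontier length and knows what length counts as a full frontier. The `vrf_gate` type and function and all three parameters are hypothetical and do not exist in the Mina codebase. The guard returns a value instead of asserting, so an incomplete frontier postpones the VRF evaluation rather than crashing the node.

```ocaml
(* Hypothetical guard deciding whether VRF evaluation may run yet.
   None of these names come from the Mina codebase. *)
type vrf_gate =
  | Evaluate
  | Wait_for_catchup of { have : int; need : int }

(* [full_frontier_length] is an assumed parameter: the number of
   transitions the frontier holds once catchup is complete. *)
let vrf_gate ~in_first_epoch ~frontier_length ~full_frontier_length =
  if in_first_epoch || frontier_length >= full_frontier_length then Evaluate
  else Wait_for_catchup { have = frontier_length; need = full_frontier_length }

let () =
  (* Only the root transition is present: hold off instead of crashing. *)
  assert (
    vrf_gate ~in_first_epoch:false ~frontier_length:1 ~full_frontier_length:10
    = Wait_for_catchup { have = 1; need = 10 } );
  (* The first epoch is exempt, as the suggestion above allows. *)
  assert (
    vrf_gate ~in_first_epoch:true ~frontier_length:1 ~full_frontier_length:10
    = Evaluate )
```

The second suggestion would instead thread the same frontier length into get_epoch_data_for_vrf, so that the snapshot selection itself can take it into account.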

ghost-not-in-the-shell · Nov 27 '23 19:11