
Babe: bad epoch index with skipped epochs and warp sync

Open davxy opened this issue 2 years ago • 8 comments

It may happen that a warp-synced node ends up producing a wrong VRF output when one or more Babe epochs have been skipped.

This issue emerges under the following conditions:

  1. Some epochs are skipped.


  2. The validator is resumed.

    2.1) A new block B is produced.

    2.2) As epoch data it uses the data intended for the first skipped epoch (N). Note that the epoch data for this block is taken from the epoch changes tree, which still contains epoch index N (it has not been updated because different, even future, forks may not contain this skipped-epochs situation, so we must preserve the epoch index as is).

    2.3) The produced block, when imported, updates the state with the current slot value.

    2.4) The epoch index in the state is dynamically computed from the slot. This results in N+X (with X the number of skipped epochs).


  3. A node is warp synced.

    3.1) The epoch changes tree is constructed using the state at the warp sync point.

    3.2) Thus the first node of the epoch changes tree contains the epoch index dynamically computed from the slot value within the imported state (i.e. N+X).

    3.3) When block B is imported, the VRF output differs from the expected one, i.e. a different epoch index has been used with respect to the author (N vs N+X).

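The dynamic computation in step 2.4 can be sketched as follows. This is a minimal illustration, assuming a fixed epoch duration; the function name is hypothetical and not Substrate's actual API:

```rust
/// Epoch index as the runtime derives it from a slot:
/// index = (slot - genesis_slot) / epoch_duration.
fn epoch_index_from_slot(slot: u64, genesis_slot: u64, epoch_duration: u64) -> u64 {
    (slot - genesis_slot) / epoch_duration
}

fn main() {
    let (genesis_slot, epoch_duration) = (1000, 10);
    // Slot 1000 is the first slot of epoch 0.
    assert_eq!(epoch_index_from_slot(1000, genesis_slot, epoch_duration), 0);
    // If the chain stalls during epoch N = 10 and the next block is only
    // authored at slot 1205, the state-derived index jumps straight to
    // N + X = 20: skipped epochs are counted even though no blocks exist.
    assert_eq!(epoch_index_from_slot(1205, genesis_slot, epoch_duration), 20);
}
```

This is why the state ends up with N+X while the epoch changes tree (which is only updated by block import) still holds N.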


Note that there are several conditions that may prevent the issue from manifesting, for example:

  • the warp-synced node waits a bit, until the skipped epoch is no longer an issue;
  • the warp-synced node syncs and then only secondary blocks are imported until a new epoch starts.

The proposed solution detects a skipped epoch and corrects the epoch index value just before it is used in the authoring and verification procedures.
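A rough sketch of that detect-and-correct step, under assumed types and names (the actual Substrate implementation differs):

```rust
// Hypothetical sketch, not the actual Substrate types or code paths.
struct Epoch {
    index: u64,       // index stored in the epoch changes tree (possibly stale)
    start_slot: u64,  // first slot of the stored epoch
    duration: u64,    // slots per epoch
    genesis_slot: u64,
}

/// If `slot` falls beyond the stored epoch (i.e. epochs were skipped),
/// return the corrected (index, start_slot) derived from the slot itself;
/// otherwise `None` (nothing to fix).
fn corrected_index(epoch: &Epoch, slot: u64) -> Option<(u64, u64)> {
    if slot < epoch.start_slot + epoch.duration {
        return None;
    }
    let index = (slot - epoch.genesis_slot) / epoch.duration;
    let start_slot = epoch.genesis_slot + index * epoch.duration;
    Some((index, start_slot))
}

fn main() {
    // Stored entry still says epoch 10 (slots 1100..1110), but the chain
    // resumed at slot 1205, which belongs to epoch 20.
    let stale = Epoch { index: 10, start_slot: 1100, duration: 10, genesis_slot: 1000 };
    assert_eq!(corrected_index(&stale, 1205), Some((20, 1200)));
    // A slot inside the stored epoch needs no correction.
    assert_eq!(corrected_index(&stale, 1105), None);
}
```

Authoring and verification would apply such a correction just before using the epoch data, leaving the tree entry itself untouched.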

davxy avatar Jan 13 '23 11:01 davxy

3.2) Thus the first node of the epoch changes tree contains the epoch index dynamically computed from the slot value within the imported state (i.e. N+X)

Maybe I'm missing something, but we import the current_epoch and the next_epoch here, and the epoch index is calculated based on the index stored in the state. Also for the next_epoch we calculate it based on data stored in the state.

Only when we are computing the next epoch on a new session do we calculate the index based on the current slot.

Looking at the code I can't really see where the problem arises. We only "correct" the epoch index and start slot of the current epoch in block import if we detect skipped epochs. Shouldn't the warp-synced node do the same? Maybe I'm missing some small detail, probably again some detail of the epoch changes internals :P

(I'm also confused by your drawings, as they are missing some explanation of what is what, etc.)

bkchr avatar Jan 13 '23 13:01 bkchr

All you've said is correct. Indeed the warp-synced node has the data as it should be, and the epoch index it uses is correct.

After importing the state it uses the reset method here to create a fresh epoch changes tree with two nodes, one current and one next.

Now... when it verifies a block which belongs to the current epoch, it will use the epoch data found in the state just before calling reset.

What is not correct is the epoch data used by the author of the block (which is not warp synced)... To create the first block after the skipped epochs (i.e. in the current epoch) it will use the information found in the freshest entry of the epoch changes tree (on the right fork, obviously) restored from the aux data.

This epoch changes tree entry has everything right except the epoch-index value, which is out of sync... it still has the old epoch-index value.


At this point, for the same block, validator node epoch-index != warp-synced node epoch-index.

And since the epoch index is part of the VRF input, it follows that there is a mismatch during verification.


The code that I've added detects this in the authoring/verification code and tweaks the epoch index to the proper value.

The adjustment is also done in the verification code to let the other nodes (not warp synced) use the proper value as well (they will have an out-of-sync epoch changes tree too).

I hope that is clear :-)

davxy avatar Jan 13 '23 14:01 davxy

At this point, for the same block, validator node epoch-index != warp-synced node epoch-index.

Sorry for the dumb question, but shouldn't both be the same? The non-warp-synced node keeps the old epoch index until we see a next_epoch_digest and then realize that our current epoch is not correct (skipped epochs). The warp-synced node will fetch the current epoch from the state, which should have the same epoch index as the non-warp-synced node has in its epoch changes? Can you maybe write down, for someone like me, which epoch indexes both nodes would have after the skipped epoch? And to which block exactly would the warp-synced node need to jump to trigger this? The block before the skipped epochs or after them? The node would also not be able to jump directly into the skipped epochs; it would need to jump to before or after (after we have finished the epoch where we had skipped epochs). I'm confused xD I need some example please xD Just epoch numbers and where the node jumps to, etc. I have the slight feeling that you have forgotten again that we don't just jump to epochs and still need to sync the last epochs after the last set change, but maybe I'm just somewhere else right now with my head.

Nevertheless your changes look correct :P

bkchr avatar Jan 13 '23 15:01 bkchr

index as the non warp synced node has in its epoch changes? Can you maybe write down for someone like me what kind of epoch indexes both nodes

Let's concentrate (for simplicity) on two nodes of the network:

  1. a validator V
  2. a warp synced full-node W.

Some epochs are skipped, V restarts, and at the wrong moment W attempts a warp sync.


This is the history of V:

  1. V is at epoch E9 and produces a block B90 which announces epoch E10.
  2. this is imported by everyone.
  3. Everyone dies.
  4. V resumes on epoch E20 and restores the epoch changes tree from aux data.
  5. As epoch data for E20 it uses the best it can (i.e. the data for the first skipped epoch, E10, that it finds in the tree).
  6. V produces a block B90 (using epoch index = E10) <<< NOTE it uses the index that was saved in the epoch changes tree... with the fix in this PR this is corrected just in time here
  7. V imports block B90.
  8. Client code detects the skipped epoch, BUT (and this is the crucial point) it doesn't change the epoch index in the epoch data node. The detection is only used to correctly create the next epoch-change node in the tree. Note: here the epoch data fetched from the tree is changed with the corrected index, but this is a clone, see here!!!
  9. In the state, CurrentIndex is set using the slot of creation of block B90 (set to 20) here
  10. At some point B90 is finalized.
  11. We are still in epoch E20, which (in validator V) borrowed epoch data from E10 (it has index 10 in the epoch changes tree...).

This is the history of W (warp synced node)

  1. W warp syncs to the finalized block B90.
  2. The current epoch (E20) node data is read from the state using current_epoch().
  3. reset creates the current and next nodes in the epoch changes tree using the info read from the state.
  4. When this node imports blocks which belong to E20 there will be a mismatch in the VRF (since the producer has used index 10).
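The mismatch in step 4 follows because the epoch index feeds the VRF input. As a rough illustration only (real BABE builds a VRF transcript over the epoch randomness, slot, and epoch index; the byte concatenation below is purely hypothetical):

```rust
// Purely illustrative: this byte concatenation only demonstrates that a
// disagreement on the epoch index changes the VRF input for the same block.
fn vrf_input(randomness: &[u8; 32], slot: u64, epoch_index: u64) -> Vec<u8> {
    let mut input = randomness.to_vec();
    input.extend_from_slice(&slot.to_le_bytes());
    input.extend_from_slice(&epoch_index.to_le_bytes());
    input
}

fn main() {
    let randomness = [7u8; 32];
    let slot = 1205;
    // Author V signed with index 10 (stale tree entry); warp-synced W
    // recomputes the input with index 20, so verification fails.
    assert_ne!(vrf_input(&randomness, slot, 10), vrf_input(&randomness, slot, 20));
}
```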

If it is still not clear, maybe we can have a chat?

davxy avatar Jan 13 '23 16:01 davxy

@bkchr I don't know if you've already read the comment, but I just changed a couple of things to make it clearer.

davxy avatar Jan 13 '23 16:01 davxy

8. But this is a clone see here!!!

This was the important piece I was missing :D This explains why the indexes are different, thank you!

For the PR in general: as we have now replicated the same functionality around 3 times in the crate, can you put it into a function? It could maybe return an Option<(EpochIndex, StartSlot)> when there were skipped epochs.

bkchr avatar Jan 14 '23 07:01 bkchr

I've implemented a clone_for_slot method for the Epoch struct.

This method clones all the epoch information and adjusts the fields that depend on the slot value.

Even though the full clone is not really required for authoring and verification (the only thing we require is the corrected epoch index), having a method that performs a generic and meaningful operation is more "elegant" than a method with some weird name that only returns the corrected epoch index and a start slot.

Cloning the full epoch data is obviously more expensive, but it should be considered that this happens only if there are (or ever will be) epochs skipped by Babe (realistically never in a production network). Also, there are a lot of more expensive operations around :-)
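A sketch of what such a method could look like, with hypothetical field names (the real Substrate Epoch struct differs and carries authorities, randomness, etc.):

```rust
// Hypothetical field and method names; the real Substrate `Epoch` differs.
#[derive(Clone, Debug)]
struct Epoch {
    epoch_index: u64,
    start_slot: u64,
    duration: u64,
    genesis_slot: u64,
    // authorities, randomness, config, ... would be cloned unchanged
}

impl Epoch {
    /// Clone the epoch data, re-deriving the index and start slot from the
    /// given slot so that any skipped epochs are accounted for.
    fn clone_for_slot(&self, slot: u64) -> Epoch {
        let mut cloned = self.clone();
        cloned.epoch_index = (slot - self.genesis_slot) / self.duration;
        cloned.start_slot = self.genesis_slot + cloned.epoch_index * self.duration;
        cloned
    }
}

fn main() {
    // Stale entry for epoch 10; the block being authored sits at slot 1205,
    // i.e. inside epoch 20, so the clone gets the corrected index/start slot.
    let stale = Epoch { epoch_index: 10, start_slot: 1100, duration: 10, genesis_slot: 1000 };
    let fixed = stale.clone_for_slot(1205);
    assert_eq!(fixed.epoch_index, 20);
    assert_eq!(fixed.start_slot, 1200);
}
```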

davxy avatar Jan 16 '23 17:01 davxy

Is it also possible to get a test for this? :D Some unit test should be enough.

You are right. It's worth it.

davxy avatar Jan 18 '23 17:01 davxy

Is there a solution for recovering from this issue on the collator nodes? We just ran into it on our testnet. Normally we'd just reset the testnet, but we are currently running a competition on it and a reset would not be ideal right now.

nud3l avatar Feb 18 '23 08:02 nud3l

Same comment as on the issue.

bkchr avatar Feb 18 '23 16:02 bkchr

Sorry to jump into this PR, but it goes against the mental model that I have of how BABE works.

warp sync to finalized block B90 current epoch (E20) node data is read from the state using current_epoch().

Isn't that the actual problem? Shouldn't the warp synced validator be reading epoch 10 from the state?

You say yourself when describing V:

is at epoch E9 and produces a block B90 which announces epoch E10.

Block 90 starts epoch 10. Not 20.

tomaka avatar Jul 24 '23 11:07 tomaka

I also don't understand this line:

resumes on epoch E20. And restores the epoch changes tree from aux data.

You don't "resume on epoch 20". To me, the epoch number doesn't normally change automatically over time, but only when blocks are authored. No matter how long the validator was dead, even years, it still resumes on epoch 9 or 10 when it comes back online.

tomaka avatar Jul 24 '23 12:07 tomaka

The changes here come as a consequence of #11727. Epochs are defined in slots, and slots are a measure of real-world time. So even if there are no blocks, the slot numbers are still increasing, and by extension the epochs should also keep increasing. This guarantees that given a slot number you can always place it in the correct epoch. The runtime already tracks the epoch index correctly given the slot number (it will skip over epochs that didn't happen), but this does require changes on the client side in order to re-use the data from the latest announced epoch in case something was skipped (i.e. authorities and randomness). This should be reflected in the spec, sorry about that. I have created an issue for this: https://github.com/w3f/polkadot-spec/issues/681.

andresilva avatar Jul 24 '23 16:07 andresilva

Thanks, I see.

Am I right that the epoch_index serves no real purpose anymore (as it can always be computed on the fly from the slot number), and is kept just for backwards compatibility?

tomaka avatar Jul 24 '23 16:07 tomaka

Yes, for example the epoch_index is not part of the digest that's included in the block whenever an epoch ends (https://github.com/paritytech/substrate/blob/master/primitives/consensus/babe/src/digests.rs#L130C1-L136C2). It is still tracked by the runtime and is returned in the Epoch struct of the current_epoch and next_epoch runtime APIs, and it is correctly updated by the runtime if epochs are skipped. But indeed, it is redundant given that it can be calculated from the slot number (you also need to know the genesis slot, though).

(If in the future we want to be able to dynamically change the epoch length, then this computation of slot -> epoch would need to account for that. In which case, having the epoch_index easily available could be more useful.)

andresilva avatar Jul 24 '23 16:07 andresilva