Babe: bad epoch index with skipped epochs and warp sync
When one or more Babe epochs are skipped, a warp synced node may end up using the wrong epoch index for the VRF input and thus expect a VRF output different from the one produced by the block author.
This new issue emerges given the following conditions:

1. Some epochs are skipped.

2. The validator is resumed.
   2.1) A new block B is produced.
   2.2) As epoch data it will use the data intended for the first skipped epoch (N). Note that the epoch data for this block is taken from the epoch changes tree, which still contains N as epoch index (it has not been updated because different, even future, forks may not contain this skipped-epochs situation, thus we must preserve the epoch index as is).
   2.3) The produced block, when imported, will update the state with the current slot value.
   2.4) The epoch index in the state is dynamically computed from the slot. This results in N+X, with X the number of skipped epochs (see the sketch after this list).

3. A node is warp synced.
   3.1) The epoch changes tree is constructed using the state at the warp sync point.
   3.2) Thus the first node of the epoch changes tree will contain, as epoch index, the one dynamically computed from the slot value within the imported state (i.e. N+X).
   3.3) When block B is imported, the VRF output is different from the expected one, i.e. a different epoch index has been used with respect to the author (N vs N+X).
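To make the divergence concrete, here is a minimal, self-contained sketch (the function name, constants and numbers below are illustrative, not the actual Substrate code): the stale epoch changes entry keeps index N, while the state derives the index purely from the slot and lands on N+X.

```rust
/// Epoch index as the runtime derives it: purely a function of the slot
/// (assuming a fixed epoch duration and a known genesis slot).
fn epoch_index_from_slot(slot: u64, genesis_slot: u64, epoch_duration: u64) -> u64 {
    (slot - genesis_slot) / epoch_duration
}

fn main() {
    let genesis_slot = 0;
    let epoch_duration = 100; // slots per epoch, illustrative value

    // The freshest entry in the author's epoch changes tree is the announced,
    // never-started epoch N (say N = 10).
    let tree_epoch_index: u64 = 10;

    // Block production resumes much later, i.e. X epochs were skipped and the
    // new block B lands in a slot belonging to epoch N + X (here 20).
    let block_slot = 2_050;
    let state_epoch_index = epoch_index_from_slot(block_slot, genesis_slot, epoch_duration);
    assert_eq!(state_epoch_index, 20);

    // The epoch index is part of the VRF input, so the author (using 10) and
    // the warp synced verifier (using 20) disagree for the very same block.
    assert_ne!(tree_epoch_index, state_epoch_index);
}
```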

Note that there are several conditions that may prevent the issue from manifesting, for example:
- the warp synced node waits a bit, until the skipped epoch is no longer an issue;
- the warp synced node syncs and then only secondary blocks are imported until a new epoch starts.
The proposed solution detects a skipped epoch and corrects the epoch index value just before it is used in the authoring and verification procedures.
3. 3.2) Thus the first node of the epoch changes tree will contain, as epoch index, the one dynamically computed from the slot value within the imported state (i.e. N+X)
Maybe I'm missing something, but we import the current_epoch and the next_epoch here. And it calculates the epoch index based on the index stored in the state. Also for the next_epoch we calculate it based on data stored in the state.
Only when we are computing the next epoch on a new session, we calculate the index based on the current slot.
Looking at the code I cannot really see where the problem arises. We only "correct" the epoch index and start slot of the current epoch in block import if we detect skipped epochs. The warp synced node should do the same? Maybe I'm missing some small detail, probably again some detail of the epoch changes internals :P
(I'm also confused by your drawings, as they are missing some explanation of what is what, etc.)
All you've said is correct. Indeed the warp synced node has the data as it should be and the epoch index that it uses is correct.
After importing the state it uses the reset method here to create a fresh epoch changes tree with two nodes, one current and one next.
Now... when it verifies a block which belongs to the current epoch, it will use the epoch data found in the state just before calling reset.
What is not correct is the epoch data used by the author of the block (not warp synced)...
To create the first block after the skipped epochs (i.e. the current epoch) he will use the information found within the freshest entry of the epoch changes tree (on the right fork, obviously) that he restored from the aux data.
This epoch changes tree entry has all the stuff right except the epoch-index value, that is out-of-sync... it still has the old epoch-index value.
At this point, for the same block, validator node epoch-index != warp-synced node epoch-index.
And since the epoch-index is part of the VRF input, it follows that there is a mismatch during verification.
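Just to illustrate the effect (this is not the real VRF or transcript, only a stand-in hash showing that the epoch index feeds into the signed input):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the real VRF input/transcript: the only point is that the
// epoch index is one of the ingredients, so two different indices lead to
// two different expected outputs for the very same block.
fn fake_vrf_input(randomness: u64, slot: u64, epoch_index: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (randomness, slot, epoch_index).hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let (randomness, slot) = (42, 2_050);
    let author_side = fake_vrf_input(randomness, slot, 10); // stale tree index
    let verifier_side = fake_vrf_input(randomness, slot, 20); // slot-derived index
    assert_ne!(author_side, verifier_side);
}
```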
The code that I've added detects this in the authoring / verification code and tweaks the epoch index to the proper value.
The adjustment is also done in the verification code to let the other nodes (not warp synced) use the proper value as well (they will have an out-of-sync epoch changes tree too).
I hope that is clear :-)
At this point, for the same block, validator node epoch-index != warp-synced node epoch-index.
Sorry for the dumb question, but shouldn't both be the same? The non warp synced node is keeping the old epoch index until we see a next_epoch_digest and then realize that our current epoch is not correct (skipped epochs). The warp sync node will fetch the current epoch from the state, which should have the same epoch index as the non warp synced node has in its epoch changes? Can you maybe write down for someone like me what kind of epoch indexes both nodes would have after the skipped epoch? And to which block exactly would the warp sync node need to jump to trigger this? The block before the skipped epochs or after them? The node would also not be able to jump directly into the skipped epochs; it would need to jump to before or after (after we have finished the epoch where we had skipped epochs). I'm confused xD I need some example please xD Just epoch numbers and where the node jumps to etc. I have the slight feeling that you have forgotten again that we don't just jump to epochs and still need to sync the last epochs after the last set change, but maybe I'm just somewhere else right now with my head.
Nevertheless your changes look correct :P
The warp sync node will fetch the current epoch from the state, which should have the same epoch index as the non warp synced node has in its epoch changes? Can you maybe write down for someone like me what kind of epoch indexes both nodes would have after the skipped epoch?
Let's concentrate (for simplicity) on two nodes of the network:
- a validator V
- a warp synced full-node W.
Some epochs are skipped, V restarts, at the wrong moment W attempts a warp sync.
This is the history of V:
1. is at epoch E9 and produces a block B90 which announces epoch E10.
2. this is imported by everyone.
3. Everyone dies.
4. resumes on epoch E20. And restores the epoch changes tree from aux data.
5. as epoch data for E20 he uses the best he can use (i.e. the data for the first skipped epoch E10 he finds in the tree).
6. produces a block B90 (using epoch index = E10) <<< NOTE he uses the one that was saved in the epoch changes tree... with the fix in this PR this is corrected just in time here.
7. imports block B90.
8. Client code detects the skipped epoch. BUT (and this is the crucial point) it doesn't change the epoch index in the epoch data node. The detection is used only to correctly create the next epoch change node in the tree. Note: here the epoch data fetched from the tree is changed with the corrected index. But this is a clone see here!!!
9. In the state the CurrentIndex is set using the slot of creation of block B90 (set to 20) here.
10. At some point B90 is finalized.
11. We are still in epoch E20, which (in validator V) borrowed epoch data from E10 (it has index 10 in the epoch changes tree...).
This is the history of W (the warp synced node):
- warp sync to finalized block B90.
- current epoch (E20) node data is read from the state using current_epoch().
- reset creates the current and next nodes in the epoch changes tree using the info read from the state.
- When this node imports all blocks which belong to E20 there will be a mismatch in the VRF (since the producer has used index 10).
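To see why W ends up with index 20, here is a rough, self-contained sketch of what the warp-sync reset amounts to (the types and the reset function below are illustrative, not the real EpochChanges API): the old tree, with its stale announced entry, is simply replaced by two entries read from the runtime state.

```rust
#[derive(Debug)]
struct EpochData {
    index: u64,
    start_slot: u64,
}

struct EpochChanges {
    nodes: Vec<EpochData>,
}

impl EpochChanges {
    /// Roughly what a warp-sync reset does: drop whatever was in the tree and
    /// keep only the current and next epochs as reported by the runtime state.
    fn reset(current: EpochData, next: EpochData) -> Self {
        EpochChanges { nodes: vec![current, next] }
    }
}

fn main() {
    // Values as returned by current_epoch() / next_epoch() at the warp target
    // (numbers follow the E20 example above and are otherwise made up).
    let current = EpochData { index: 20, start_slot: 2_000 };
    let next = EpochData { index: 21, start_slot: 2_100 };

    let tree = EpochChanges::reset(current, next);
    assert_eq!(tree.nodes[0].index, 20); // while the author of the block used 10
    assert_eq!(tree.nodes[0].start_slot, 2_000);
    assert_eq!(tree.nodes.len(), 2);
}
```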
If it is still not clear maybe we can have a chat?
@bkchr I don't know if you already read the comment, but I just changed a couple of things to make it clearer.
8. But this is a clone see here!!!
This was the important piece I was missing :D This explains why the indexes are different, thank you!
For the PR in general: as we have now replicated the same functionality around 3 times in the crate, can you put it into a function? Maybe returning an Option<(EpochIndex, StartSlot)> when there were skipped epochs.
I've implemented a clone_for_slot method for the Epoch struct.
This method clones all the epoch information and adjusts all the fields depending on the slot value.
Even though the full clone is not really required for authoring and verification (the only thing that we require is the fixed epoch index) having a method doing a generic and meaningful operation is more "elegant" than having a method with some weird name that only returns the fixed epoch index and a start slot.
Cloning the full epoch data is obviously more expensive, but it should be considered that this happens only if some epochs are (or ever will be) skipped by Babe (realistically never in a production network). Also, there are a lot of more expensive operations around :-)
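A minimal sketch of what such a method could look like (the Epoch fields here are simplified stand-ins, not the real struct, and the adjustment logic is only an assumption about how the skipped-epoch correction works; it also doubles as the Option<(EpochIndex, StartSlot)>-style helper suggested above):

```rust
// Simplified stand-in for the real Epoch struct: only the fields needed to
// illustrate the skipped-epoch adjustment are included.
#[derive(Clone, Debug, PartialEq)]
struct Epoch {
    epoch_index: u64,
    start_slot: u64,
    duration: u64, // epoch length in slots
}

impl Epoch {
    /// Clone the epoch data, re-anchoring the index and start slot so that
    /// `slot` falls inside the returned epoch. If no epochs were skipped the
    /// clone is identical to `self`.
    fn clone_for_slot(&self, slot: u64) -> Epoch {
        let mut cloned = self.clone();
        if slot >= self.start_slot + self.duration {
            let skipped = (slot - self.start_slot) / self.duration;
            cloned.epoch_index += skipped;
            cloned.start_slot += skipped * self.duration;
        }
        cloned
    }
}

fn main() {
    // Stale entry for the first skipped epoch (index 10, starting at slot 1_000).
    let stale = Epoch { epoch_index: 10, start_slot: 1_000, duration: 100 };

    // Authoring/verification happens at a slot that belongs to epoch 20.
    let fixed = stale.clone_for_slot(2_050);
    assert_eq!(fixed.epoch_index, 20);
    assert_eq!(fixed.start_slot, 2_000);

    // No skipped epochs: nothing changes.
    assert_eq!(stale.clone_for_slot(1_050), stale);
}
```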
Is it also possible to get a test for this? :D Some unit test should be enough.
You are right. It is worth it.
Is there a way to recover from this issue on the collator nodes? We just ran into it on our testnet. Normally we'd just reset the testnet, but we are currently running a competition on the testnet and a reset would not be ideal right now.
Same comment as on the issue.
Sorry to come to this PR, but it goes against the mental model that I have of how BABE works.
warp sync to finalized block B90; current epoch (E20) node data is read from the state using current_epoch().
Isn't that the actual problem? Shouldn't the warp synced validator be reading epoch 10 from the state?
You say yourself when describing V:
is at epoch E9 and produces a block B90 which announces epoch E10.
Block 90 starts epoch 10. Not 20.
I also don't understand this line:
resumes on epoch E20. And restores the epoch changes tree from aux data.
You don't "resume on epoch 20". To me the epoch number isn't normally changing automatically over time but only when blocks are authored. No matter how long the validator was dead, even years, it still resumes on epoch 9 or 10 when it comes back online.
The changes here come as a consequence of #11727. Epochs are defined in slots, and slots are a measure of real-world time. So even if there are no blocks, the slot numbers are still increasing, and by extension the epochs should also keep increasing. This guarantees that given a slot number you can always place it in the correct epoch. The runtime already tracks the epoch index correctly given the slot number (it will skip over epochs that didn't happen), but this does require changes on the client side in order to re-use the data from the latest announced epoch (i.e. authorities and randomness) in case something was skipped. This should be reflected in the spec, sorry for that. I have created an issue for this: https://github.com/w3f/polkadot-spec/issues/681.
Thanks, I see.
Am I right that the epoch_index serves no real purpose anymore (as it can always be computed on the fly from the slot number), and is kept just for backwards compatibility?
Yes, for example the epoch_index is not part of the digest that's included in the block whenever an epoch ends (https://github.com/paritytech/substrate/blob/master/primitives/consensus/babe/src/digests.rs#L130C1-L136C2). It is still tracked by the runtime and is returned in the Epoch struct of the current_epoch and next_epoch runtime APIs, and it is correctly updated by the runtime if epochs are skipped. But indeed, it is redundant given that it can be calculated from the slot number (you also need to know the genesis slot though).
(If in the future we want to be able to dynamically change the epoch length, then this computation of slot -> epoch would need to account for that. In which case, having the epoch_index easily available could be more useful.)
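As a small illustration of that caveat (the function names and numbers below are made up): with a fixed epoch duration the index is a pure function of the slot, but once the duration can change per epoch you need the historical epoch boundaries, which is where a readily available epoch_index becomes useful again.

```rust
/// Fixed epoch duration: the index can be derived from the slot alone.
fn index_fixed(slot: u64, genesis_slot: u64, duration: u64) -> u64 {
    (slot - genesis_slot) / duration
}

/// Variable epoch durations: `epochs` holds (start_slot, duration) for each
/// past epoch, so the slot alone is no longer enough to derive the index.
fn index_variable(slot: u64, epochs: &[(u64, u64)]) -> Option<u64> {
    epochs
        .iter()
        .position(|&(start, duration)| slot >= start && slot < start + duration)
        .map(|i| i as u64)
}

fn main() {
    assert_eq!(index_fixed(2_050, 0, 100), 20);

    // Epoch 0 lasts 100 slots, epoch 1 lasts 50, epoch 2 lasts 200.
    let epochs = [(0, 100), (100, 50), (150, 200)];
    assert_eq!(index_variable(120, &epochs), Some(1));
}
```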