Fix backfill issue when nodes go down
Describe the bug
Problem
The execution layer does not reliably persist its latest finalized blocks. At the same time, the consensus layer will never forward finalized blocks to the execution layer twice if processing of the finalized block was previously acknowledged.
This means that in certain scenarios the node, on restart, can neither propose nor verify: the execution layer does not have the block available, and the consensus layer will not forward or backfill it.
Suggested solution
Suggested by @klkvr:
Compare the results of:
- `provider.database_provider_ro().last_finalized_block_number()`
- `provider.last_block_number()`

If `provider.last_block_number()` < `provider.database_provider_ro().last_finalized_block_number()`, there is a hole which can be plugged by forwarding the missing blocks from the CL to the EL.
`provider.database_provider_ro().last_finalized_block_number()` is written on every FCU if the finalized block is known: https://github.com/paradigmxyz/reth/blob/56e60a3704433a7635a1e27be44cdaa6dc558358/crates/engine/tree/src/tree/mod.rs#L2633
On restart, the application sends an `FCU(finalized_block_hash = network_finalized_tip)`, which means that there are two options:
- `provider.database_provider_ro().last_finalized_block_number()` returns `None` - the network finalized blocks past the finalized block known by the node. This means the restarted node will receive a future finalized block from its peers and trigger the backfill mechanism.
- `provider.database_provider_ro().last_finalized_block_number()` returns `Some(height)` - the finalized block was known / present in the database; the network did not advance past `Some(height)`. This means we can check if there is a gap between `Some(height)` and the return value of `provider.last_block_number()`, backfilling the gap if necessary.
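The two options above can be sketched as a single gap check. This is a minimal illustration using a mock provider, not reth's actual trait; the method names mirror the ones referenced in this issue, but their signatures here are assumptions:

```rust
use std::ops::RangeInclusive;

// Mock stand-in for the reth provider; only the two methods this issue
// references are modeled. Real signatures may differ.
struct MockProvider {
    last_block: u64,
    last_finalized: Option<u64>,
}

impl MockProvider {
    fn last_block_number(&self) -> u64 {
        self.last_block
    }
    fn last_finalized_block_number(&self) -> Option<u64> {
        self.last_finalized
    }
}

/// Returns the inclusive range of block heights missing on the EL, if any.
fn finalized_gap(provider: &MockProvider) -> Option<RangeInclusive<u64>> {
    match provider.last_finalized_block_number() {
        // None: the network finalized past what this node knows; a future
        // finalized block from peers will trigger the regular backfill.
        None => None,
        // Some(height): there is a hole between the EL tip and the
        // finalized height; the CL should forward blocks in that range.
        Some(finalized) if provider.last_block_number() < finalized => {
            Some(provider.last_block_number() + 1..=finalized)
        }
        // EL tip is at or past the finalized block; nothing to do.
        Some(_) => None,
    }
}

fn main() {
    let p = MockProvider { last_block: 90, last_finalized: Some(100) };
    println!("{:?}", finalized_gap(&p)); // Some(91..=100)

    let p = MockProvider { last_block: 100, last_finalized: Some(100) };
    println!("{:?}", finalized_gap(&p)); // None
}
```
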
Testing
The scenario where finalized blocks are missing cannot be reliably triggered. However, with https://github.com/tempoxyz/tempo/pull/936 the regular backfill tests should no longer be flaky.
To observe the failure, the nodes must not be connected on the execution layer: https://github.com/tempoxyz/tempo/blob/a2f05055a3f52054f6bc0702e78fee9e121f88c8/crates/e2e/src/tests/restart.rs#L179
Background
Note that in most network conditions this will not lead to a problem:
- if the network keeps proposing (and eventually finalizing) blocks, the application will attempt to fill gaps from CL to EL (implemented in https://github.com/tempoxyz/tempo/pull/1173)
- if nodes are connected via execution layer p2p, the gaps will be filled by fetching them from peers.
The edge cases that would cause a network halt are, for example:
- a single node network restarts;
- all nodes in the network go down at the same time such that they all lose the same block (on restart, no new blocks are proposed, gaps are not filled from finalized blocks, and nodes cannot fetch blocks via EL p2p because they all lack the same block);
- one or more nodes go down such that the active validators drop below quorum and validators are not connected through EL p2p (no blocks will be proposed, blocks cannot be fetched via EL p2p).
Previously, it was thought that `marshal::Mailbox::get_info` with `Info::Latest` addressed this issue. However, it returned the network tip (not the latest locally acknowledged finalized block), which triggered a cascade of far too many (and incorrect) backfills. This was addressed in https://github.com/tempoxyz/tempo/pull/1173, which again revealed the missing-block-on-restart problem.
Alternatives
Commonware is considering how to support lossy applications: https://github.com/commonwarexyz/monorepo/issues/2424
Steps to reproduce
- Run a single-node network
- Kill and restart it repeatedly
- Observe that the finalized blocks are sometimes missing on restart, i.e. not persisted.
Logs
Platform(s)
No response
Container Type
Not running in a container
What version/commit are you on?
d44c60f9a6b35c96e106ab18908a8a007e436986
If you've built from source, provide the full command you used
No response
Code of Conduct
- [x] I agree to follow the Code of Conduct
Note to self: the conditional before `fill_holes` should be looking at `best_block_number`, not `last_block_number`.
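A hedged sketch of why that distinction matters, again with a mock provider. The assumption here (not confirmed by this issue) is that `best_block_number` includes canonical blocks still held in memory, while `last_block_number` reflects only the database, so comparing the database tip against the finalized height can report a hole that does not actually exist:

```rust
// Mock stand-in; the semantics of the two tip accessors are an assumption.
struct MockProvider {
    db_tip: u64,        // last_block_number(): highest block persisted to disk
    in_memory_tip: u64, // best_block_number(): highest canonical block, incl. in-memory
}

impl MockProvider {
    fn last_block_number(&self) -> u64 {
        self.db_tip
    }
    fn best_block_number(&self) -> u64 {
        self.in_memory_tip
    }
}

/// Gap check gated on the best block rather than the database tip, so
/// blocks that are canonical but not yet persisted do not look like a hole.
fn should_fill_holes(provider: &MockProvider, finalized: u64) -> bool {
    provider.best_block_number() < finalized
}

fn main() {
    // Blocks 91..=100 are canonical in memory but not yet persisted.
    let p = MockProvider { db_tip: 90, in_memory_tip: 100 };
    assert!(!should_fill_holes(&p, 100)); // no real hole
    assert!(p.last_block_number() < 100); // the db tip alone would report one
    println!("ok");
}
```
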