
Fix backfill issue when nodes go down

Open SuperFluffy opened this issue 2 weeks ago • 1 comment

Describe the bug

Problem

The execution layer does not reliably persist its latest finalized blocks. At the same time, the consensus layer never forwards a finalized block to the execution layer a second time once processing of that block has been acknowledged.

In certain scenarios this means that, on restart, the node can neither propose nor verify: the execution layer does not have the block available, and the consensus layer will neither forward nor backfill it.

Suggested solution

Suggested by @klkvr:

Compare the results of:

  1. provider.database_provider_ro().last_finalized_block_number()
  2. provider.last_block_number()

If provider.last_block_number() < provider.database_provider_ro().last_finalized_block_number(), there is a hole which can be plugged by forwarding missing blocks from CL to EL.
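
A minimal sketch of this check, assuming the two provider calls above return a plain block number and an Option of a block number respectively; the helper name and the bare u64 alias are illustrative only, and error handling is omitted:

```rust
type BlockNumber = u64; // stand-in for reth's BlockNumber alias

/// Illustrative helper: given the EL's latest block and the latest finalized
/// block recorded in the database, return the range of blocks that would have
/// to be re-forwarded from the CL to the EL, if any.
fn missing_finalized_range(
    last_block: BlockNumber,             // provider.last_block_number()
    last_finalized: Option<BlockNumber>, // ...last_finalized_block_number()
) -> Option<std::ops::RangeInclusive<BlockNumber>> {
    match last_finalized {
        // The database records a finalized block past the EL head: this is the
        // hole that has to be plugged by forwarding blocks from CL to EL.
        Some(finalized) if last_block < finalized => Some(last_block + 1..=finalized),
        // No recorded finalized block, or the EL head already covers it.
        _ => None,
    }
}
```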

provider.database_provider_ro().last_finalized_block_number() is written on every FCU if the finalized block is known: https://github.com/paradigmxyz/reth/blob/56e60a3704433a7635a1e27be44cdaa6dc558358/crates/engine/tree/src/tree/mod.rs#L2633

On restart, the application sends an FCU(finalized_block_hash = network_finalized_tip), which leaves two cases (a sketch of the resulting decision follows the list):

  1. provider.database_provider_ro().last_finalized_block_number() returns None: the network has finalized blocks past the finalized block known by this node. The restarted node will therefore receive a future finalized block from its peers, which triggers the backfill mechanism.
  2. provider.database_provider_ro().last_finalized_block_number() returns Some(height): the finalized block was present in the database and the network has not finalized past height. We can then check whether there is a gap between height and the return value of provider.last_block_number(), and backfill it if necessary.
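
A sketch of how the two cases could map to actions on restart. The RestartAction enum, the function, and the u64 alias are hypothetical names used only for illustration; the actual CL-to-EL forwarding path is not shown:

```rust
type BlockNumber = u64; // stand-in for reth's BlockNumber alias

/// Illustrative outcome of the restart-time check.
enum RestartAction {
    /// Case 1: last_finalized_block_number() returned None. The network has
    /// finalized past this node, so a future finalized block from peers will
    /// trigger the existing backfill mechanism on its own.
    WaitForBackfillTrigger,
    /// Case 2: a finalized block is recorded but the EL head is behind it.
    /// Forward blocks (head + 1)..=finalized from the CL to the EL.
    ForwardFromConsensus { from: BlockNumber, to: BlockNumber },
    /// The EL already has everything up to the recorded finalized block.
    NothingToDo,
}

fn restart_action(
    last_block: BlockNumber,             // provider.last_block_number()
    last_finalized: Option<BlockNumber>, // ...last_finalized_block_number()
) -> RestartAction {
    match last_finalized {
        None => RestartAction::WaitForBackfillTrigger,
        Some(finalized) if last_block < finalized => RestartAction::ForwardFromConsensus {
            from: last_block + 1,
            to: finalized,
        },
        Some(_) => RestartAction::NothingToDo,
    }
}
```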

Testing

The scenario where finalized blocks are missing cannot be reliably triggered. However, with https://github.com/tempoxyz/tempo/pull/936 the regular backfill tests should no longer be flaky.

To observe the failure, the nodes must not be connected on the execution layer: https://github.com/tempoxyz/tempo/blob/a2f05055a3f52054f6bc0702e78fee9e121f88c8/crates/e2e/src/tests/restart.rs#L179

Background

Note that in most network conditions this will not lead to a problem:

  1. if the network keeps proposing (and eventually finalizing) blocks, the application will attempt to fill gaps from CL to EL (implemented in https://github.com/tempoxyz/tempo/pull/1173)
  2. if nodes are connected via execution layer p2p, the gaps will be filled by fetching them from peers.

The edge cases that would cause a network halt are, for example:

  1. a single-node network restarts;
  2. all nodes in the network go down at the same time such that they all lose the same block (on restart no new blocks are proposed, gaps are not filled from finalized blocks, and the block cannot be fetched via EL p2p because every node lacks it);
  3. one or more nodes go down such that the active validators drop below quorum while validators are not connected through EL p2p (no blocks are proposed, and blocks cannot be fetched via EL p2p).

Previously, it was thought that marshal::Mailbox::get_info with Info::Latest addressed this issue. However, it returned the network tip rather than the latest locally acknowledged finalized block, which triggered a cascade of far too many (and incorrect) backfills. This was fixed in https://github.com/tempoxyz/tempo/pull/1173, which in turn exposed the missing-block-on-restart problem again.

Alternatives

Commonware is considering how to support lossy applications: https://github.com/commonwarexyz/monorepo/issues/2424

Steps to reproduce

  1. Run a single-node network
  2. Kill and restart it repeatedly
  3. Observe that the finalized blocks are sometimes missing on restart (not persisted).

Logs


Platform(s)

No response

Container Type

Not running in a container

What version/commit are you on?

d44c60f9a6b35c96e106ab18908a8a007e436986

If you've built from source, provide the full command you used

No response

Code of Conduct

  • [x] I agree to follow the Code of Conduct

SuperFluffy commented on Dec 08 '25 17:12

Note to self: the conditional before fill_holes should look at best_block_number, not last_block_number.

joshieDo commented on Dec 12 '25 16:12