
go/worker/storage/committee: Explicit error for missing runtime block header

Open martintomazic opened this issue 2 months ago • 0 comments

Problem

Example error:

000000000000000000000000000000000000000000000000e199119c992377cb): failed to get block for round 10240341 (current round: 10365418): roothash: block not found","level":"error","module":"worker/storage","msg":"worker stopped","ts":"2025-11-14T09:33:22.7743955Z"}

This happens when your runtime's State DB latest round is older than the runtime's light history last retained round, which results in the worker fetching the light history header it does not have.
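
The condition above can be sketched as a single comparison. This is a minimal illustration, not code from oasis-core: the function name `checkLightHistoryGap` and both parameter names are hypothetical, and how the two rounds are obtained from the State DB and the light history is left out.

```go
package main

import "fmt"

// checkLightHistoryGap is a hypothetical helper (not an oasis-core API).
// It reports an error when the State DB's latest round is older than the
// light history's last retained round, i.e. when the storage worker would
// ask for a header that the light history no longer has.
func checkLightHistoryGap(stateLatestRound, lastRetainedRound uint64) error {
	if stateLatestRound < lastRetainedRound {
		return fmt.Errorf(
			"state DB latest round %d is older than light history last retained round %d",
			stateLatestRound, lastRetainedRound,
		)
	}
	return nil
}
```

For example, `checkLightHistoryGap(100, 200)` returns an error, while `checkLightHistoryGap(200, 100)` and `checkLightHistoryGap(100, 100)` return `nil`.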

So far I have seen two situations in practice:

  1. You restore consensus from a backup that is 3 months old and set pruning to two weeks. Later you decide to add a runtime to the node and restore it from the same 3-month-old snapshot, but you forget to restore the 3-month-old runtime light history as well. History reindex would then manually reindex consensus, but because pruning is set to two weeks, it only reindexes the last 2 weeks.
  2. People remove the consensus state (due to corruption) and do a consensus checkpoint sync, but keep runtime state whose latest round is older than the consensus checkpoint they just restored from. Again, this produces a gap in the light history.

Solution

Make this error log more explicit, and suggest how to intervene manually.

Optimal solutions

  1. Databases should be initialized and sanity-checked for corruption before any workers are started, preventing this error in the first place.
    • Obviously, the node should still stop and print what is wrong.
  2. Add a new runtime p2p protocol for light headers that fetches headers in reverse when the reindex leaves a gap.

Alternative solution

We could also remove runtime state up to the last retained light history round. I don't like this:

  1. We should avoid clearing state without the operator's consent.
  2. If the runtime state is big, this won't work due to #6334.

martintomazic, Nov 14 '25 12:11