oasis-core icon indicating copy to clipboard operation
oasis-core copied to clipboard

Fix `runtime.history.History.Watchblocks` to emit sequential blocks (no gaps)

Open martintomazic opened this issue 10 months ago • 2 comments

Subscription to runtime.client.api.WatchBlocks (see) may return block sequence with missing rounds.

This may happen because runtime.history.History is not notifying blocks if they are Committ-ed via notify=false (during reindex)**.

Note this is only problematic if node !hasLocalStorage, as otherwise storage syncing (worker.storage.committee.Node) is responsible for notifying blocks via StorageSyncCheckpoint. This worker subscribes to roothash.ServiceClient.WatchBlocks, which may have gaps, however it fetches missing rounds.

As discussed in private, roothash.ServiceClient should probably not be responsible for reindexing, nor should roothash.BlockHistory know anything about storage, node types etc... Likely this code needs a thorough refactor.

** We are guaranteed to have at least two reindexes, possibly more if there are errors during subsequent event processing.

martintomazic avatar Feb 27 '25 10:02 martintomazic

As discussed in private, roothash.ServiceClient should probably not be responsible for reindexing, nor should roothash.BlockHistory know anything about storage, node types etc... Likely this code needs a thorough refactor.

Idealy, we move reindexing part from roothash.ServiceClient to something like worker.runtime.xxx. The worker should be initialized, just like others, after we have initialized RuntimeRegistry. Worker should first fetch runtime.history.History.LastConsensusHeight and reindex until n blocks away from consensus sync. At this point, it should subscribe to roothash.WatchBlocks, do another reindex to fill possible gap and then proceed with regular processing. After that, blocks should be always notified without any gaps.

What about problematic runtime.history.History?

  • runtime.history.History does two things:
    1. it implements roothash.BlockHistory, with pruning policy
    2. it exposes methods that take storage sync into account.
  • We could fix this by either:
    1. removing runtime.history.History interface altogether and reference concrete implementation directly,
    2. split existing block history abstraction into two,
    3. moving methods that take storage sync into account to the new worker.
  • The Problem is that we implement api.RuntimeClient.WatchBlocks, by subscribing to History.WatchBlocks.
    • Whichever solution we will have, we are still hiding away from clients of api.RuntimeClient.WatchBlocks (7 rn), that behavior depends on the underlying node storage type.
    • I prefer explicit parameters && documentation, over hidden flags/state, but then this would mean exposing node/storage type as part of our runtime client API :/?

Alternatively (not ideal), we may push for smtg similar to #6079, or keep it even simpler and use Commit/(Batch) with notify=true, during subsequent reindexes.

I think it would be nice to refactor properly, but also depends on our priorities.

martintomazic avatar Feb 28 '25 00:02 martintomazic

Idealy, we move reindexing part from roothash.ServiceClient to something like worker.runtime.xxx.

Agree. I tried to do that in #6089.

peternose avatar Feb 28 '25 13:02 peternose