flow-go [EFM] Invalid Service Events shortly after Epoch Commit

Open AlexHentschel opened this issue 10 months ago • 1 comments

Problem description

Currently, the EpochStateMachine, which orchestrates the Epoch Happy Path and Fallback, has this behaviour:

As of the block that encounters an invalid Epoch ServiceEvent, we engage Epoch Fallback Mode [EFM] and stop processing any Epoch transitions. This creates subtle edge cases for future light clients and could potentially drive consensus into an irreconcilable state (I am not certain about the latter).
- Scenario:
  - Imagine that Epoch N ends at view 1000.
  - Block from view 1001 (first block of Epoch N+1) seals a result that has an invalid Epoch Service Event
- How the current implementation will behave:
  - The leader (let's call her Alice) for the first view (1001) in Epoch N+1 constructs her block, so she executes ProcessUpdate on the Epoch State Machine (including the broken Service Event).
  - first, EpochStateMachine realizes that this is the first block of the epoch, so it performs an epoch transition (👉 code).
  - However, while processing the service events, EpochStateMachine will encounter an InvalidServiceEventError here so it transitions to EFM.
  - Transitioning to EFM means that we discard the interim Epoch state we have so far (including the epoch transition), re-initialize the state with a fresh copy of the parent block's Epoch state, and re-apply all the service events.
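
The sequence above can be illustrated with a minimal sketch. All names here (`EpochState`, `evolveCurrent`, `applyServiceEvents`) are hypothetical and heavily simplified; they are assumptions for illustration, not flow-go's actual types or API. The re-application of service events in fallback mode is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// EpochState is a hypothetical, heavily simplified stand-in for the epoch
// part of the protocol state. It is not flow-go's actual data structure.
type EpochState struct {
	Counter       uint64 // epoch that this state considers current
	FinalView     uint64 // last view of the current epoch
	NextCommitted bool   // true once epoch Counter+1 has been committed
	NextFinalView uint64 // last view of the committed next epoch
	Fallback      bool   // true once EFM has been triggered
}

var errInvalidServiceEvent = errors.New("invalid epoch service event")

// applyServiceEvents stands in for service event processing; the boolean
// simply selects whether processing fails.
func applyServiceEvents(valid bool) error {
	if !valid {
		return errInvalidServiceEvent
	}
	return nil
}

// evolveCurrent mimics the behaviour described in the issue: when a service
// event is invalid, the interim state is thrown away and the fallback state
// is initialized from a copy of the parent state.
func evolveCurrent(parent EpochState, view uint64, eventsValid bool) EpochState {
	interim := parent
	if parent.NextCommitted && view > parent.FinalView {
		// First block of the new epoch: perform the epoch transition.
		interim.Counter++
		interim.FinalView = parent.NextFinalView
		interim.NextCommitted = false
	}
	if err := applyServiceEvents(eventsValid); err != nil {
		// Discard the interim state, including the transition just applied.
		fallback := parent
		fallback.Fallback = true
		return fallback
	}
	return interim
}

func main() {
	parent := EpochState{Counter: 7, FinalView: 1000, NextCommitted: true, NextFinalView: 2000}
	child := evolveCurrent(parent, 1001, false)
	fmt.Printf("epoch=%d fallback=%v finalView=%d\n", child.Counter, child.Fallback, child.FinalView)
}
```

In this sketch the block at view 1001 ends up with a state that is still in epoch 7 (standing in for Epoch N) with a final view of 1000, even though view 1001 lies beyond it.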

In my opinion, the consensus protocol has formally reached an irreconcilable state at this point. I think our current implementation would probably just stop producing blocks. Reasoning:

Note that when initializing the FallbackStateMachine, we do not re-apply the epoch transition.
Going into view 1001, Alice thought that she was the leader, based on the leader assignment for Epoch N+1. However, after running the Epoch State Machine, the Epoch state is still in Epoch N in fallback mode. Most likely, Alice is not the leader for view 1001 in EFM of Epoch N.
I don't think our software will handle this edge case. Certainly it is a violation of HotStuff's formal safety requirement: once you commit to a leader selection for some range of views (here the views belonging to Epoch N+1), you cannot change it (slightly simplified). Conceptually, we commit to the leader selection once we commit Epoch N+1.

I think a similar aspect has previously come up for the EFM recovery. Specifically, the EFM recovery cannot change the modus operandi for view ranges that the FallbackStateMachine has already committed.

Suggestion of Problem Solution

Once an Epoch is committed (happy path) to some fork, that Epoch will become active on the specified view -- if this fork is extended beyond the epoch boundary. In other words, the FallbackStateMachine will also enact Epoch transitions that have previously been committed by the happy path protocol.
The aspect where HappyPathStateMachine and FallbackStateMachine differ is the way they add new view ranges beyond those already committed.
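
A minimal sketch of this suggestion, under the assumption that the epoch transition is factored out into a step shared by both state machines. The names (`EpochState`, `enactCommittedTransition`, `evolveSuggested`) are hypothetical and do not correspond to flow-go's actual API.

```go
package main

import "fmt"

// EpochState is a hypothetical, simplified stand-in for the epoch part of
// the protocol state. It is not flow-go's actual data structure.
type EpochState struct {
	Counter       uint64 // epoch that this state considers current
	FinalView     uint64 // last view of the current epoch
	NextCommitted bool   // true once epoch Counter+1 has been committed
	NextFinalView uint64 // last view of the committed next epoch
	Fallback      bool   // true once EFM has been triggered
}

// enactCommittedTransition applies an epoch transition that was committed
// earlier in this fork. It does not depend on whether we are in fallback mode.
func enactCommittedTransition(s EpochState, view uint64) EpochState {
	if s.NextCommitted && view > s.FinalView {
		s.Counter++
		s.FinalView = s.NextFinalView
		s.NextCommitted = false
	}
	return s
}

// evolveSuggested keeps the committed transition even when an invalid
// service event triggers fallback mode in the same block.
func evolveSuggested(parent EpochState, view uint64, eventsValid bool) EpochState {
	next := enactCommittedTransition(parent, view)
	if !eventsValid {
		next.Fallback = true
	}
	return next
}

func main() {
	parent := EpochState{Counter: 7, FinalView: 1000, NextCommitted: true, NextFinalView: 2000}
	child := evolveSuggested(parent, 1001, false)
	fmt.Printf("epoch=%d fallback=%v finalView=%d\n", child.Counter, child.Fallback, child.FinalView)
}
```

Here the block at view 1001 enters fallback mode in epoch 8 (standing in for Epoch N+1), so the leader assignment Alice relied on stays in force.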

Apr 05 '24 00:04 AlexHentschel

Let us consider the following suggestion:

Once a leader selection for a view range is committed, it can never be overwritten/changed

A leader selection view range is committed upon finalizing the EpochCommit event (not EpochSetup) on the happy path

A leader selection view range is committed upon entering EFM on the fallback path

and

If an epoch extension is added, it's appended to the last committed leader selection view range.
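
The rule "never overwrite, only append" can be sketched as an append-only schedule. `ViewRange`, `LeaderSchedule`, and the `Source` labels are hypothetical names chosen for illustration, not part of flow-go.

```go
package main

import (
	"errors"
	"fmt"
)

// ViewRange is a hypothetical record stating that views First..Last
// (inclusive) use the leader selection identified by Source.
type ViewRange struct {
	First, Last uint64
	Source      string // e.g. "epoch N+1" or "extension"
}

// LeaderSchedule holds committed view ranges in ascending order.
type LeaderSchedule struct {
	ranges []ViewRange
}

var ErrNotContiguous = errors.New("range must start right after the last committed view")

// Commit appends r to the schedule. It refuses any range that would overlap
// or leave a gap, so already committed views can never be reassigned.
func (s *LeaderSchedule) Commit(r ViewRange) error {
	if r.Last < r.First {
		return errors.New("inverted view range")
	}
	if n := len(s.ranges); n > 0 && r.First != s.ranges[n-1].Last+1 {
		return ErrNotContiguous
	}
	s.ranges = append(s.ranges, r)
	return nil
}

// SourceFor reports which committed leader selection covers the given view.
func (s *LeaderSchedule) SourceFor(view uint64) (string, bool) {
	for _, r := range s.ranges {
		if view >= r.First && view <= r.Last {
			return r.Source, true
		}
	}
	return "", false
}

func main() {
	var s LeaderSchedule
	_ = s.Commit(ViewRange{First: 1, Last: 1000, Source: "epoch N"})
	_ = s.Commit(ViewRange{First: 1001, Last: 2000, Source: "epoch N+1"})
	err := s.Commit(ViewRange{First: 1001, Last: 1100, Source: "extension"})
	fmt.Println("overwrite attempt:", err)
	fmt.Println("append extension:", s.Commit(ViewRange{First: 2001, Last: 2100, Source: "extension"}))
}
```

An extension that tries to start at view 1001 is rejected because that view already belongs to a committed range; the same extension starting at view 2001 is accepted.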

Thoughts

There is a subtle detail in the proposed specification that we have to get right in order to not break consensus. To paraphrase, we are suggesting to use finality as a decision criterion on whether or not the Epoch State Machine accepts a service event leading to a changed leader selection for a future view range.

General Rule:

Finality cannot be used as an input for evolving the Protocol State. Generally, only information in the fork that is currently being extended can be used to determine the validity of a block. There are exceptions where using finality is safe, but those are generally very edge-casey.

Reasoning:

Finality is a determination that nodes make locally. Very explicitly, nodes may all know the fork A <- B <- C and receive the candidate D such that A <- B <- C <- D, yet they might still have different finality statuses for the blocks. Specifically, this is because nodes may observe alternative forks that are subsequently orphaned and are not observed by other nodes. Nevertheless, observing subsequently orphaned forks can (rarely) progress finality on the main fork.

For example, knowing the fork A <- B <- C <- D, I may conclude that B is finalized. On the one hand, based on my world view, C is still unfinalized. On the other hand, some other node may know additional children of C that finalize C. So if we allow finality to influence which Protocol State transitions in block D are legal or illegal, that other node and I may disagree on whether D is a valid extension of the chain.
Consensus rules guarantee that finality is eventually consistent. In other words, if some node finalizes block B and if the network continues to produce valid blocks, all honest nodes will eventually conclude that B is finalized.
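
The disagreement in the example can be sketched as follows. This is an illustration only: `Block`, `Node`, and both acceptance rules are hypothetical, and the finality status of each node is simply given as input rather than derived from HotStuff's actual finalization rule.

```go
package main

import "fmt"

// Block is a hypothetical minimal block: an ID plus a pointer to its parent.
type Block struct {
	ID     string
	Parent *Block
}

// Node models one participant's local view of which blocks are finalized.
type Node struct {
	Name      string
	Finalized map[string]bool
}

// acceptsUsingFinality is the UNSAFE rule: validity of the candidate depends
// on whether this node has locally finalized the block holding the event.
func acceptsUsingFinality(n Node, eventBlockID string) bool {
	return n.Finalized[eventBlockID]
}

// acceptsUsingFork looks only at the ancestry of the candidate, which is
// identical for every node that knows the fork.
func acceptsUsingFork(candidate *Block, eventBlockID string) bool {
	for b := candidate.Parent; b != nil; b = b.Parent {
		if b.ID == eventBlockID {
			return true
		}
	}
	return false
}

func main() {
	a := &Block{ID: "A"}
	b := &Block{ID: "B", Parent: a}
	c := &Block{ID: "C", Parent: b}
	d := &Block{ID: "D", Parent: c}

	me := Node{Name: "me", Finalized: map[string]bool{"A": true, "B": true}}
	other := Node{Name: "other", Finalized: map[string]bool{"A": true, "B": true, "C": true}}

	// Suppose the relevant service event sits in block C.
	for _, n := range []Node{me, other} {
		fmt.Printf("%s: finality rule=%v fork rule=%v\n",
			n.Name, acceptsUsingFinality(n, "C"), acceptsUsingFork(d, "C"))
	}
}
```

With the finality-based rule the two nodes reach different verdicts on D, whereas the fork-based rule gives the same verdict for both.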

Suggested change.

Above, I argued that this part of the suggested rule would break consensus:

❌ A leader selection view range is committed upon finalizing the EpochCommit event (not EpochSetup) in the happy path

Let's discuss how we can modify this rule to make it work:

I think as a first step, we need to have a definition of "committing a leader selection view range" for a specific fork. I would suggest:
- On the happy path, a leader selection view range is committed for one specific fork, when an EpochCommit event is included in that fork.
- On the unhappy path, a leader selection view range is committed when the EFM logic reaches the threshold view without a valid EpochCommit or EpochRecovery event.
With this definition, we can have different committed leader selection view ranges in different forks. This is no problem, as long as such view ranges are sufficiently far into the future.
- However, by the time the consensus committee for the view range takes over, it must have already been finalized which committee is taking over. In other words, finality is not important when the Epoch State Machine writes the leader selection view range into the protocol state. Finality is important when this view range activates. Conceptually, these are the same mechanics as with protocol version upgrades: we need this safety buffer between writing data into the Protocol State and this taking effect in the network.
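
A minimal sketch of the fork-specific definition, assuming a committed view range is derived purely from the ancestry of a block. `Block`, `ViewRange`, and `CommittedRanges` are hypothetical names for illustration and not flow-go's actual types.

```go
package main

import "fmt"

// ViewRange is a hypothetical committed leader selection for views
// First..Last (inclusive).
type ViewRange struct {
	First, Last uint64
}

// Block is a hypothetical minimal block. Commit is non-nil if the block
// includes an event that commits a leader selection view range.
type Block struct {
	ID     string
	Parent *Block
	Commit *ViewRange
}

// CommittedRanges returns the view ranges committed in the fork ending at b,
// oldest first. It reads only b's ancestry and never consults finality.
func CommittedRanges(b *Block) []ViewRange {
	var out []ViewRange
	for cur := b; cur != nil; cur = cur.Parent {
		if cur.Commit != nil {
			out = append(out, *cur.Commit)
		}
	}
	// The walk collected ranges newest first, so reverse the slice.
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}

func main() {
	root := &Block{ID: "A"}
	left := &Block{ID: "B1", Parent: root, Commit: &ViewRange{First: 1001, Last: 2000}}
	right := &Block{ID: "B2", Parent: root}

	fmt.Println("fork via B1:", CommittedRanges(left))
	fmt.Println("fork via B2:", CommittedRanges(right))
}
```

The two forks report different committed ranges, which is acceptable under the definition above as long as one of the forks is finalized before view 1001 is reached.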

Apr 11 '24 00:04 AlexHentschel
