flow-go
[EFM] Invalid Service Events shortly after Epoch Commit
Problem description
Currently, the `EpochStateMachine`, which orchestrates the Epoch Happy Path and Fallback, has this behaviour:
- As of the block that encounters an invalid Epoch ServiceEvent, we engage Epoch Fallback Mode [EFM] and no longer process any Epoch transitions. This creates subtle edge cases for future light clients and can potentially drive consensus into an irreconcilable state (I am not sure about the latter).
- Scenario:
  - Imagine that Epoch N ends at view 1000.
  - The block from view 1001 (first block of Epoch N+1) seals a result that has an invalid Epoch Service Event.
- How the current implementation will behave:
  - The leader (let's call her Alice) for the first view (1001) in Epoch N+1 constructs her block, so she executes `ProcessUpdate` on the Epoch State Machine (including the broken Service Event).
    - First, `EpochStateMachine` realizes that this is the first block of the epoch, so it performs an epoch transition (👉 code).
    - However, while processing the service events, `EpochStateMachine` will encounter an `InvalidServiceEventError` here, so it transitions to EFM.
    - Transitioning to EFM means that we discard the interim Epoch state we have so far (including the epoch transition), re-initialize the state with a fresh copy of the parent block's Epoch state and re-apply all the service events.
In my opinion, the consensus protocol has formally reached an irreconcilable state at this point. I think our current implementation would probably just stop producing blocks. Reasoning:
- Note that when initializing the `FallbackStateMachine`, we do not re-apply the epoch transition.
- Going into view 1001, Alice thought that she was the leader, based on the leader assignment for Epoch N+1. However, after running the Epoch State Machine, the Epoch state is still in Epoch N, in fallback mode. Most likely, Alice is not the leader for view 1001 in EFM of Epoch N.
- I don't think our software will handle this edge case. Certainly it is a violation of HotStuff's formal safety requirement: once you commit to a leader selection for some range of views (here the views belonging to Epoch N+1), you cannot change it (slightly simplified). Conceptually, we commit to the leader selection once we commit Epoch N+1.
I think a similar aspect has previously come up for the EFM recovery. Specifically, the EFM recovery cannot change the modus operandi for view ranges that the `FallbackStateMachine` has already committed.
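The scenario above can be replayed with a toy model. This is only a minimal sketch, not the flow-go implementation: `epochState`, `serviceEvent`, `applyEvents` and `processBlock` are hypothetical names, and the real `EpochStateMachine` tracks far more state. It assumes Epoch N is epoch 5 ending at view 1000, and that Epoch N+1 would end at view 2000.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidServiceEvent is a simplified, hypothetical stand-in for the
// invalid-service-event error mentioned in the issue.
var errInvalidServiceEvent = errors.New("invalid service event")

// epochState is a toy model of the epoch-related part of the Protocol State.
type epochState struct {
	Epoch     uint64 // counter of the epoch this state considers active
	FinalView uint64 // last view belonging to that epoch
	Fallback  bool   // true once Epoch Fallback Mode is engaged
}

// serviceEvent is a toy service event; Valid == false models the broken event.
type serviceEvent struct {
	Valid bool
}

// applyEvents returns an error as soon as it encounters an invalid event.
func applyEvents(events []serviceEvent) error {
	for _, e := range events {
		if !e.Valid {
			return errInvalidServiceEvent
		}
	}
	return nil
}

// processBlock mimics the behaviour described above: it first performs the
// epoch transition on an interim copy, but on an invalid service event it
// discards that copy and restarts from the parent state in fallback mode,
// so the epoch transition is lost.
func processBlock(parent epochState, view, nextFinalView uint64, events []serviceEvent) epochState {
	interim := parent
	if view > parent.FinalView { // first block of the next epoch
		interim.Epoch++
		interim.FinalView = nextFinalView
	}
	if err := applyEvents(events); err != nil {
		fallback := parent // fresh copy of the parent block's state
		fallback.Fallback = true
		return fallback
	}
	return interim
}

func main() {
	parent := epochState{Epoch: 5, FinalView: 1000}
	got := processBlock(parent, 1001, 2000, []serviceEvent{{Valid: false}})
	// The resulting state is still in epoch 5 (now in fallback mode), although
	// view 1001 lies beyond that epoch's final view.
	fmt.Printf("state after block at view 1001: %+v\n", got)
}
```

In this model, the block at view 1001 was proposed under the leader assignment of epoch 6, but the state it produces still claims epoch 5, which is the inconsistency described above.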
Suggestion of Problem Solution
- Once an Epoch is committed (happy path) to some fork, that Epoch will become active on the specified view -- if this fork is extended beyond the epoch boundary. In other words, the `FallbackStateMachine` will also enact Epoch transitions that have previously been committed by the happy path protocol.
- The aspect where `HappyPathStateMachine` and `FallbackStateMachine` differ is the way they add new view ranges beyond the already committed ones.
Let us consider the following suggestion:
- Once a leader selection for a view range is committed, it can never be overwritten/changed
- A leader selection view range is committed upon finalizing the EpochCommit event (not EpochSetup) on the happy path
- A leader selection view range is committed upon entering EFM on the fallback path
and
- If an epoch extension is added, it's appended to the last committed leader selection view range.
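The rules in this list amount to an append-only schedule of view ranges. A minimal sketch of that idea follows; `viewRange`, `leaderSchedule`, `Commit`, `Extend` and `CommitteeAt` are hypothetical names that do not exist in flow-go, and the sketch assumes that an epoch extension reuses the committee of the last committed range.

```go
package main

import (
	"errors"
	"fmt"
)

// viewRange is a hypothetical record of one committed leader selection.
type viewRange struct {
	FirstView uint64
	FinalView uint64
	Committee string // label for the committee leading these views
}

// leaderSchedule is an append-only list of committed view ranges.
type leaderSchedule struct {
	ranges []viewRange
}

var errNotContiguous = errors.New("view range must start right after the last committed view")

// Commit appends a new range. It never modifies existing entries, so a
// committed leader selection cannot be overwritten or changed.
func (s *leaderSchedule) Commit(r viewRange) error {
	if r.FinalView < r.FirstView {
		return errors.New("view range is empty")
	}
	if n := len(s.ranges); n > 0 && r.FirstView != s.ranges[n-1].FinalView+1 {
		return errNotContiguous
	}
	s.ranges = append(s.ranges, r)
	return nil
}

// Extend models an epoch extension: it appends numViews further views after
// the last committed range (assumption: same committee as that range).
func (s *leaderSchedule) Extend(numViews uint64) error {
	n := len(s.ranges)
	if n == 0 || numViews == 0 {
		return errors.New("nothing to extend")
	}
	last := s.ranges[n-1]
	return s.Commit(viewRange{
		FirstView: last.FinalView + 1,
		FinalView: last.FinalView + numViews,
		Committee: last.Committee,
	})
}

// CommitteeAt returns the committee committed for the given view, if any.
func (s *leaderSchedule) CommitteeAt(view uint64) (string, bool) {
	for _, r := range s.ranges {
		if r.FirstView <= view && view <= r.FinalView {
			return r.Committee, true
		}
	}
	return "", false
}

func main() {
	s := &leaderSchedule{}
	_ = s.Commit(viewRange{FirstView: 1, FinalView: 1000, Committee: "epoch N"})
	_ = s.Commit(viewRange{FirstView: 1001, FinalView: 2000, Committee: "epoch N+1"})

	// An attempt to redefine views that are already committed is rejected.
	err := s.Commit(viewRange{FirstView: 1500, FinalView: 2500, Committee: "other"})
	fmt.Println("overwrite attempt:", err)

	// Entering EFM later only appends views after the committed ones.
	_ = s.Extend(100)
	c, _ := s.CommitteeAt(1001)
	fmt.Println("committee at view 1001:", c)
}
```

With such a structure, the committee for view 1001 is fixed the moment Epoch N+1 is committed, regardless of whether a later block triggers EFM.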
Thoughts
There is a subtle detail in the proposed specification that we have to get right in order to not break consensus. To paraphrase, we are suggesting to use finality as a decision criterion on whether or not the Epoch State Machine accepts a service event leading to a changed leader selection for a future view range.
General Rule:
Finality cannot be used as an input for evolving the Protocol State. Generally, only information in the fork that is currently being extended can be used to determine the validity of a block. There are exceptions where using finality is safe, but those are generally very edge-casey.
Reasoning:
- Finality is a determination that nodes make locally. Very explicitly, nodes may all know the fork `A <- B <- C` and receive the candidate `D` such that `A <- B <- C <- D`, yet they might still have different finality statuses for the blocks. Specifically, this is because nodes may observe alternative forks that are subsequently orphaned and are not observed by other nodes. Nevertheless, observing subsequently orphaned forks can (rarely) progress finality on the main fork.

  For example, knowing the fork `A <- B <- C <- D`, I may conclude that `B` is finalized. On the one hand, based on my world view, `C` is still unfinalized. On the other hand, some other node may know additional children of `C` that finalize `C`. So if we allow finality to influence which Protocol State transitions in block `D` are legal/illegal, that other node and I may disagree on whether `D` is a valid extension of the chain.
- Consensus rules guarantee that finality is eventually consistent. In other words, if some node finalizes block `B` and if the network continues to produce valid blocks, all honest nodes will eventually conclude that `B` is finalized.
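The disagreement described in the example can be made concrete with a toy model. This is a sketch under simplifying assumptions, not flow-go code: `block`, `node`, `validByFinality` and `validByFork` are hypothetical, finality is reduced to a single locally known height, and the "rule" is simply whether the candidate's parent is finalized.

```go
package main

import "fmt"

// block is a toy block identified by name and height.
type block struct {
	Name   string
	Height uint64
}

// node models one participant's local view: every node here knows the same
// fork, but each has its own locally determined finalized height.
type node struct {
	Name            string
	FinalizedHeight uint64
}

// validByFinality is the problematic style of rule: the verdict depends on
// whether the candidate's parent is finalized from this node's local view.
func validByFinality(n node, parent block) bool {
	return parent.Height <= n.FinalizedHeight
}

// validByFork depends only on the content of the fork being extended, so all
// nodes that know the fork reach the same verdict.
func validByFork(fork []block, parent block) bool {
	for _, b := range fork {
		if b == parent {
			return true
		}
	}
	return false
}

func main() {
	a, b, c := block{"A", 1}, block{"B", 2}, block{"C", 3}
	fork := []block{a, b, c} // candidate D extends C

	me := node{Name: "me", FinalizedHeight: b.Height}       // I finalized up to B
	other := node{Name: "other", FinalizedHeight: c.Height} // saw extra children of C

	for _, n := range []node{me, other} {
		fmt.Printf("%s: finality rule accepts D = %v, fork rule accepts D = %v\n",
			n.Name, validByFinality(n, c), validByFork(fork, c))
	}
}
```

Under the finality-based rule the two nodes reach different verdicts on `D`, while the fork-based rule gives both the same answer.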
Suggested change.
Above, I argued that this part of the suggested rule would break consensus:
❌ A leader selection view range is committed upon finalizing the EpochCommit event (not EpochSetup) in the happy path
Let's discuss how we can modify this rule so that it works:
- I think as a first step, we need to have a definition of "committing a leader selection view range" for a specific fork. I would suggest:
  - On the happy path, a leader selection view range is committed for one specific fork when an `EpochCommit` event is included in that fork.
  - On the unhappy path, a leader selection view range is committed when the EFM logic reaches the threshold view without a valid `EpochCommit` or `EpochRecovery` event.
- With this definition, we can have different committed leader selection view ranges in different forks. This is no problem, as long as such view ranges are sufficiently far into the future.
- Though, by the time the consensus committee for the view range takes over, it must have already been finalized which committee is taking over. In other words, finality is not important when the Epoch State Machine writes the leader selection view range into the protocol state. Finality is important when this view range activates. Conceptually, it is the same mechanics as with protocol version upgrades: we need this safety buffer between writing data into the Protocol State and this taking effect in the network.