flow-go icon indicating copy to clipboard operation
flow-go copied to clipboard

[EFM | M2] Epoch State Machines should not use `parentState` in their business logic

Open AlexHentschel opened this issue 9 months ago • 2 comments

Problem Definition

We have different state machines for orchestrating Epoch transitions in the package state/protocol/protocol_state/epochs. The current implementation use the parentState in their business logic in numerous places.

In my opinion, this is extremely error prone. To the best of my understanding, the current code has at least one security-critical bug related to this here. Specifically, buildNextEpochActiveParticipants has a check to enforce that already ejected nodes are not re-admitted. For the current logic, two service events eject node X followed by a Epoch Setup in the same block could be a blind spot for this check. Because we always compare to the parent state (where ejection has not jet been applied). The current code would accept an Epoch Setup, where node X is still listed as a participant and override the prior ejection.

In summary, when reviewing the state machines in the package epochs there are various places where I could not convince myself of the correctness of the code, because the parent state was just used without clear documentation why this is safe.

Context

I think this design flaw is in part originating from our separation of ProtocolStateEntry and RichProtocolStateEntry:

  • While the parent state is available as RichProtocolStateEntry, the evolved protocol state is only maintained as ProtocolStateEntry.
    • This is on purpose, because we don't want to deal with the complexity of also maintaining the CurrentEpochIdentityTable and NextEpochIdentityTable throughout the epoch state evolution. As those fields are part of the RichProtocolStateEntry, we only work with the ProtocolStateEntry when evolving the state.
    • Unfortunately, this has the undesired side effect that the evolving state has reference to the full EpochSetup and EpochCommit events. Whenever we need those, we need to go back to the parent state. And this is where the confusing mix of using the evolving state as well as the unevolved parents originates in my opinion.
  • Furthermore, we have various service methods that work on the RichProtocolStateEntry, because they want the service events though don't care about the Identity tables. Thereby we have further instances, where we might erroneously use the parentState instead of the updated state

Goal

The goal of this issue is three-fold:

  1. I would be inclined to remove as many usages of parentState as possible. In my opinion, the only place should be the state machine constructor, where we copy the parent.
  2. In all remaining places where we cannot avoid using parentState, we must include a detailed comment explaining why this is safe even in the case of prior ejection events sealed in the current block.
  3. include tests for the HappyPathStateMachine as well as the FallbackStateMachine that confirm the correct behaviour in the following case:
    • we have one block sealing two service events: eject node X followed by a Epoch Setup
    • Epoch Setup lists node X participant, which would illegally override the prior ejection. As the smart contract emitted the ejection event before, it should account for this and not include X in the setup.
    • Encountering such a service event should trigger EFM.

suggestion

I would suggest to add an interim struct between ProtocolStateEntry and RichProtocolStateEntry. In more detail:

  • rename ProtocolStateEntry to EpochMinStateEntry
  • add struct EpochStateEntry:
    type EpochStateEntry struct {
      *EpochMinStateEntry
    
      PreviousEpochSetup        *EpochSetup
      PreviousEpochCommit       *EpochCommit
      CurrentEpochSetup         *EpochSetup
      CurrentEpochCommit        *EpochCommit
      NextEpochSetup            *EpochSetup
      NextEpochCommit           *EpochCommit
    }
    
  • rename RichProtocolStateEntry to EpochRichStateEntry and base it on EpochStateEntry
    type EpochRichStateEntry struct {
      EpochStateEntry
    
      CurrentEpochIdentityTable IdentityList
      NextEpochIdentityTable    IdentityList
    }
    
  • Most of the existing code will only care about EpochMinStateEntry and EpochRichStateEntry (as before)
  • However, code that needs the EpochMinStateEntry plus the actual service events would use EpochStateEntry. In particular, this would be the state machines. Thereby, we can cleanly avoid any usage of the parentState in the logic that processes service events.

AlexHentschel avatar Jun 01 '24 05:06 AlexHentschel