
Lightweight state-machine testing

Open moodmosaic opened this issue 9 months ago • 12 comments

Has there been any consideration of adding state-machine testing to (dynamically) test consensus logic (validators, epochs, state transitions, rewards/penalties, etc.)?

For example, a small YAML scenario might look like:

scenario:
  initial_state:
    validators:
      - {id: 1, balance: 32, active: true}
      - {id: 2, balance: 32, active: true}
  actions:
    - action: propose_block
      params: { proposer: "random" }
    - action: slash_validator
      params: { validator: "random" }
  invariants:
    - no_negative_balance
    - total_balance_properly_accounted

This could even use facilities from property-based testing tools (like Hypothesis, sketched after the list below) to:

  1. Help generate diverse sequences of state transitions
  2. Verify state invariants across all transitions
  3. Replay any found edge cases
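
For illustration, here is a minimal Hypothesis sketch of the idea (the two-validator state and the slashing rule are toy placeholders, not the real spec types):

from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

class ConsensusMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        # Toy initial state: two active validators with 32 ETH each.
        self.balances = {1: 32, 2: 32}
        self.slashed = set()

    @rule(validator=st.sampled_from([1, 2]))
    def slash_validator(self, validator):
        # Toy slashing: burn 1 ETH and mark the validator as slashed.
        if validator not in self.slashed:
            self.balances[validator] -= 1
            self.slashed.add(validator)

    @invariant()
    def no_negative_balance(self):
        assert all(b >= 0 for b in self.balances.values())

# Hypothesis generates diverse rule sequences, checks the invariant after
# every step, and shrinks any failing sequence to a minimal reproducer.
TestConsensus = ConsensusMachine.TestCase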

The real value here is finding minimal sequences that trigger consensus failures between the spec and implementations.


When client behavior diverges from the specification, state-machine testing could even (try to) reduce complex failure cases to the smallest possible reproducing example.

I think, in any case, this approach can help generate test vectors that are compatible with existing client test runners, maintaining the current workflow while expanding test coverage.

Open to feedback; I may have missed context or existing efforts.

moodmosaic avatar Mar 27 '25 22:03 moodmosaic

Hey @moodmosaic, sorry it's taken me so long to respond; it's been a busy couple of weeks. This is a cool idea, and something like this could be quite useful! I'm unaware of any efforts exactly like this, but ethereum/consensus-compliance-tests came to mind: a suite of model-based tests. See this discord thread about it: https://discord.com/channels/595666850260713488/1323326526070063204

jtraglia avatar Mar 31 '25 14:03 jtraglia

@jtraglia, thanks for the links! I checked that repo and related materials: #3831, HackMD, ethereum/pm#1096, EthResearch.

That's an interesting piece of work around modeling and mutation-based testing! I didn't see foundational or academic papers cited, making it challenging (for me) to understand the theoretical background behind this. (Perhaps @ericsson49 can add those.)


I'm proposing testing at a different, complementary layer: state-level transitions (validator lifecycle, epochs, rewards, penalties, etc.).

Inspired by Hughes's "Testing the Hard Stuff" and the Erlang QuickCheck docs, the goal is:

  • dynamically generate randomized scenarios (e.g., propose → attest → slash)
  • verify invariants (e.g., no negative balances, finalized checkpoint ≤ justified checkpoint; see the sketch after this list)
  • (attempt to) shrink failing scenarios into minimal, reproducible test cases (long shot IME, but always nice to consider)
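
Concretely, the invariant checks might look like this sketch (the attribute names follow the phase0 BeaconState, but treat the exact paths as assumptions):

def check_invariants(state):
    # In a plain-Python model of the state, a buggy transition can drive
    # a balance negative; in the SSZ uint64 types this would surface as
    # an underflow instead.
    assert all(balance >= 0 for balance in state.balances)
    # Finality can never run ahead of justification.
    assert state.finalized_checkpoint.epoch <= state.current_justified_checkpoint.epoch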

AFAICT, current fork-choice testing focuses specifically on block/attestation tree correctness. State-machine testing would broaden coverage to include complex validator-state interactions. (It should not replace existing fork-choice tests.)

I understand this isn't an easy task, but I felt it could still be valuable to share the idea.

moodmosaic avatar Apr 02 '25 12:04 moodmosaic

Hey @moodmosaic,

Perhaps @ericsson49 can add those.

Our fork choice test generation strategy is described in this post in more detail. There is also our talk at DevCon. Regarding academic papers, it was mainly inspired by these papers:

ericsson49 avatar Apr 02 '25 15:04 ericsson49

Regarding state-level transitions, @mkalinin has been working on Validator lifecycle test generation, using MiniZinc/Python

ericsson49 avatar Apr 02 '25 15:04 ericsson49

AFAICT, current fork-choice testing focuses specifically on block/attestation tree correctness. State-machine testing would broaden coverage to include complex validator-state interactions. (It should not replace existing fork-choice tests.)

Our fork choice test generator follows the state-machine testing paradigm: starting from an initial protocol state, it generates diverse sequences of events, which are exactly block | attestation | attester_slashing | tick.
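
(As a rough, hypothetical sketch of that event alphabet, much simplified from what the generator actually does:)

import random

EVENT_KINDS = ("block", "attestation", "attester_slashing", "tick")

def random_event_sequence(length, seed=None):
    # Draw a sequence over the four fork-choice event kinds; the real
    # generator constructs structured events, not bare labels.
    rng = random.Random(seed)
    return [rng.choice(EVENT_KINDS) for _ in range(length)]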

The focus has indeed been on fork choice spec coverage. The coverage can be extended, but there is the beacon fuzz project, which focuses on differential fuzzing of the state_transition. So it's perhaps not the top priority, while fork choice coverage had been lagging.

As I wrote earlier, @mkalinin has been working on testing validator lifecycle. There can be other aspects of consensus specs which deserve dedicated testing efforts.

ericsson49 avatar Apr 02 '25 16:04 ericsson49

there is a beacon fuzz project, which focuses on differential fuzzing of the state_transition

As @mkalinin and I discussed today, it makes sense to test state_transition using the state-machine approach, since implementations use caching and can introduce bugs this way (e.g., querying the cache returns a stale value). It's not clear whether the beacon fuzz approach can catch such bugs in principle, but the state-machine approach should be able to.
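
A hedged sketch of why sequencing matters here: run the same event sequence through a caching implementation and a cache-free reference, comparing after every step rather than only at the end (apply_event and root are hypothetical names, not client APIs):

def run_differential(events, impl_state, ref_state):
    # impl_state uses internal caches; ref_state recomputes from scratch.
    for i, event in enumerate(events):
        impl_state = impl_state.apply_event(event)
        ref_state = ref_state.apply_event(event)
        # A stale cache entry shows up as a divergence at some step i,
        # even if both end states might eventually agree again.
        assert impl_state.root() == ref_state.root(), f"diverged at step {i}"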

ericsson49 avatar Apr 03 '25 12:04 ericsson49

Yes, and to run what I described with beacon_fuzz, you'd have to replace the invariants with state deltas. I'd need to think about what that might look like.


Thoughts on state machine:

Start with a probabilistic baseline, which you get out of the box with Hypothesis, plus all its other facilities (nice reporting, replaying, even a chance to shrink). The key question that must be answered: what invariants can I break by generating and executing diverse sequences of state transitions?

(Only then move from the probabilistic approach to something more formal or guided.)
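
For example, dialing up the baseline is mostly a matter of Hypothesis settings (the trivial machine below is a placeholder; max_examples and stateful_step_count are real Hypothesis knobs):

from hypothesis import settings
from hypothesis.stateful import RuleBasedStateMachine, rule, run_state_machine_as_test

class TrivialMachine(RuleBasedStateMachine):
    @rule()
    def step(self):
        pass  # stand-in for a real state transition

# Generate 200 scenarios of up to 100 transitions each.
run_state_machine_as_test(
    TrivialMachine,
    settings=settings(max_examples=200, stateful_step_count=100),
)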

moodmosaic avatar Apr 03 '25 13:04 moodmosaic

Expanding on the validator lifecycle testing: there is an effort we started a while ago, but it is still WIP:

  • Validator states are generated using a model written in MiniZinc, as @ericsson49 wrote before
  • There are tests that initialize a certain validator state, apply an operation that updates it, and check that the resulting validator state is among the valid ones (sketched below)

There is a short writeup proposing validator state testing. Also, @ericsson49 and I recently discussed epoch processing testing leveraging a similar model-based approach. It would allow us to catch bugs in clients related to caching the validator set and missing updates to the cache.
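
Roughly, each such test amounts to a single-step check like this sketch (VALID_STATES and apply_op are hypothetical placeholders for the MiniZinc-generated artifacts and the spec operation):

VALID_STATES = {"pending", "active", "exited", "slashed", "withdrawable"}

def check_single_step(state, op, apply_op):
    # Initialize from a generated validator state, apply one operation
    # that updates it, and check the result is still a valid state.
    new_state = apply_op(state, op)
    assert new_state in VALID_STATES, f"{op} produced invalid state {new_state}"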

mkalinin avatar Apr 04 '25 06:04 mkalinin

What invariants can I break by generating and executing diverse sequences of state transitions? This is the key question that must be answered.

Writing explicit invariants for the pyspecs would definitely be beneficial for testing (and for formal verification). There was an attempt to describe function preconditions in docstrings (which could be extended to invariants too): #1945. I'm not sure, though, whether it's a good idea to keep invariants in the spec code, since it can clutter the specs.

ericsson49 avatar Apr 04 '25 09:04 ericsson49

(Only then move from the probabilistic approach to something more formal or guided.)

We are going to generate test vectors to test compliance of implementations with respect to other Ethereum specs, like the Execution and 3SF specs. It would be nice to use such test vectors to check the consistency of the specs themselves, which will be more or less straightforward if the invariants (pre-/post-conditions) are in an executable form.

From a test generation perspective, we are considering writing models in Python (instead of writing them in MiniZinc); then we can use constraint solvers like MiniZinc/Gecode or SMT solvers to enumerate solutions and convert them to test vectors.
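
A small sketch of the enumerate-solutions step with an SMT solver (using z3's Python bindings; the constraints here are illustrative only):

from z3 import Int, Or, Solver, sat

balance = Int("balance")
epoch = Int("activation_epoch")

s = Solver()
s.add(balance >= 0, balance <= 32)  # toy balance bounds
s.add(epoch >= 0, epoch <= 4)       # toy epoch bounds

vectors = []
while s.check() == sat and len(vectors) < 10:
    m = s.model()
    vectors.append((m[balance].as_long(), m[epoch].as_long()))
    # Block the found solution so the next check yields a fresh one.
    s.add(Or(balance != m[balance], epoch != m[epoch]))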

ericsson49 avatar Apr 04 '25 09:04 ericsson49

@ericsson49, @mkalinin, going through the materials you provided.

My (current) thinking is that this effort works well for single-step correctness and edge-state modeling; I'm unsure whether it's efficient enough to catch multi-step lifecycle bugs like activate → attest → exit → slash → withdraw, where sequencing matters.

moodmosaic avatar Apr 05 '25 08:04 moodmosaic

My (current) thinking is that this effort works well for single-step correctness and edge-state modeling; I'm unsure whether it's efficient enough to catch multi-step lifecycle bugs like activate → attest → exit → slash → withdraw, where sequencing matters.

Yes. For instance, for epoch processing we would need a different model, one which takes sequences into account. If designed right, we believe such a model can be used to yield all possible combinations of scenarios where one thing affects another.

mkalinin avatar Apr 06 '25 08:04 mkalinin