Description

This PR aims to add support for the Persistence Analysis issue proof from the IBFT Analysis paper: Section 5 Persistence Analysis, Lemma 9 Proof

Essentially, it aims to prove that for any given proposal P at sequence S, there can be a series of events that would cause 2 honest validators to insert a sealed proposal P and P' respectively, for sequence S, where P != P'.

The series of steps is simulated by using state machine snapshots from the PBFT implementation, where only the Backend interface is mocked, and all states are simulated as they would actually run.

Test in quesiton is TestPBFT_Persistence.

Terminology

N -total number of nodes F - number of tolerated faulty nodes v - random validator from the set validator set W - all validator nodes, apart from v, including the Byzantine nodes Whonest - all honest validator nodes in W

Prerequisites

The persistence problem is present for N >= 4 and F >= 1.

v is not a proposer for any proposal.

Assumptions

The following assumptions are present in the TestPbft_Persistence test:

Node 0 is the proposer for the first proposal, sequence 1, round 0
Node 1 is the proposal for the second proposal, sequence 1, round 1

N = 5 F = 1

v = Node 2 W = [Node 0, Node 1, Node 3, Node 4] Whonest = [Node 0, Node 1, Node 3] Byzantine Node = Node 4

Steps outlined in the paper

The proposer for sequence 1, round 0 (Node 0) sends out a PREPREPARE message to all other nodes (W and v) for proposal P, and moves to Validate state
All nodes gossip PREPARE messages, receive enough and move to Validate state
In Validate state, all honest nodes (Whonest and v) gossip COMMIT messages, and lock on proposal P
The byzantine node (Node 4) sends a valid COMMIT message to v (Node 2), but a malformed COMMIT message to all nodes in Whonest *
v (Node 2) receives Quorum COMMIT messages and moves to Commit state to insert the proposal to the backend
v goes into Done state after inserting the proposal, finishes the cycle
All other honest nodes Whonest (Node 0, Node 1, Node 3) receive a malformed COMMIT message, and trigger a ROUND CHANGE message, moving to the Round Change state *
All nodes in Whonest, as well as the byzantine node (Node 4) agree to move to round 1 in the Round Change state
The proposer for sequence 1, round 1 is Node 1
Node 1 sends out a PREPREPARE message to all other nodes, for proposal P', and moves to Validate state
All other nodes in W gossip a PREPARE message and move to Validate state
All nodes in W gossip COMMIT messages, receive enough and lock on proposal P'
All nodes in W successfully validate the proposal (the byzantine node doesn't send any malformed messages in round 1)
All nodes in W successfully insert proposal P' to the backend

At this point, Node 2 (v) has inserted proposal P for sequence S, and nodes Node 0, Node 1, Node 3, Node 4 have inserted proposal P' for sequence S, thus causing a persistence problem.

* There is no need to simulate a malformed gossipped commit message in the test, ~since pbft-consensus doesn't handle commit message verification, and leaves it up to the Backend interface to verify everything during Commit state~. This part of the test was just mocked so the Backend interface failed to insert the proposal for nodes in W (simulating an invalid commit signature error for example).

EDIT: Following @ferranbt's PR which was merged out in the meantime, pbft-consensus actually has the ability to verify commit messages with the backend. This test can be changed to override the verify commit message handler for the Backend interface, so it's more closer to the original paper steps.

Additional changes

The PR also moves out test helper code that was previously present in consensus_test.go to consensus_testing.go and in state_test.go to state_testing.go, respectively.

Feb 17 '22 21:02 zivkovicmilos

Hi @zivkovicmilos. Happy to see the persistence test. A couple of comments on the PR that we can discuss.

First, we avoid on the same PR to refactor the business logic (i.e. add a test to detect the persistence liveness) of the consensus and clean the arquitecture (i.e. move test utilities on separated files), specially if new files are being created. This is because the review UI in Github gets tricky and it is easy to miss stuff.

Second, you hit the point that we discussed over PR-22. Complex situations that coordinate multiple nodes get a bit convoluted when done in unit tests.

This is why we follow a layered test arquitecture on this repo: unit tests (single node state machine, no async calls, no concept of network), integration tests (multi node state machine, with async calls and "simulated network"), fuzz testing (random integration tests).

Though we can argue if the frameworks we provide for each layer are good or not (or could be more optimal), as it is right now, your TestPBFT_Persistence is in a middleground between unit and integration which does not follow the layered architecture we have in place.

Feb 21 '22 09:02 ferranbt

Hey @ferranbt,

Sorry for the late reply concerning this PR.

There are a couple of points I’d like to address here:

First, we avoid on the same PR to refactor the business logic (i.e. add a test to detect the persistence liveness) of the consensus and clean the arquitecture (i.e. move test utilities on separated files), specially if new files are being created. This is because the review UI in Github gets tricky and it is easy to miss stuff.

Completely agree with this 👍 We should always try to keep the scope of any code changes minimal so they’re easily recognizable when doing reviews. I could’ve probably opened up a couple of separate PRs for code cleanup efforts, but decided to keep a part of them centralized here since they’re not drastic changes,

The changes that have been made are really concerning just helper test logic that was contained in consensus_test.go and state_test.go, as mentioned in the PR description. The state_test.go helper code was just moved to state_testing.go (without any additions / removals), and the existing helper code in consensus_test.go was moved out to consensus_testing.go, with additions added for the central test in this PR.

It's no problem to make sure we don't pollute the PR code changes with cleanup in future PRs 💯

Second, you hit the point that we discussed over PR-22 . Complex situations that coordinate multiple nodes get a bit convoluted when done in unit tests.

What do you find complex in the test outlined in this PR?

The PR adds support for setting up a mock PBFT cluster, and for setting hooks for the backend of each individual node - something that was already present in the codebase, but underutilized. This is powerful, not just for the test that I’ve introduced, but also for existing tests - some of which I’ve changed to use this simple hook logic for the backend.

For each backend operation you want executed within a state, you can attach a hook to execute, for each of the nodes. It gives the test maker complete control over the testing environment, as should always be the case.

The flow map that you introduced in the liveness test is considered convoluted because you need to cherry pick and carefully construct gossip omitance paths - not to mention that there is practically no way to define Byzantine messages using that approach, something which is possible to achieve by using the hook implementation for your backend, that you can override on a test by test and (more importantly) on a node to node basis.

This is why we follow a layered test arquitecture on this repo: unit tests (single node state machine, no async calls, no concept of network), integration tests (multi node state machine, with async calls and “simulated network”), fuzz testing (random integration tests).

I get your point - there are some times when you want to just set up a cluster and let it run in the background, and do checking after it goes through all of the states.

The persistence test introduced in the mentioned paper (and the liveness test too) require very careful scene setup for each of the states of the consensus process. These states should to be checked before and after you run each state, to make sure:

The nodes are on track to fulfill the test expectation
No unexpected error occurred during the cycle

There is no need for the persistence test (and liveness test) to run for 5mins and make us rely on logs in the output because of a general lack of assert statements.

TestPBFT_Persistence simply sets up a mock PBFT cluster, mocks the backend implementation and lets the code run on a per state basis - allowing us to do an autopsy of each node’s state before and after they go into another state. It’s quick, runs for just under 2s and can assure us that the persistence problem is present - without the need to set up fancy message passing frameworks for this, as you’ve already had underutilized support for the gossip transport on the backend.

Feb 25 '22 17:02 zivkovicmilos

Kudos, SonarCloud Quality Gate passed!