pbft-consensus
pbft-consensus copied to clipboard
Liveness fix (introduce relocking concept)
This PR is solution proposal for liveness issue by introducing a re-locking concept.
Re-locking enables that node which has locked some proposal in some of the previous rounds, locks a new one if it witnesses enough (2 * F
, where F denotes max count of faulty nodes in order to have Byzantine fault tolerant system) COMMIT messages from the most recent round, which have the same proposal as the one sent the current proposer via a PRE-PREPARE message. If that is the case, then validator node rejects its locked proposal and immediately locks the one from PRE-PREPARE messages and votes for it by sending COMMIT message.
In order to track a round when the given node is initially locked, a lockedRound *uint64
field is introduced to the state
struct. When lockedRound
is set to nil
, it denotes that no proposal is locked (yet).
@Stefan-Ethernal Could you rebase this branch and see how it affects the two liveness issues we have reported?
@Stefan-Ethernal Could you rebase this branch and see how it affects the two liveness issues we have reported?
You are reading my mind. :smiley: I have done so and both TestE2E_Partition_LivenessIssue_Case1_FiveNodes_OneFaulty
and TestE2E_Partition_LivenessIssue_Case2_SixNodes_OneFaulty
are executed successfully (there is no timeout). Although the latter one needed more time to get executed (~75s, comparing to the one with five nodes, which needed ~30s).
There is still left to see whether this test will pass (https://github.com/0xPolygon/pbft-consensus/commit/27b452b1fe7e492ccc089436a9bbbe024bdb85f6), although I doubt, since as I recall Saša tried running it with liveness fix included, but had no luck. Will follow you up tomorrow how it went.
There is still left to see whether this test will pass
Are the tests passing? It seems to me they are. I guess independent of whether the e2e pass or not we have to run the fuzz framework on this branch for a while before merging.
Are the tests passing? It seems to me they are. I guess independent of whether the e2e pass or not we have to run the fuzz framework on this branch for a while before merging.
This one (the latest Saša implemented, which was inspired by one fuzz framework failed execution) is still not passing (even with liveness fix): TestE2E_Network_Stuck_Locked_Node_Dropped
.
This is state of the nodes at the moment of test failure (A_0 and A_1 have locked same proposal, A_2 hasn't locked any proposal and seems it lacks PREPARE messages from A_0 and A_1 in order to lock a proposal eventually, whereas A_3 is dropped after it has locked the same proposal as A_0 and A_1 did):
Node A_0, running: true, isProposalLocked: true, proposal data: &{[7 41 57 72] 2022-04-05 16:51:37.206329941 +0200 CEST m=+1.003160765}
Node A_1, running: true, isProposalLocked: true, proposal data: &{[7 41 57 72] 0001-01-01 00:00:00 +0000 UTC}
Node A_2, running: true, isProposalLocked: false, proposal data: &{[74 193 3 10] 2022-04-05 16:52:36.237117875 +0200 CEST m=+60.033948727}
Node A_3, running: false, isProposalLocked: true, proposal data: &{[7 41 57 72] 0001-01-01 00:00:00 +0000 UTC}
Let's put this on hold
Kudos, SonarCloud Quality Gate passed!
0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell
No Coverage information
0.0% Duplication