pbft-consensus icon indicating copy to clipboard operation
pbft-consensus copied to clipboard

Liveness fix (introduce relocking concept)

Open Stefan-Ethernal opened this issue 3 years ago • 6 comments

This PR is solution proposal for liveness issue by introducing a re-locking concept. Re-locking enables that node which has locked some proposal in some of the previous rounds, locks a new one if it witnesses enough (2 * F, where F denotes max count of faulty nodes in order to have Byzantine fault tolerant system) COMMIT messages from the most recent round, which have the same proposal as the one sent the current proposer via a PRE-PREPARE message. If that is the case, then validator node rejects its locked proposal and immediately locks the one from PRE-PREPARE messages and votes for it by sending COMMIT message. In order to track a round when the given node is initially locked, a lockedRound *uint64 field is introduced to the state struct. When lockedRound is set to nil, it denotes that no proposal is locked (yet).

Stefan-Ethernal avatar Feb 22 '22 11:02 Stefan-Ethernal

@Stefan-Ethernal Could you rebase this branch and see how it affects the two liveness issues we have reported?

ferranbt avatar Apr 05 '22 13:04 ferranbt

@Stefan-Ethernal Could you rebase this branch and see how it affects the two liveness issues we have reported?

You are reading my mind. :smiley: I have done so and both TestE2E_Partition_LivenessIssue_Case1_FiveNodes_OneFaulty and TestE2E_Partition_LivenessIssue_Case2_SixNodes_OneFaulty are executed successfully (there is no timeout). Although the latter one needed more time to get executed (~75s, comparing to the one with five nodes, which needed ~30s). There is still left to see whether this test will pass (https://github.com/0xPolygon/pbft-consensus/commit/27b452b1fe7e492ccc089436a9bbbe024bdb85f6), although I doubt, since as I recall Saša tried running it with liveness fix included, but had no luck. Will follow you up tomorrow how it went.

Stefan-Ethernal avatar Apr 05 '22 14:04 Stefan-Ethernal

There is still left to see whether this test will pass

Are the tests passing? It seems to me they are. I guess independent of whether the e2e pass or not we have to run the fuzz framework on this branch for a while before merging.

ferranbt avatar Apr 05 '22 14:04 ferranbt

Are the tests passing? It seems to me they are. I guess independent of whether the e2e pass or not we have to run the fuzz framework on this branch for a while before merging.

This one (the latest Saša implemented, which was inspired by one fuzz framework failed execution) is still not passing (even with liveness fix): TestE2E_Network_Stuck_Locked_Node_Dropped. This is state of the nodes at the moment of test failure (A_0 and A_1 have locked same proposal, A_2 hasn't locked any proposal and seems it lacks PREPARE messages from A_0 and A_1 in order to lock a proposal eventually, whereas A_3 is dropped after it has locked the same proposal as A_0 and A_1 did):

Node A_0, running: true, isProposalLocked: true, proposal data: &{[7 41 57 72] 2022-04-05 16:51:37.206329941 +0200 CEST m=+1.003160765}
Node A_1, running: true, isProposalLocked: true, proposal data: &{[7 41 57 72] 0001-01-01 00:00:00 +0000 UTC}
Node A_2, running: true, isProposalLocked: false, proposal data: &{[74 193 3 10] 2022-04-05 16:52:36.237117875 +0200 CEST m=+60.033948727}
Node A_3, running: false, isProposalLocked: true, proposal data: &{[7 41 57 72] 0001-01-01 00:00:00 +0000 UTC}

Stefan-Ethernal avatar Apr 05 '22 15:04 Stefan-Ethernal

Let's put this on hold

vcastellm avatar Aug 23 '22 09:08 vcastellm

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

sonarqubecloud[bot] avatar Oct 26 '23 09:10 sonarqubecloud[bot]