[Fix] Correct quorum threshold calculation in test function is_round_reached()
Motivation
This PR updates the quorum threshold calculation in the is_round_reached function to ensure correct Byzantine fault tolerance. The current implementation uses a simple majority calculation, which is insufficient for Byzantine fault tolerance in certain cases.
Current implementation:
let quorum_threshold = self.validators.len() / 2 + 1;
Proposed implementation:
let quorum_threshold = self.validators.len() - (self.validators.len() - 1) / 3;
The new calculation is based on the principle that in a Byzantine fault-tolerant system with N validators, we must be able to tolerate f faulty validators, where N = 3f + 1 + k (0 <= k < 3). The correct quorum threshold is N - f, which our new calculation achieves.
Quorum Threshold Comparison
Comparison between current implementation (n / 2 + 1) and proposed implementation (n - (n - 1) / 3) for various ranges of validator counts (n).
Range: 1 <= n < 10
| n | Current (n/2 + 1) | Proposed (n - (n-1)/3) | Correct? |
|---|---|---|---|
| 1 | 1 | 1 | Yes |
| 2 | 2 | 2 | Yes |
| 3 | 2 | 3 | No |
| 4 | 3 | 3 | Yes |
| 5 | 3 | 4 | No |
| 6 | 4 | 5 | No |
| 7 | 4 | 5 | No |
| 8 | 5 | 6 | No |
| 9 | 5 | 7 | No |
Range: 99 < n < 110
| n | Current (n/2 + 1) | Proposed (n - (n-1)/3) | Correct? |
|---|---|---|---|
| 100 | 51 | 67 | No |
| 101 | 51 | 68 | No |
| 102 | 52 | 68 | No |
| 103 | 52 | 69 | No |
| 104 | 53 | 70 | No |
| 105 | 53 | 70 | No |
| 106 | 54 | 71 | No |
| 107 | 54 | 72 | No |
| 108 | 55 | 72 | No |
| 109 | 55 | 73 | No |
Range: 999 < n < 1010
| n | Current (n/2 + 1) | Proposed (n - (n-1)/3) | Correct? |
|---|---|---|---|
| 1000 | 501 | 667 | No |
| 1001 | 501 | 668 | No |
| 1002 | 502 | 668 | No |
| 1003 | 502 | 669 | No |
| 1004 | 503 | 670 | No |
| 1005 | 503 | 670 | No |
| 1006 | 504 | 671 | No |
| 1007 | 504 | 672 | No |
| 1008 | 505 | 672 | No |
| 1009 | 505 | 673 | No |
Key Observations:
-
For n < 4, both implementations give the same result, but the current implementation is correct only by coincidence.
-
Starting from n = 3, the proposed implementation consistently requires a higher quorum threshold, which is necessary for Byzantine fault tolerance.
-
As n increases, the difference between the two implementations becomes more pronounced. The current implementation significantly underestimates the required quorum for larger validator sets.
-
The current implementation always returns a value close to 50% of n, which is insufficient for Byzantine fault tolerance.
-
The proposed implementation correctly calculates the Byzantine fault-tolerant quorum (approximately 2/3 of n) for all values of n.
This comparison demonstrates that the proposed implementation provides the correct quorum threshold for Byzantine fault tolerance across all ranges of validator counts, while the current implementation fails to do so in almost all cases where n > 2.
would like to get a review, thanks. @bendyarm @d0cd
This may be trivial as the changes only affect the assumptions of a testing only TestNetwork. @niklaslong Wanted to double check the intention of 2f+1 here.
I don't know the details of this particular testing code, but I confirm, with formal verification certainty, that n-f is the general correct quorum number. 2f+1 only works if n = 3f+1, but it fails (with easy counterexamples) if n = 3f+2 or n = 3f+3. Here f is defined the maximum one that satisfies f < n/3.
@acoglio @raychu86 @niklaslong For more background see my notes on the issue #3377 . I did not fix this one at the time I did PR#2526 because it wasn't clear to me what we were using the test for and there were multiple problems with the test code.
This is indeed a mistake, thanks for correcting it!
As for why, I'm fairly certain this was leftover from debugging flaky tests with larger numbers of validators—the asynchronous nature of the tests means they still sometimes fail with the corrected bound on debug builds. We also use N = 4 in tests, so thankfully the bound is equivalent there.