snarkOS icon indicating copy to clipboard operation
snarkOS copied to clipboard

[Fix] Correct quorum threshold calculation in test function is_round_reached()

Open ungaro opened this issue 1 year ago • 1 comments

Motivation

This PR updates the quorum threshold calculation in the is_round_reached function to ensure correct Byzantine fault tolerance. The current implementation uses a simple majority calculation, which is insufficient for Byzantine fault tolerance in certain cases.

Current implementation:

let quorum_threshold = self.validators.len() / 2 + 1;

Proposed implementation:

let quorum_threshold = self.validators.len() - (self.validators.len() - 1) / 3;

The new calculation is based on the principle that in a Byzantine fault-tolerant system with N validators, we must be able to tolerate f faulty validators, where N = 3f + 1 + k (0 <= k < 3). The correct quorum threshold is N - f, which our new calculation achieves.

Quorum Threshold Comparison

Comparison between current implementation (n / 2 + 1) and proposed implementation (n - (n - 1) / 3) for various ranges of validator counts (n).

Range: 1 <= n < 10

n Current (n/2 + 1) Proposed (n - (n-1)/3) Correct?
1 1 1 Yes
2 2 2 Yes
3 2 3 No
4 3 3 Yes
5 3 4 No
6 4 5 No
7 4 5 No
8 5 6 No
9 5 7 No

Range: 99 < n < 110

n Current (n/2 + 1) Proposed (n - (n-1)/3) Correct?
100 51 67 No
101 51 68 No
102 52 68 No
103 52 69 No
104 53 70 No
105 53 70 No
106 54 71 No
107 54 72 No
108 55 72 No
109 55 73 No

Range: 999 < n < 1010

n Current (n/2 + 1) Proposed (n - (n-1)/3) Correct?
1000 501 667 No
1001 501 668 No
1002 502 668 No
1003 502 669 No
1004 503 670 No
1005 503 670 No
1006 504 671 No
1007 504 672 No
1008 505 672 No
1009 505 673 No

Key Observations:

  1. For n < 4, both implementations give the same result, but the current implementation is correct only by coincidence.

  2. Starting from n = 3, the proposed implementation consistently requires a higher quorum threshold, which is necessary for Byzantine fault tolerance.

  3. As n increases, the difference between the two implementations becomes more pronounced. The current implementation significantly underestimates the required quorum for larger validator sets.

  4. The current implementation always returns a value close to 50% of n, which is insufficient for Byzantine fault tolerance.

  5. The proposed implementation correctly calculates the Byzantine fault-tolerant quorum (approximately 2/3 of n) for all values of n.

This comparison demonstrates that the proposed implementation provides the correct quorum threshold for Byzantine fault tolerance across all ranges of validator counts, while the current implementation fails to do so in almost all cases where n > 2.

ungaro avatar Aug 24 '24 14:08 ungaro

would like to get a review, thanks. @bendyarm @d0cd

ungaro avatar Aug 24 '24 14:08 ungaro

This may be trivial as the changes only affect the assumptions of a testing only TestNetwork. @niklaslong Wanted to double check the intention of 2f+1 here.

raychu86 avatar Jan 24 '25 22:01 raychu86

I don't know the details of this particular testing code, but I confirm, with formal verification certainty, that n-f is the general correct quorum number. 2f+1 only works if n = 3f+1, but it fails (with easy counterexamples) if n = 3f+2 or n = 3f+3. Here f is defined the maximum one that satisfies f < n/3.

acoglio avatar Jan 25 '25 01:01 acoglio

@acoglio @raychu86 @niklaslong For more background see my notes on the issue #3377 . I did not fix this one at the time I did PR#2526 because it wasn't clear to me what we were using the test for and there were multiple problems with the test code.

bendyarm avatar Jan 25 '25 03:01 bendyarm

This is indeed a mistake, thanks for correcting it!

As for why, I'm fairly certain this was leftover from debugging flaky tests with larger numbers of validators—the asynchronous nature of the tests means they still sometimes fail with the corrected bound on debug builds. We also use N = 4 in tests, so thankfully the bound is equivalent there.

niklaslong avatar Jan 28 '25 16:01 niklaslong