[Bug] Certificate fetching redundancy must be increased
🐛 Bug Report
I think this line might be technically wrong: https://github.com/ProvableHQ/snarkOS/blob/mainnet/node/bft/src/helpers/pending.rs#L55
If stake is spread among the validators unevenly, and a rich node is offline, then we might want to fetch from all of the remaining validators.
Short-term fix: re-evaluate how long these requests take to time out and be re-issued. We should not simply increase the number of requests, as this will hurt performance.
Long-term fix: consider making the Pending queue account for the stake of the respective peers and adjust redundancy accordingly.
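To make the long-term idea concrete, here is a rough sketch of what a stake-aware redundancy count could look like. It is not the snarkOS implementation: the function name `stake_aware_redundancy`, its parameters, and the assumed stake threshold of `total_stake / 3 + 1` (mirroring the count-based formula discussed in this thread) are all hypothetical. It returns the smallest number of requests such that any choice of that many reachable peers holds enough combined stake, and falls back to all reachable peers when the threshold cannot be met.

```rust
/// Hypothetical helper, not part of snarkOS.
/// `reachable_stakes`: stake of each peer we can currently send a request to.
/// `total_stake`: stake of the whole committee, including offline members.
/// Returns the smallest k such that ANY k reachable peers together hold at least
/// `total_stake / 3 + 1` stake (an assumed threshold), or the number of reachable
/// peers if that threshold cannot be reached at all.
fn stake_aware_redundancy(reachable_stakes: &[u64], total_stake: u64) -> usize {
    // Assumed stake-weighted analogue of the count-based `1 + n / 3`.
    let threshold = total_stake / 3 + 1;

    // Worst case: the peers we happen to pick are the ones with the least stake,
    // so accumulate from the smallest stake upward.
    let mut sorted = reachable_stakes.to_vec();
    sorted.sort_unstable();

    let mut accumulated: u64 = 0;
    for (i, stake) in sorted.iter().enumerate() {
        accumulated = accumulated.saturating_add(*stake);
        if accumulated >= threshold {
            return i + 1;
        }
    }

    // The reachable peers cannot reach the threshold: ask all of them.
    sorted.len()
}
```

For example, with four validators staked 7, 1, 1, 1 and the 7-stake node offline, `stake_aware_redundancy(&[1, 1, 1], 10)` asks all three remaining peers, whereas the count-based formula would ask only two. With even stake, e.g. `stake_aware_redundancy(&[10, 10, 10, 10], 40)`, both approaches give 2.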
With n = 3f + 1 validators, 1 + (3f+1)/3 = n/3 + 1 (integer division), i.e. f + 1, is the availability threshold, which in theory should be sufficient to get at least 1 honest response. However, that indeed does not take stake weight into account and relies only on the number of validators.
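As a quick sanity check of that arithmetic, here is a minimal sketch. The function name `count_based_redundancy` is made up for illustration and is not claimed to match the code at the linked line; it only evaluates the formula as written above.

```rust
/// Hypothetical helper: the count-based formula from this thread, `1 + n / 3`,
/// using integer division. It looks only at the number of validators, not at stake.
fn count_based_redundancy(num_validators: usize) -> usize {
    1 + num_validators / 3
}

/// For n = 3f + 1 the formula reduces to f + 1.
fn matches_f_plus_one(f: usize) -> bool {
    count_based_redundancy(3 * f + 1) == f + 1
}
```

For n = 4 (f = 1) this gives 2 requests, and for n = 10 (f = 3) it gives 4, regardless of how the stake is distributed among those validators.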
> If stake is spread among the validators unevenly, and a rich node is offline, then we might want to fetch from all of the remaining validators.
Maybe the confusion here comes from the academic definition, in which f covers all failures, not just Byzantine ones. Under that definition, a node holding more than f stake is never allowed to become unavailable (ignoring network partitions).
However, in practice, we should indeed not rely on a single node to respond. That might be why this function ignores stake?