snarkOS [Bug] Certificate fetching redundancy must be increased

🐛 Bug Report

I think this line might be technically wrong: https://github.com/ProvableHQ/snarkOS/blob/mainnet/node/bft/src/helpers/pending.rs#L55

If stake is spread among the validators unevenly, and a rich node is offline, then we might want to fetch from all of the remaining validators.

Short term fix: re-evaluate how long it takes for these requests to time out and be re-issued. We should not dumbly increase the number of requests, as this will negatively impact performance.
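To make the timeout/re-issue idea concrete, here is a minimal sketch of the kind of check that would need tuning. It is not the snarkOS code: `PendingRequest`, `should_reissue`, and the `timeout` and `max_attempts` parameters are all hypothetical names chosen for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical record of one outstanding certificate request.
struct PendingRequest {
    /// When the most recent round of requests was sent.
    sent_at: Instant,
    /// How many times the request has been sent so far.
    attempts: u32,
}

/// Returns true if the request has waited at least `timeout`
/// and still has attempts left, so it may be re-issued.
fn should_reissue(
    req: &PendingRequest,
    now: Instant,
    timeout: Duration,
    max_attempts: u32,
) -> bool {
    req.attempts < max_attempts
        && now.saturating_duration_since(req.sent_at) >= timeout
}
```

A shorter `timeout` recovers faster from an unresponsive peer without raising the number of requests sent per round, which is the trade-off the short term fix is about.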

Long term fix: consider adjusting the Pending queue to account for the stake of the respective peers and adjust redundancy accordingly.
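A rough sketch of what stake-aware redundancy could look like, under stated assumptions: the stake needed for at least one honest response is taken to be f + 1, where the total stake is 3f + 1 + k with 0 <= k < 3, and peers are queried in a given order until their combined stake reaches that value. The function names and the shape of the input are hypothetical, not the existing Pending API.

```rust
/// Stake needed so that at least one responder is honest,
/// assuming total = 3f + 1 + k with 0 <= k < 3. Evaluates to f + 1.
fn availability_threshold(total_stake: u64) -> u64 {
    total_stake.saturating_add(2) / 3
}

/// Hypothetical helper: given the stakes of candidate peers in the
/// order they would be queried, return how many must be queried
/// before their combined stake reaches the availability threshold.
/// Returns None if all candidates together still fall short.
fn stake_aware_redundancy(candidate_stakes: &[u64], total_stake: u64) -> Option<usize> {
    let threshold = availability_threshold(total_stake);
    let mut accumulated = 0u64;
    for (i, stake) in candidate_stakes.iter().enumerate() {
        accumulated = accumulated.saturating_add(*stake);
        if accumulated >= threshold {
            return Some(i + 1);
        }
    }
    None
}
```

For example, with a total stake of 100 and six reachable peers holding 10 each (a seventh peer holding 40 being offline), the threshold is 34 and four peers have to be queried. Querying high-stake peers first would lower the count, at the cost of concentrating load on them.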

Sep 16 '25 16:09 vicsn

With n = 3f + 1 validators, 1 + (3f+1)/3 = n/3 + 1 (integer division, so f + 1) is the availability threshold, which in theory should be sufficient for getting at least one honest response. However, that indeed does not take stake weight into account; it relies only on the number of validators.
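The gap between counting validators and counting stake can be shown with a small sketch. The helpers below are hypothetical and only mirror the arithmetic in this comment (1 + n/3 with integer division); they are not taken from the snarkOS source.

```rust
/// Hypothetical count-based redundancy: 1 + n/3 with integer division.
/// For n = 3f + 1 this equals f + 1.
fn count_based_redundancy(num_validators: usize) -> usize {
    1 + num_validators / 3
}

/// Largest stake f that may be faulty, assuming total >= 3f + 1.
fn max_faulty_stake(total_stake: u64) -> u64 {
    total_stake.saturating_sub(1) / 3
}

/// Worst case for the requester: the combined stake of the `k`
/// lowest-stake validators.
fn min_stake_of_any(stakes: &[u64], k: usize) -> u64 {
    let mut sorted = stakes.to_vec();
    sorted.sort_unstable();
    sorted.iter().take(k).sum()
}
```

Take seven validators with stakes 40, 10, 10, 10, 10, 10, 10 (total 100). The count-based rule sends 3 requests, but the three smallest validators together hold 30 stake, which is within the faulty budget of 33. In that unlucky case all three requests could go to faulty nodes and none would be answered honestly.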

Sep 17 '25 18:09 raychu86

> If stake is spread among the validators unevenly, and a rich node is offline, then we might want to fetch from all of the remaining validators.

Maybe the confusion here comes from the academic definition of f, which covers all failures, not just Byzantine failures. Under that definition, a node that holds more than f stake is never allowed to become unavailable (ignoring network partitions).

However, in practice, we should indeed not rely on a single node to respond. That might be why this function ignores stake?

Sep 18 '25 20:09 kaimast