celo-blockchain
Investigate Missing Signatures
Describe the bug Validators that appear to be otherwise up are not having their signatures included in the parent aggregated seal.
Or and I have catalogued the following reasons why a signature is not included. This list is likely not exhaustive:
- A node does not receive a pre-prepare message, which stops it from participating in consensus for that sequence (and thus from sending a commit).
- A node does not receive a commit from another node, and so does not include that commit message when creating the parent aggregated seal from the aggregated seal and the parent commits.
- A node does not receive (or throws away) commit messages from multiple nodes before proposing a block, resulting in no or very few extra signatures being added to the aggregated seal when creating the parent aggregated seal.
Nodes can fail to receive messages because of long-standing peering issues or because of potentially transient issues. There do not appear to be any remaining long-standing peering issues on baklava. A long-standing peering issue is likely caused by a network misconfiguration (for example, closed ports causing a node to reject inbound messages) or by issues such as not enabling a hard fork.
Next Steps To further debug this issue, more information on how nodes are peered and on when commits are sent, received, and bundled would help determine where messages are getting lost or dropped. In addition, logging when a node should have participated in consensus but did not (because it never had a pre-prepare) would help isolate specific causes.
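As a rough illustration of the second kind of logging, the sketch below tracks whether a pre-prepare was seen for each sequence and reports sequences that finished without one. The `participationTracker` type and its method names are hypothetical and do not correspond to existing celo-blockchain code; the real consensus engine would need to call such hooks from its own message handlers.

```go
package main

import "fmt"

// participationTracker is a hypothetical helper that remembers, per sequence,
// whether this node saw a pre-prepare.
type participationTracker struct {
	sawPreprepare map[uint64]bool
}

func newParticipationTracker() *participationTracker {
	return &participationTracker{sawPreprepare: make(map[uint64]bool)}
}

// OnPreprepare records that a pre-prepare arrived for the given sequence.
func (t *participationTracker) OnPreprepare(seq uint64) {
	t.sawPreprepare[seq] = true
}

// OnSequenceDone reports whether the sequence finished without a pre-prepare
// having been seen, then forgets the sequence.
func (t *participationTracker) OnSequenceDone(seq uint64) bool {
	missed := !t.sawPreprepare[seq]
	delete(t.sawPreprepare, seq)
	return missed
}

func main() {
	t := newParticipationTracker()
	t.OnPreprepare(100) // sequence 101 never gets a pre-prepare in this example
	for _, seq := range []uint64{100, 101} {
		if t.OnSequenceDone(seq) {
			fmt.Printf("sequence %d finished without a pre-prepare; node did not participate\n", seq)
		}
	}
}
```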
Once the networks stabilize, exploring patterns of missed blocks could narrow down which of the three causes is responsible. Consistent peering issues that cause nodes to miss blocks via mechanism 1 will result in a stable lack of signing under a specific proposer, whereas mechanism 2 will cause the lack of signing to be associated with the proposer immediately before the node that has the network issue (because of the shuffled round robin, this order changes across epochs but is stable within an epoch).
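One way to look for such patterns is to tally, for a given validator, how often its signature is missing under each proposer within a single epoch. The sketch below assumes a simplified, hypothetical `Block` record (block number, proposer index, and the validator indices missing from the seal); extracting that data from real blocks is left out.

```go
package main

import "fmt"

// Block is a hypothetical, simplified record of one block within an epoch.
type Block struct {
	Number   uint64
	Proposer int   // index of the validator that proposed this block
	Missing  []int // validator indices whose signature is absent for this block
}

// missesByProposer counts, per proposer, the blocks in which the given
// validator's signature is missing.
func missesByProposer(blocks []Block, validator int) map[int]int {
	counts := make(map[int]int)
	for _, b := range blocks {
		for _, m := range b.Missing {
			if m == validator {
				counts[b.Proposer]++
				break // count each block at most once
			}
		}
	}
	return counts
}

func main() {
	// Assumed example data: validator 3 is only ever missing under proposer 1.
	blocks := []Block{
		{Number: 10, Proposer: 0, Missing: nil},
		{Number: 11, Proposer: 1, Missing: []int{3}},
		{Number: 12, Proposer: 2, Missing: []int{5}},
		{Number: 13, Proposer: 0, Missing: nil},
		{Number: 14, Proposer: 1, Missing: []int{3, 5}},
	}
	fmt.Println("misses for validator 3 by proposer:", missesByProposer(blocks, 3))
}
```

Misses concentrated on one proposer within an epoch point toward a stable peering problem. Whether the tally should be keyed on the block's own proposer or on the proposer of the following block depends on which seal (aggregated or parent aggregated) the missing signatures are read from.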