nimbus-eth2
Develop infrastructure for ensuring 99% effective uptime for our fleet
The operation of our fleet is still frequently disrupted by various technical problems. We can improve the situation by employing:
- Canary releases to a subset of the fleet
- More comprehensive metrics detecting missed validator duties
- Automated alerts based on metrics and log analysis
Furthermore, we can prepare a guide for monitoring the fleet, so this responsibility can be rotated within the team and potentially assigned to new team members with less intimate knowledge of the beacon node implementation.
One particular metric that we agreed to implement checks whether the attestations that a validator publishes are included as expected in newly created blocks and, when they are not, identifies the likely reason. For example:
- Incompatible vote
- Missing blocks
- Overpopulated blocks
- Etc
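To make the categories above concrete, here is a rough Python sketch of how a missed inclusion could be classified. It is only an illustration, not how nimbus-eth2 does or will do it: the `Attestation` and `Block` types, the `ancestor_roots` field, the `classify_inclusion` function, and the 32-slot lookahead are all hypothetical, and `MAX_ATTESTATIONS = 128` is assumed from the phase 0 mainnet preset. It also ignores aggregation and treats each attestation as an individual entry in the block.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Optional, Tuple

# Assumption: per-block attestation limit from the phase 0 mainnet preset.
MAX_ATTESTATIONS = 128


class Outcome(Enum):
    INCLUDED = "included"
    INCOMPATIBLE_VOTE = "incompatible_vote"
    MISSING_BLOCKS = "missing_blocks"
    OVERPOPULATED_BLOCKS = "overpopulated_blocks"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Attestation:
    # Hypothetical, simplified view of a published attestation.
    slot: int
    committee_index: int
    head_root: str  # block root the validator voted for


@dataclass(frozen=True)
class Block:
    # Hypothetical, simplified view of a block seen by the monitor.
    slot: int
    attestations: Tuple[Attestation, ...]
    ancestor_roots: FrozenSet[str]  # roots on this block's chain, assumed precomputed


def classify_inclusion(
    att: Attestation,
    blocks_by_slot: Dict[int, Optional[Block]],
    lookahead: int = 32,  # assumed inclusion window, in slots
) -> Outcome:
    """Classify why an attestation was (or was not) included on chain."""
    window = range(att.slot + 1, att.slot + 1 + lookahead)
    # Slots without an entry, or mapped to None, count as missing blocks.
    present = [b for s in window if (b := blocks_by_slot.get(s)) is not None]

    if any(att in b.attestations for b in present):
        return Outcome.INCLUDED
    if not present:
        return Outcome.MISSING_BLOCKS

    # Blocks whose chain contains the block this attestation voted for.
    compatible = [b for b in present if att.head_root in b.ancestor_roots]
    if not compatible:
        return Outcome.INCOMPATIBLE_VOTE
    if all(len(b.attestations) >= MAX_ATTESTATIONS for b in compatible):
        return Outcome.OVERPOPULATED_BLOCKS
    return Outcome.UNKNOWN


if __name__ == "__main__":
    att = Attestation(slot=10, committee_index=0, head_root="a")
    chain = {11: Block(slot=11, attestations=(), ancestor_roots=frozenset({"b"}))}
    print(classify_inclusion(att, chain).value)
```

Each outcome could then feed a labelled counter, so that an alert can fire on a specific cause rather than on a generic "attestation missed" signal.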
https://github.com/status-im/nimbus-eth2/issues/1809 is another issue requesting a similar metric and a local alert through logging.