nimbus-eth2

Develop infrastructure for ensuring 99% effective uptime for our fleet

Open zah opened this issue 4 years ago • 2 comments

The operation of our fleet is still frequently disrupted by various technical problems. We can improve the situation by employing:

  • Canary releases to a subset of the fleet
  • More comprehensive metrics detecting missed validator duties
  • Automated alerts based on metrics and log analysis

Furthermore, we can prepare a guide for monitoring the fleet, so this responsibility can be rotated within the team and potentially assigned to new team members with less intimate knowledge of the beacon node implementation.

zah · Nov 09 '20 17:11

One particular metric that we agreed to implement checks whether the attestations you publish are included as expected in newly created blocks and, when they are not, identifies the likely reason. For example:

  • Incompatible vote
  • Missing blocks
  • Overpopulated blocks
  • Etc
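The classification above could be sketched roughly as follows. This is a minimal illustration in Python, not code from nimbus-eth2: the `Attestation`, `Block`, and `classify` names, the inclusion window, and the order in which reasons are checked are all assumptions made for the example. The only value taken from the consensus spec is the phase 0 `MAX_ATTESTATIONS` limit of 128 attestations per block.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence

# Per-block attestation limit from the phase 0 consensus spec.
MAX_ATTESTATIONS = 128


class Reason(Enum):
    INCLUDED = "included"
    INCOMPATIBLE_VOTE = "incompatible_vote"
    MISSING_BLOCKS = "missing_blocks"
    OVERPOPULATED_BLOCKS = "overpopulated_blocks"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Attestation:
    # Hypothetical, simplified view of a published attestation.
    slot: int
    validator_index: int
    head_root: str  # block root the validator voted for


@dataclass(frozen=True)
class Block:
    # Hypothetical, simplified view of a block in the inclusion window.
    slot: int
    attestation_count: int
    included: frozenset  # set of (slot, validator_index) pairs


def classify(
    att: Attestation,
    window: Sequence[Optional[Block]],
    canonical_head_root: str,
) -> Reason:
    """Guess why `att` was or was not included.

    `window` holds one entry per slot after `att.slot`; None marks a
    slot with no block. `canonical_head_root` is the root the canonical
    chain ended up with at `att.slot`. The check order is an assumption.
    """
    key = (att.slot, att.validator_index)
    present = [b for b in window if b is not None]

    if any(key in b.included for b in present):
        return Reason.INCLUDED
    if att.head_root != canonical_head_root:
        return Reason.INCOMPATIBLE_VOTE
    if not present:
        return Reason.MISSING_BLOCKS
    if all(b.attestation_count >= MAX_ATTESTATIONS for b in present):
        return Reason.OVERPOPULATED_BLOCKS
    return Reason.UNKNOWN
```

A per-reason counter incremented from the result of `classify` would give the metric a label to alert on, for instance a sustained rise in `missing_blocks` versus `incompatible_vote`.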

zah · Nov 09 '20 18:11

https://github.com/status-im/nimbus-eth2/issues/1809 is another issue requesting a similar metric and a local alert through logging.

zah · Nov 10 '20 21:11