keep-ecdsa

Monitor third party node downtime

Open pdyraga opened this issue 4 years ago • 0 comments

The Keep ECDSA client offers plenty of metrics and diagnostics that let an operator monitor the health of their own node. However, there is no obvious way to monitor the health of third-party nodes. This matters most when an offline node is a member of an n-of-n threshold keep, because every member has to take part in signing. An easy way to determine which nodes are offline, and what the impact is, could help operators alert each other before a signature is requested from a keep.

One option is to log a warning when a node sees a peer missing from its connected-peers list for more than N minutes while that peer still has an active stake or active keeps. We could also limit the warnings to peers with which the operated node shares active keeps.

Another option, which requires no change in the client, is a remote telemetry service. The node already exposes diagnostics that include the list of connected peers; combined with the graph, this list can be used to identify offline operators that still have active keeps. This option could be enhanced further by modeling the network topology for operators who opt in to the mechanism and submit their diagnostics periodically.
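A minimal sketch of the check such a telemetry service might run, under stated assumptions. The `Report` type and the `OfflineWithActiveKeeps` function are hypothetical, and the real diagnostics payload is not modeled here. The sketch assumes the service has already collected peer lists from opted-in operators and has a count of active keeps per operator from some other source.

```go
package main

import "sort"

// Report is a hypothetical, simplified diagnostics submission from one
// opted-in operator. It does not mirror the client's real payload.
type Report struct {
	Reporter       string   // operator that submitted the report
	ConnectedPeers []string // peers the reporter is currently connected to
}

// OfflineWithActiveKeeps returns, in sorted order, the operators that have at
// least one active keep but neither submitted a report nor appear in any
// reporter's connected-peers list.
func OfflineWithActiveKeeps(reports []Report, activeKeeps map[string]int) []string {
	seen := make(map[string]bool)
	for _, r := range reports {
		seen[r.Reporter] = true // a reporting node is online by definition
		for _, peer := range r.ConnectedPeers {
			seen[peer] = true
		}
	}

	var offline []string
	for operator, keeps := range activeKeeps {
		if keeps > 0 && !seen[operator] {
			offline = append(offline, operator)
		}
	}
	sort.Strings(offline)
	return offline
}
```

An operator that no reporter sees is not necessarily offline; it may only be disconnected from the reporting nodes. The more operators opt in, the more reliable the result gets.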

pdyraga, Jun 01 '21 09:06