vespa Distributor should log and emit metrics when it is not able to send bucket info requests to a node

Open vekterli opened this issue 6 years ago • 1 comment

When a cluster state changes, a distributor will send bucket info requests to all nodes that are marked as available in the cluster state. These are all nodes that have passed the cluster controller's health checks and that are presumed to be online for serving. However, if there exists a network partition (such as asymmetric connectivity between node subsets), the availability of nodes as seen from a distributor will differ from that seen from the cluster controller.

This will cause state version convergence to fail, and may cause unavailability for client operations.

Today, bucket info requests are silently retried until they succeed or a new cluster state version is received, meaning visibility is limited. I suggest adding more obvious metrics and log warnings identifying dodgy nodes when such a situation occurs. Automatic monitoring can then be added for these metrics.
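
To make the suggestion concrete, here is a rough sketch of the bookkeeping I have in mind. It is written in Python purely for illustration and is not the distributor's actual code; the class `BucketInfoSendTracker`, its method names, the logger name and the `warn_after` threshold are all hypothetical.

```python
import logging
from collections import Counter

# Hypothetical logger name, not an existing Vespa log component.
log = logging.getLogger("distributor.bucket_info")


class BucketInfoSendTracker:
    """Hypothetical helper that counts failed bucket info sends per node."""

    def __init__(self, warn_after=3):
        self.warn_after = warn_after  # assumed threshold before warning
        self.failures = Counter()     # node index -> consecutive failed sends
        self.failed_total = 0         # monotonic count, usable as a metric

    def on_send_failed(self, node, state_version):
        self.failures[node] += 1
        self.failed_total += 1
        # Warn once when the threshold is reached, not on every retry.
        if self.failures[node] == self.warn_after:
            log.warning(
                "Unable to send bucket info request to node %d for cluster "
                "state version %d after %d attempts",
                node, state_version, self.warn_after,
            )

    def on_send_ok(self, node):
        # A successful send resets the consecutive failure count.
        self.failures.pop(node, None)

    def on_new_cluster_state(self):
        # Pending requests are abandoned when a new state version arrives.
        self.failures.clear()

    def unreachable_nodes(self):
        # Nodes that have hit the threshold, suitable for a gauge metric.
        return sorted(n for n, c in self.failures.items() if c >= self.warn_after)


tracker = BucketInfoSendTracker(warn_after=2)
tracker.on_send_failed(3, 17)
tracker.on_send_failed(3, 17)  # logs a warning naming node 3
print(tracker.unreachable_nodes())
```

Warning only when the threshold is first reached keeps the log quiet during short blips, while `failed_total` and `unreachable_nodes()` give monitoring something to alert on.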

May 04 '18 12:05 vekterli

Logging is already added.

Jun 23 '21 11:06 geirst
