vespa
Distributor should log and emit metrics when it is not able to send bucket info requests to a node
When the cluster state changes, a distributor sends bucket info requests to all nodes that are marked as available in the new cluster state. These are the nodes that have passed the cluster controller's health checks and are presumed to be online for serving. However, if there is a network partition (such as asymmetric connectivity between node subsets), the availability of nodes as seen from a distributor will differ from that seen from the cluster controller.
This will cause state version convergence to fail, and may cause unavailability for client operations.
Today, bucket info requests are silently retried until they succeed or a new cluster state version is received, so visibility into the problem is limited. I suggest adding more obvious metrics and log warnings that identify the nodes that cannot be reached when such a situation occurs. Automatic monitoring can then be added for these metrics.
Logging has already been added.
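To make the metrics part of the suggestion concrete, here is a rough sketch of the bookkeeping it implies. It is an illustrative Python model only, not the distributor's actual code: the class `BucketInfoRequestTracker`, its callbacks, the `warn_after` threshold, and the metric names are all hypothetical and do not correspond to existing Vespa identifiers.

```python
import logging
from collections import Counter

log = logging.getLogger("distributor.bucket_info")  # hypothetical logger name


class BucketInfoRequestTracker:
    """Hypothetical per-node tracker for failed bucket info request sends."""

    def __init__(self, warn_after=3):
        self.warn_after = warn_after   # assumed threshold before a node is flagged
        self.state_version = None
        self.failed_sends = Counter()  # node index -> failures in current state version
        self.total_failures = 0        # ever-increasing counter metric

    def on_new_cluster_state(self, version):
        # Pending requests are tied to a state version, so per-node counts restart.
        self.state_version = version
        self.failed_sends.clear()

    def on_send_failed(self, node, reason):
        self.failed_sends[node] += 1
        self.total_failures += 1
        # Warn once per node per state version instead of on every retry.
        if self.failed_sends[node] == self.warn_after:
            log.warning(
                "Unable to send bucket info request to node %d for cluster "
                "state version %s after %d attempts: %s",
                node, self.state_version, self.warn_after, reason,
            )

    def on_send_succeeded(self, node):
        # A successful send means the node is reachable again.
        self.failed_sends.pop(node, None)

    def unreachable_nodes(self):
        return sorted(n for n, c in self.failed_sends.items() if c >= self.warn_after)

    def metrics(self):
        # Hypothetical metric names, chosen for illustration only.
        return {
            "bucket_info_send_failures_total": self.total_failures,
            "bucket_info_unreachable_nodes": len(self.unreachable_nodes()),
        }
```

A gauge such as the assumed `bucket_info_unreachable_nodes` staying above zero for longer than a normal state transition would be a natural signal to alert on, while the counter shows how much retrying is going on overall.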