Improve observability of cluster state convergence issues on the Cluster Controller
Background
Vespa uses algorithmic data placement, which means all cluster nodes need to agree on the inputs to the algorithm to be able to compute the same output. This is critical to ensure all distributors deterministically agree on which nodes should have exclusive ownership for data buckets—bucket ownership sets must never overlap for a given cluster state version. The algorithm input is derived from two sources:
- The cluster configuration, provided by the config server. This contains the set of all configured nodes and the hierarchy they're located in (groups).
- The cluster state, provided by the cluster controller. This contains the subset of the configured nodes that are actually currently available.
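To make the determinism requirement concrete, here is a toy sketch of a placement function that is a pure function of exactly these two inputs. It uses rendezvous hashing purely for illustration and is not Vespa's actual ideal-state algorithm; the function and parameter names are made up for this example.

```python
import hashlib

def ideal_nodes(bucket_id: int, configured_nodes: list[int],
                available_nodes: set[int], redundancy: int) -> list[int]:
    """Toy placement via rendezvous (highest-random-weight) hashing.

    This is NOT Vespa's actual ideal-state algorithm; it only illustrates
    that ownership is a pure function of (cluster config, cluster state).
    """
    candidates = [n for n in configured_nodes if n in available_nodes]

    def weight(node: int) -> int:
        digest = hashlib.sha256(f"{bucket_id}:{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(candidates, key=weight, reverse=True)[:redundancy]

# Two distributors that see the same config and cluster state compute the
# same owners; if one of them has a stale view of either input, ownership
# sets can overlap or leave buckets unowned.
print(ideal_nodes(0x42, configured_nodes=[0, 1, 2, 3],
                  available_nodes={0, 1, 3}, redundancy=2))
```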
The cluster controller continuously polls all configured nodes for health and connectivity and maintains a set of nodes it considers healthy. When this set changes, a new cluster state version is published. A version is considered to have converged in the cluster when all available nodes in the cluster state have individually successfully reported that they have converged to the given version.
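The convergence rule can be expressed as a small predicate. The data model below (a map from node index to the latest version each node has acknowledged) is a hypothetical simplification, not the cluster controller's actual bookkeeping.

```python
def cluster_converged(published_version: int,
                      available_nodes: set[int],
                      reported_versions: dict[int, int]) -> bool:
    """True iff every node that is available in the published cluster state
    has individually reported that it is running that exact state version."""
    return all(reported_versions.get(node) == published_version
               for node in available_nodes)

# Node 2 still reports version 11, so version 12 has not yet converged.
print(cluster_converged(12, available_nodes={0, 1, 2},
                        reported_versions={0: 12, 1: 12, 2: 11}))  # False
```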
For a distributor node to converge, it must successfully negotiate bucket sets with all content nodes potentially affected by the new cluster state. This may be all content nodes in the cluster. To ensure that negotiation happens with the same inputs across both distributor and content node, the negotiation request contains a compact representation of the current cluster configuration. If this configuration does not match exactly across distributor and content node, the request is explicitly and immediately rejected.
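A minimal sketch of this mismatch rejection, assuming a hypothetical JSON-based config fingerprint; the real compact representation and RPC shape are internal to Vespa and not shown here.

```python
import hashlib
import json

def config_fingerprint(cluster_config: dict) -> str:
    """Deterministic, compact digest of the cluster configuration
    (illustrative stand-in for the real wire representation)."""
    canonical = json.dumps(cluster_config, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def handle_bucket_negotiation(request_fingerprint: str,
                              local_config: dict) -> tuple[bool, str]:
    """Content-node side: reject immediately if the distributor's view of
    the configuration differs from our own, so bucket sets are never
    negotiated from diverging inputs."""
    if request_fingerprint != config_fingerprint(local_config):
        return (False, "rejected: distribution config mismatch")
    return (True, "ok")

# A distributor with stale config (e.g. missing a node) is rejected outright.
stale = {"nodes": [0, 1, 2]}
current = {"nodes": [0, 1, 2, 3]}
print(handle_bucket_negotiation(config_fingerprint(stale), current))
```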
Problem
If one or more nodes have somehow diverged in their configuration (possibly caused by network partitions towards the config system, stale config from pointing at outdated config servers, bugs, etc.), the cluster will fail to converge to the new cluster state version.
Pinpointing that this is taking place—and which node(s) have stale config—requires checking Vespa logs and using a divining rod on the "SSV" (System State Version) column on the cluster controller.
Proposal
The cluster controller should be able to detect—and communicate—that one or more nodes are failing to converge, and what this set of nodes is. If required, we could add metadata from the distributors as part of their health responses.
- add a Prometheus metric in case one or more nodes are failing to converge
The cluster controller now emits two relevant metrics for determining cluster divergence:
- `cluster-controller.nodes-not-converged` (with a `node-type` label of either `storage` or `distributor`) states the number of nodes that have not converged to the most recent cluster state version for at least 30 seconds. It is expected that this number remains 0 under normal operation.
- `cluster-controller.cluster-buckets-out-of-sync-ratio` gives a number in [0, 1] for the ratio of data buckets in the system that are out of sync (or are not in their ideal location and will be moved). Note that since this is at bucket granularity and is binary per bucket, it may often "over-count"—i.e. if only 1 document out of 1000 is out of sync in a bucket, the bucket itself will still be counted as if fully out of sync. In a quiescent system this number should be 0.
Note that Prometheus exporting will normalize `.` and `-` in metric paths to `_`.
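To illustrate how the normalized names could be consumed, here is a sketch that scrapes a Prometheus text-format endpoint and flags divergence. The endpoint URL and any name prefixes added by the exporter are assumptions; only the metric names themselves come from the list above.

```python
import re
import urllib.request

# Hypothetical endpoint; point this at wherever your cluster controller
# metrics are exported in Prometheus text format.
METRICS_URL = "http://localhost:19092/prometheus/v1/values"

SAMPLE_RE = re.compile(r"^([A-Za-z_:][\w:]*(?:\{.*\})?)\s+(\S+)")

def scrape(url: str) -> dict[str, float]:
    """Parse a Prometheus text-format exposition into {series: value}."""
    body = urllib.request.urlopen(url).read().decode()
    samples = {}
    for line in body.splitlines():
        match = SAMPLE_RE.match(line)
        if match:
            samples[match.group(1)] = float(match.group(2))
    return samples

samples = scrape(METRICS_URL)
# After normalization, '.' and '-' in the metric paths become '_'.
not_converged = {k: v for k, v in samples.items()
                 if "nodes_not_converged" in k}
out_of_sync = {k: v for k, v in samples.items()
               if "buckets_out_of_sync_ratio" in k}

if any(v > 0 for v in not_converged.values()):
    print("WARNING: nodes not converged:", not_converged)
if any(v > 0 for v in out_of_sync.values()):
    print("NOTE: buckets out of sync ratio:", out_of_sync)
```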
Related to the issue outlined in the original comment, we're in the process of changing how distribution config is propagated to content nodes. Instead of distribution config being received via a "side channel", it will be managed and sent from the cluster controller itself as part of the versioned cluster state, which eliminates this source of state divergence on the nodes. We are currently rolling this out internally and will make it the default soon.