Add support for a distributed deployment
Currently, the only prepackaged way to run a distributed deployment is the helm chart. Although the helm chart works well, upgrades can be challenging: they require multiple updates to the helm values (3x, once for each region), which creates a lot of toil for clusters that are regularly upgraded and changed. Updating the helm values multiple times also creates more opportunities for human error. It would be ideal to let the operator orchestrate the updates across multiple clusters.
The biggest challenge I can see with this is having the operator poll the vmagent queue so it knows when it is safe for a zone to serve requests again.
Looks like a duplicate of https://github.com/VictoriaMetrics/operator/issues/1450
> The biggest challenge I can see with this is having the operator poll the vmagent queue so it knows when it is safe for a zone to serve requests again.

The helm chart doesn't provide such a feature.
This request is very similar to https://github.com/VictoriaMetrics/operator/issues/1450, except we need this functionality to be query latency preserving. That may put additional restrictions on the design.
The implementation we'd like to go forward with:
We are adding support for distributed (multi-Availability-Zone) deployments in the VictoriaMetrics Operator. This will be introduced as a new custom resource, DistributedCluster (name TBD). The resource will orchestrate VMAuth and VMCluster resources, and perform upgrades of VMCluster across different AZs.
Upgrade process
- Disable select queries to AZ1 on vmauth.
- Wait until traffic is drained (open question: or is a simple timeout enough?).
- Start the cluster upgrade and wait until its status is operational.
- Monitor vmagent's persisted queues for this cluster. Once the queues are drained, wait an additional minute to confirm they remain empty.
- Re-enable select queries to AZ1 on vmauth.
- Repeat the same steps for AZ2, AZ3, etc. (the per-zone loop is sketched below).
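To make the ordering concrete, here is a rough Go sketch of the per-zone loop. All type and method names here are illustrative, not the operator's actual API.

```go
package distributed

import (
	"context"
	"time"
)

// zoneUpgrader abstracts the per-zone operations the controller would need;
// every name here is a placeholder, not an existing operator interface.
type zoneUpgrader interface {
	DisableSelects(ctx context.Context, zone string) error
	EnableSelects(ctx context.Context, zone string) error
	UpgradeVMCluster(ctx context.Context, zone string) error
	WaitQueueDrained(ctx context.Context, zone string, confirmFor time.Duration) error
}

// upgradeZones walks the zones one at a time, mirroring the proposed process:
// drain reads, upgrade, wait for vmagent queues, re-enable reads.
func upgradeZones(ctx context.Context, u zoneUpgrader, zones []string, drainWait time.Duration) error {
	for _, zone := range zones {
		// 1. Stop routing select queries to this zone via vmauth.
		if err := u.DisableSelects(ctx, zone); err != nil {
			return err
		}
		// 2. Wait for in-flight queries to drain (open question: timeout vs. traffic stats).
		time.Sleep(drainWait)
		// 3. Upgrade the zone's VMCluster and wait until it reports an operational status.
		if err := u.UpgradeVMCluster(ctx, zone); err != nil {
			return err
		}
		// 4. Wait until vmagent persisted queues are empty and stay empty for an extra minute.
		if err := u.WaitQueueDrained(ctx, zone, time.Minute); err != nil {
			return err
		}
		// 5. Re-enable select queries to this zone before moving on to the next one.
		if err := u.EnableSelects(ctx, zone); err != nil {
			return err
		}
	}
	return nil
}
```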
The operator requires network access to vmagents in order to scrape their persisted queue size metrics. Decisions are based solely on the state of vmagent persisted queues.
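For the queue check, something like the following could work, assuming the operator can reach each vmagent's `/metrics` endpoint and that `vmagent_remotewrite_pending_data_bytes` is used as the drain signal. The vmagent address in `main` is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// pendingQueueBytes scrapes a vmagent /metrics endpoint and sums the
// vmagent_remotewrite_pending_data_bytes series. A zero result would be
// treated as "persisted queue is drained". The metric name is an assumption
// based on vmagent's standard metrics; adjust if a different signal is chosen.
func pendingQueueBytes(metricsURL string) (float64, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var total float64
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Skip HELP/TYPE comments and unrelated series.
		if !strings.HasPrefix(line, "vmagent_remotewrite_pending_data_bytes") {
			continue
		}
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		total += v
	}
	return total, sc.Err()
}

func main() {
	// Hypothetical in-cluster address of a vmagent pod (8429 is vmagent's default HTTP port).
	total, err := pendingQueueBytes("http://vmagent-az1:8429/metrics")
	if err != nil {
		panic(err)
	}
	fmt.Printf("pending persisted queue bytes: %f\n", total)
}
```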
Non-goals
- No automatic rollback on OOMs, crashes, or failures beyond initial Kubernetes controller startup checks.
- No monitoring of select/ingest errors, and no rollback in case of failures.
- Human intervention may still be required. A `pause=true` option will allow a human operator to stop the automation, take manual control, and resolve issues.
- The mechanism only works with vmagent; other collectors are not supported.
> This will be introduced as a new custom resource, DistributedCluster (name TBD)
I wonder if this can be implemented by extending VMCluster instead. Similar to topology-aware routing in k8s, the input could be a label used to distinguish zones (defaulting to topology.kubernetes.io/zone). This would enable us to be host-granular, too. It seems like a useful feature even for smaller deployments.
UPD: unless we want VMClusterDistributed to reference multiple existing VMClusters, treating each of them as belonging to a particular zone?
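Roughly, the two shapes I have in mind (both hypothetical; type and field names are placeholders):

```go
package design

// Option A (hypothetical): extend VMCluster itself with a zone label key,
// defaulting to the standard topology label, so the controller can group
// components by zone (and potentially by host).
type ZoneAwareOptions struct {
	// ZoneLabel is the node label used to distinguish zones.
	// Default: "topology.kubernetes.io/zone".
	ZoneLabel string `json:"zoneLabel,omitempty"`
}

// Option B (hypothetical): a separate resource that references existing
// VMCluster objects, treating each one as belonging to a particular zone.
type ZoneRef struct {
	Zone      string `json:"zone"`      // e.g. "zone-a"
	VMCluster string `json:"vmcluster"` // name of an existing VMCluster CR
}
```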
> Upgrade process: Wait until traffic is drained (open question: or is a simple timeout enough?).
I suppose we can fetch traffic stats from VMAgent metrics. Is there any other way for it to signal that no traffic is being sent?
> Monitor vmagent's persisted queues for this cluster.
This also happens via vmagent metrics?
The plan to implement this:
- Create a new `VMClusterDistributed` CRD (a rough type sketch follows this list). Its spec would require the following settings:
  - `clusterVersion` - default image tag for all components
  - `vmclusters` - a list of vmclusters to update
  - `vmauth` - name of the global vmauth used to switch read traffic to vmclusters
- Upgrade status is recorded in this CRD's `status`:
  - Record VMCluster generations in `status`
- When `clusterVersion` changes, the controller will randomly shuffle the vmcluster list and, for each cluster, do the following:
  - Update the `spec.vmauth` configuration to avoid reading from this cluster
  - Check vmagent metrics for this cluster and wait for the queue to be empty
  - Update the vmcluster to `spec.clusterVersion`
  - Wait for this cluster to be updated
  - Switch this cluster back in the vmauth config
- Repeat this until all clusters are updated
- When an error occurs, retry as long as the vmcluster generation matches the generation recorded in the status
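A rough Go sketch of how these spec/status fields could be laid out. Only `clusterVersion`, `vmclusters`, and `vmauth` come from the plan above; everything else (including `Paused`, mirroring the pause option mentioned earlier) is an assumption.

```go
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// VMClusterDistributedSpec is a sketch of the proposed CRD spec.
type VMClusterDistributedSpec struct {
	// ClusterVersion is the default image tag for all components.
	ClusterVersion string `json:"clusterVersion"`
	// VMClusters lists the VMCluster CRs (one per zone) to update in sequence.
	VMClusters []string `json:"vmclusters"`
	// VMAuth names the global vmauth used to switch read traffic between clusters.
	VMAuth string `json:"vmauth"`
	// Paused stops the automation so a human can take manual control (assumed field).
	Paused bool `json:"paused,omitempty"`
}

// VMClusterDistributedStatus records upgrade progress, including the observed
// generation of each VMCluster so retries can detect external changes.
type VMClusterDistributedStatus struct {
	// ClusterGenerations maps a VMCluster name to the generation observed
	// when its upgrade started.
	ClusterGenerations map[string]int64 `json:"clusterGenerations,omitempty"`
	// CurrentCluster is the VMCluster currently being upgraded.
	CurrentCluster string `json:"currentCluster,omitempty"`
	// Conditions follow the usual Kubernetes status conventions.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```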
Current limitations:
- each cluster is treated as its own "zone"; there is no way to group them or parallelize updates
- all resources should be operator-managed in order to be discoverable (i.e., no way to implement this for helm-managed vmauth)
- some timeouts may be hardcoded
Discussed with @f41gh7 today, more cases to cover:
- one VMAgent CR ref per distributed cluster, no need to specify one VMAgent ref per VMCluster CR
- each VMCluster may define a custom VMUser CR
- more maintenance options - not just `clusterVersion`, but any `overrideSpec`
  - `overrideSpec` should be present at a global and zone level (i.e., global version and zone-specific replica count)
- Two options for object definition (see the sketch after this list):
  - full spec, which creates a CR with a hardcoded name
  - existing object reference + `overrideSpec`
- Object definitions apply to VMCluster, VMUser, and VMAgent
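A rough sketch of how the global/zone-level `overrideSpec` and the two object-definition options could be expressed. All type and field names are placeholders, and `runtime.RawExtension` stands in for "any partial spec".

```go
package v1beta1

import "k8s.io/apimachinery/pkg/runtime"

// ObjectOrRef captures the two object-definition options discussed above:
// either a full inline spec (the controller creates a CR with a hardcoded
// name) or a reference to an existing object plus an override.
type ObjectOrRef struct {
	// Ref is the name of an existing CR to reuse.
	Ref string `json:"ref,omitempty"`
	// Spec is a full inline spec for a CR the controller should create.
	Spec *runtime.RawExtension `json:"spec,omitempty"`
	// OverrideSpec is a partial spec merged on top of the referenced object.
	OverrideSpec *runtime.RawExtension `json:"overrideSpec,omitempty"`
}

// ZoneSpec groups the per-zone objects; it applies to VMCluster, VMUser, and VMAgent.
type ZoneSpec struct {
	VMCluster ObjectOrRef `json:"vmcluster"`
	VMAgent   ObjectOrRef `json:"vmagent"`
	VMUser    ObjectOrRef `json:"vmuser"`
	// OverrideSpec applied only to this zone (e.g. zone-specific replica count).
	OverrideSpec *runtime.RawExtension `json:"overrideSpec,omitempty"`
}

// DistributedMaintenanceSpec shows a global overrideSpec (e.g. the version)
// combined with per-zone definitions and overrides.
type DistributedMaintenanceSpec struct {
	// OverrideSpec applied to every zone.
	OverrideSpec *runtime.RawExtension `json:"overrideSpec,omitempty"`
	// Zones holds per-zone object definitions and overrides.
	Zones []ZoneSpec `json:"zones,omitempty"`
}
```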