Add support for a distributed deployment
Currently, the only prepackaged way to run a distributed deployment is the helm chart. Although the helm chart works well, upgrades can be challenging: they require multiple updates to the helm values (3x, once for each region), which creates a lot of toil for clusters that are regularly upgraded and changed. Updating the helm values multiple times also creates more opportunities for human error. It would be ideal to let the operator orchestrate the updates across multiple clusters.
The biggest challenge I can see with this is having the operator poll the vmagent queue so it knows when it is safe for a zone to serve requests again.
Looks like a duplicate of https://github.com/VictoriaMetrics/operator/issues/1450
> The biggest challenge I can see with this is having the operator poll the vmagent queue so it knows when it is safe for a zone to serve requests again.

The helm chart doesn't provide such a feature.
This request is very similar to https://github.com/VictoriaMetrics/operator/issues/1450, except we need this functionality to be query latency preserving. That may put additional restrictions on the design.
The implementation we'd like to go forward with:
We are adding support for distributed (multi-Availability-Zone) deployments in the VictoriaMetrics Operator. This will be introduced as a new custom resource, DistributedCluster (name TBD). The resource will orchestrate VMAuth and VMCluster resources, and perform upgrades of VMCluster across different AZs.
Upgrade process
- Disable select queries to AZ1 on vmauth.
- Wait until traffic is drained (open question: or is a simple timeout enough?).
- Start the cluster upgrade and wait until its status is operational.
- Monitor vmagent's persisted queues for this cluster. Once the queues are drained, wait an additional minute to confirm they remain empty.
- Re-enable select queries to AZ1 on vmauth.
- Repeat the same steps for AZ2, AZ3, etc. (the per-zone loop is sketched below).
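To make the ordering concrete, here is a rough Go sketch of the per-zone loop. All type and method names here are illustrative, not the operator's actual API.

```go
package distributed

import (
	"context"
	"time"
)

// zoneUpgrader abstracts the per-zone operations the controller would need;
// every name here is a placeholder, not an existing operator interface.
type zoneUpgrader interface {
	DisableSelects(ctx context.Context, zone string) error
	EnableSelects(ctx context.Context, zone string) error
	UpgradeVMCluster(ctx context.Context, zone string) error
	WaitQueueDrained(ctx context.Context, zone string, confirmFor time.Duration) error
}

// upgradeZones walks the zones one at a time, mirroring the proposed process:
// drain reads, upgrade, wait for vmagent queues, re-enable reads.
func upgradeZones(ctx context.Context, u zoneUpgrader, zones []string, drainWait time.Duration) error {
	for _, zone := range zones {
		// 1. Stop routing select queries to this zone via vmauth.
		if err := u.DisableSelects(ctx, zone); err != nil {
			return err
		}
		// 2. Wait for in-flight queries to drain (open question: timeout vs. traffic stats).
		time.Sleep(drainWait)
		// 3. Upgrade the zone's VMCluster and wait until it reports an operational status.
		if err := u.UpgradeVMCluster(ctx, zone); err != nil {
			return err
		}
		// 4. Wait until vmagent persisted queues are empty and stay empty for an extra minute.
		if err := u.WaitQueueDrained(ctx, zone, time.Minute); err != nil {
			return err
		}
		// 5. Re-enable select queries to this zone before moving on to the next one.
		if err := u.EnableSelects(ctx, zone); err != nil {
			return err
		}
	}
	return nil
}
```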
The operator requires network access to vmagents in order to scrape their persisted queue size metrics. Decisions are based solely on the state of vmagent persisted queues.
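For the queue check, something like the following could work, assuming the operator can reach each vmagent's `/metrics` endpoint and that `vmagent_remotewrite_pending_data_bytes` is used as the drain signal. The vmagent address in `main` is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// pendingQueueBytes scrapes a vmagent /metrics endpoint and sums the
// vmagent_remotewrite_pending_data_bytes series. A zero result would be
// treated as "persisted queue is drained". The metric name is an assumption
// based on vmagent's standard metrics; adjust if a different signal is chosen.
func pendingQueueBytes(metricsURL string) (float64, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var total float64
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Skip HELP/TYPE comments and unrelated series.
		if !strings.HasPrefix(line, "vmagent_remotewrite_pending_data_bytes") {
			continue
		}
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		total += v
	}
	return total, sc.Err()
}

func main() {
	// Hypothetical in-cluster address of a vmagent pod (8429 is vmagent's default HTTP port).
	total, err := pendingQueueBytes("http://vmagent-az1:8429/metrics")
	if err != nil {
		panic(err)
	}
	fmt.Printf("pending persisted queue bytes: %f\n", total)
}
```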
Non-goals
- No automatic rollback on OOMs, crashes, or failures beyond initial Kubernetes controller startup checks.
- No monitoring of select/ingest errors, and no rollback in case of failures.
- Human intervention may still be required. A `pause=true` option will allow a human operator to stop the automation, take manual control, and resolve issues.
- The mechanism only works with vmagent; other collectors are not supported.
> This will be introduced as a new custom resource, DistributedCluster (name TBD)
I wonder if this can be implemented by extending VMCluster instead. Similar to topology-aware routing in k8s, the input could be a label used to distinguish zones (defaulting to topology.kubernetes.io/zone). This would enable us to be host-granular, too. It seems like a useful feature even for smaller deployments.
UPD: unless we want VMClusterDistributed to reference multiple existing VMClusters, treating each of them as belonging to a particular zone?
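Roughly, the two shapes I have in mind (both hypothetical; type and field names are placeholders):

```go
package design

// Option A (hypothetical): extend VMCluster itself with a zone label key,
// defaulting to the standard topology label, so the controller can group
// components by zone (and potentially by host).
type ZoneAwareOptions struct {
	// ZoneLabel is the node label used to distinguish zones.
	// Default: "topology.kubernetes.io/zone".
	ZoneLabel string `json:"zoneLabel,omitempty"`
}

// Option B (hypothetical): a separate resource that references existing
// VMCluster objects, treating each one as belonging to a particular zone.
type ZoneRef struct {
	Zone      string `json:"zone"`      // e.g. "zone-a"
	VMCluster string `json:"vmcluster"` // name of an existing VMCluster CR
}
```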
> Upgrade process: Wait until traffic is drained (open question: or is a simple timeout enough?).
I suppose we can fetch traffic stats from VMAgent metrics. Is there any other way for it to signal that no traffic is being sent?
> Monitor vmagent's persisted queues for this cluster.
This also happens via vmagent metrics?
The plan to implement this:
- Create a new `VMClusterDistributed` CRD (a rough type sketch follows this list). Its spec would require the following settings:
  - `clusterVersion` - default image tag for all components
  - `vmclusters` - a list of vmclusters to update
  - `vmauth` - name of the global vmauth used to switch read traffic to vmclusters
- Upgrade status is recorded in this CRD's `status`:
  - Record VMCluster generations in `status`
- When `clusterVersion` changes, the controller will randomly shuffle the vmcluster list and, for each cluster, do the following:
  - Update the `spec.vmauth` configuration to avoid reading from this cluster
  - Check vmagent metrics for this cluster and wait for the queue to be empty
  - Update the vmcluster to `spec.clusterVersion`
  - Wait for this cluster to be updated
  - Switch this cluster back in the vmauth config
- Repeat this until all clusters are updated
- When an error occurs, retry as long as the vmcluster generation matches the generation recorded in the status
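A rough Go sketch of how these spec/status fields could be laid out. Only `clusterVersion`, `vmclusters`, and `vmauth` come from the plan above; everything else (including `Paused`, mirroring the pause option mentioned earlier) is an assumption.

```go
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// VMClusterDistributedSpec is a sketch of the proposed CRD spec.
type VMClusterDistributedSpec struct {
	// ClusterVersion is the default image tag for all components.
	ClusterVersion string `json:"clusterVersion"`
	// VMClusters lists the VMCluster CRs (one per zone) to update in sequence.
	VMClusters []string `json:"vmclusters"`
	// VMAuth names the global vmauth used to switch read traffic between clusters.
	VMAuth string `json:"vmauth"`
	// Paused stops the automation so a human can take manual control (assumed field).
	Paused bool `json:"paused,omitempty"`
}

// VMClusterDistributedStatus records upgrade progress, including the observed
// generation of each VMCluster so retries can detect external changes.
type VMClusterDistributedStatus struct {
	// ClusterGenerations maps a VMCluster name to the generation observed
	// when its upgrade started.
	ClusterGenerations map[string]int64 `json:"clusterGenerations,omitempty"`
	// CurrentCluster is the VMCluster currently being upgraded.
	CurrentCluster string `json:"currentCluster,omitempty"`
	// Conditions follow the usual Kubernetes status conventions.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```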
Current limitations:
- each cluster is treated as its own "zone"; there is no way to group them or parallelize updates
- all resources should be operator-managed in order to be discoverable (i.e., no way to implement this for helm-managed vmauth)
- some timeouts may be hardcoded
Discussed with @f41gh7 today, more cases to cover:
- one VMAgent CR ref per distributed cluster, no need to specify one VMAgent ref per VMCluster CR
- each VMCluster may define a custom VMUser CR
- more maintenance options - not just `clusterVersion`, but any `overrideSpec`
  - `overrideSpec` should be present at a global and zone level (i.e., global version and zone-specific replica count)
- Two options for object definition (see the sketch after this list):
  - full spec, which creates a CR with a hardcoded name
  - existing object reference + `overrideSpec`
- Object definitions apply to VMCluster, VMUser, and VMAgent
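A rough sketch of how the global/zone-level `overrideSpec` and the two object-definition options could be expressed. All type and field names are placeholders, and `runtime.RawExtension` stands in for "any partial spec".

```go
package v1beta1

import "k8s.io/apimachinery/pkg/runtime"

// ObjectOrRef captures the two object-definition options discussed above:
// either a full inline spec (the controller creates a CR with a hardcoded
// name) or a reference to an existing object plus an override.
type ObjectOrRef struct {
	// Ref is the name of an existing CR to reuse.
	Ref string `json:"ref,omitempty"`
	// Spec is a full inline spec for a CR the controller should create.
	Spec *runtime.RawExtension `json:"spec,omitempty"`
	// OverrideSpec is a partial spec merged on top of the referenced object.
	OverrideSpec *runtime.RawExtension `json:"overrideSpec,omitempty"`
}

// ZoneSpec groups the per-zone objects; it applies to VMCluster, VMUser, and VMAgent.
type ZoneSpec struct {
	VMCluster ObjectOrRef `json:"vmcluster"`
	VMAgent   ObjectOrRef `json:"vmagent"`
	VMUser    ObjectOrRef `json:"vmuser"`
	// OverrideSpec applied only to this zone (e.g. zone-specific replica count).
	OverrideSpec *runtime.RawExtension `json:"overrideSpec,omitempty"`
}

// DistributedMaintenanceSpec shows a global overrideSpec (e.g. the version)
// combined with per-zone definitions and overrides.
type DistributedMaintenanceSpec struct {
	// OverrideSpec applied to every zone.
	OverrideSpec *runtime.RawExtension `json:"overrideSpec,omitempty"`
	// Zones holds per-zone object definitions and overrides.
	Zones []ZoneSpec `json:"zones,omitempty"`
}
```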