
Control plane VMs in vSphere Cluster sometimes land in wrong DC due to Storage Policy

Open pslijkhuis opened this issue 5 months ago • 4 comments

/kind bug

We're using CAPV to deploy a workload cluster across two data centers (DC1 and DC2) within a stretched vSphere cluster.

Control plane nodes are assigned to failure domains correctly via the VSphereCluster resource and placed in the appropriate VM groups (DC1 or DC2).

Worker nodes behave as expected because they use separate VSphereMachineTemplates with storage policies scoped to their respective DCs.

Control plane nodes, however, share a single VSphereMachineTemplate. This template uses a storage policy that targets all datastores across both DCs.
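For illustration, the shared control plane template looks roughly like this (a minimal sketch; all names here are hypothetical, only the field layout follows the CAPV v1beta1 API):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: cp-template                        # hypothetical name
  namespace: default
spec:
  template:
    spec:
      template: ubuntu-2204-kube-v1.31.5   # hypothetical VM template
      datacenter: stretched-dc             # hypothetical datacenter name
      # This policy matches datastores in BOTH DC1 and DC2, so vSphere may
      # provision the VM's disks in either DC, regardless of the VM group
      # the compute placement selected.
      storagePolicyName: all-dcs-policy    # hypothetical policy name
```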

Occasionally, a control plane VM that should be running in DC1 is placed in the DC1 VM group, but the actual storage is provisioned in a datastore located in DC2. As a result, the entire VM ends up running in the wrong DC.

We believe this occurs because vSphere places the VM based on where the datastore is actually provisioned, which is currently not restricted tightly enough due to the shared storage policy in the control plane template.

Expected Behavior: Control plane VMs should be entirely placed—including compute and storage—in the same data center as their assigned failure domain.

Actual Behavior: Control plane VMs occasionally land in the wrong physical data center due to storage being provisioned from the opposite DC.

Environment:

  • Cluster-api-provider-vsphere version: 1.13.0
  • Kubernetes version: (use kubectl version): v1.31.5
  • OS (e.g. from /etc/os-release): Ubuntu 22.04.5 LTS

pslijkhuis avatar May 30 '25 20:05 pslijkhuis

Did you consider setting the datastore in the failure domains for the control plane instead? (It should be part of VSphereFailureDomain's .spec.topology.datastore.)

I know it's a different way, but would it result in a valid scenario, or are there issues with that?
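A sketch of that suggestion (names are hypothetical; the field layout follows the VSphereFailureDomain v1beta1 API): pinning the datastore per failure domain via .spec.topology.datastore so storage can only land in the matching DC:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereFailureDomain
metadata:
  name: dc1                         # hypothetical
spec:
  region:
    name: region-a                  # hypothetical
    type: ComputeCluster
  zone:
    name: dc1                       # hypothetical
    type: HostGroup
  topology:
    datacenter: stretched-dc        # hypothetical
    computeCluster: cluster-1       # hypothetical
    hosts:
      vmGroupName: dc1-vms          # hypothetical VM group
      hostGroupName: dc1-hosts      # hypothetical host group
    # Pin storage to a DC1-local datastore so the VM cannot end up
    # with its disks provisioned in DC2.
    datastore: dc1-datastore        # hypothetical
```

Note that this pins a concrete datastore rather than a storage policy, which is the trade-off discussed below.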

chrischdi avatar Jun 03 '25 07:06 chrischdi

> Did you consider setting the datastore in the failure domains for the control plane instead? (It should be part of VSphereFailureDomain's .spec.topology.datastore.)
>
> I know it's a different way, but would it result in a valid scenario, or are there issues with that?

Thanks for your reply. That does work, but we're forced to use a storage policy, which doesn't work with failure domains.

pslijkhuis avatar Jun 03 '25 11:06 pslijkhuis

So the solution here would be the ability to set a storage policy via the failure domain?

I guess the storage policy cannot be the same for all KCP VMs (independent of the failure domain)? If it could, you could already set it in the VSphereMachineTemplate.

chrischdi avatar Jun 04 '25 06:06 chrischdi

Yes, it would work if I were able to specify a storage policy in the VSphereFailureDomain, but that's not allowed.

If I use a storage policy which returns all datastores in both DCs, I sometimes end up with a VM in DC01 with its storage in DC02.

pslijkhuis avatar Jun 06 '25 17:06 pslijkhuis

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 04 '25 18:09 k8s-triage-robot

Sounds reasonable to me to allow setting this via failure domains.

chrischdi avatar Sep 09 '25 11:09 chrischdi

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 09 '25 12:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 08 '25 12:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to the /close not-planned command in the triage bot's comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 08 '25 12:11 k8s-ci-robot