
Control plane VMs in vSphere Cluster sometimes land in wrong DC due to Storage Policy

Open pslijkhuis opened this issue 5 months ago • 4 comments

/kind bug

We're using CAPV to deploy a workload cluster across two data centers (DC1 and DC2) within a stretched vSphere cluster.

Control plane nodes are assigned to failure domains correctly via the VSphereCluster resource and placed in the appropriate VM groups (DC1 or DC2).

Worker nodes behave as expected because they use separate VSphereMachineTemplates with storage policies scoped to their respective DCs.

Control plane nodes, however, share a single VSphereMachineTemplate. This template uses a storage policy that targets all datastores across both DCs.
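For illustration, the shared control plane template looks roughly like this (a minimal sketch; all names here are hypothetical, only the field layout follows the CAPV v1beta1 API):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: cp-template                        # hypothetical name
  namespace: default
spec:
  template:
    spec:
      template: ubuntu-2204-kube-v1.31.5   # hypothetical VM template
      datacenter: stretched-dc             # hypothetical datacenter name
      # This policy matches datastores in BOTH DC1 and DC2, so vSphere may
      # provision the VM's disks in either DC, regardless of the VM group
      # the compute placement selected.
      storagePolicyName: all-dcs-policy    # hypothetical policy name
```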

Occasionally, a control plane VM that should be running in DC1 is placed in the DC1 VM group, but the actual storage is provisioned in a datastore located in DC2. As a result, the entire VM ends up running in the wrong DC.

We believe this occurs because vSphere places the VM based on where the datastore is actually provisioned, which is currently not restricted tightly enough due to the shared storage policy in the control plane template.

Expected Behavior: Control plane VMs should be entirely placed—including compute and storage—in the same data center as their assigned failure domain.

Actual Behavior: Control plane VMs occasionally land in the wrong physical data center due to storage being provisioned from the opposite DC.

Environment:

  • Cluster-api-provider-vsphere version: 1.13.0
  • Kubernetes version: (use kubectl version): v1.31.5
  • OS (e.g. from /etc/os-release): Ubuntu 22.04.5 LTS

pslijkhuis avatar May 30 '25 20:05 pslijkhuis

Did you consider setting the datastore in the failure domains for the control plane instead? (It should be part of VSphereFailureDomain's .spec.topology.datastore.)

I know it's a different way, but would it result in a valid scenario, or are there issues with that?
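A sketch of that suggestion (names are hypothetical; the field layout follows the VSphereFailureDomain v1beta1 API): pinning the datastore per failure domain via .spec.topology.datastore so storage can only land in the matching DC:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereFailureDomain
metadata:
  name: dc1                         # hypothetical
spec:
  region:
    name: region-a                  # hypothetical
    type: ComputeCluster
  zone:
    name: dc1                       # hypothetical
    type: HostGroup
  topology:
    datacenter: stretched-dc        # hypothetical
    computeCluster: cluster-1       # hypothetical
    hosts:
      vmGroupName: dc1-vms          # hypothetical VM group
      hostGroupName: dc1-hosts      # hypothetical host group
    # Pin storage to a DC1-local datastore so the VM cannot end up
    # with its disks provisioned in DC2.
    datastore: dc1-datastore        # hypothetical
```

Note that this pins a concrete datastore rather than a storage policy, which is the trade-off discussed below.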

chrischdi avatar Jun 03 '25 07:06 chrischdi

> Did you consider setting the datastore in the failure domains for the control plane instead? (It should be part of VSphereFailureDomain's .spec.topology.datastore.)
>
> I know it's a different way, but would it result in a valid scenario, or are there issues with that?

Thanks for your reply. That does work, but we're forced to use a storage policy, which doesn't work with failure domains.

pslijkhuis avatar Jun 03 '25 11:06 pslijkhuis

So the solution here would be the ability to set a storage policy via the failure domain?

I guess the storage policy cannot be the same for all KCP VMs (independent of the failure domain)? If it could, you could already set it in the VSphereMachineTemplate.

chrischdi avatar Jun 04 '25 06:06 chrischdi

Yes, it would work if I were able to specify a storage policy in the VSphereFailureDomain, but that's not allowed.

If I use a storage policy which returns all datastores in both DCs, I sometimes end up with a VM in DC01 with its storage in DC02.

pslijkhuis avatar Jun 06 '25 17:06 pslijkhuis

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 04 '25 18:09 k8s-triage-robot

Sounds reasonable to me to allow setting this via failure domains.

chrischdi avatar Sep 09 '25 11:09 chrischdi

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Oct 09 '25 12:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Nov 08 '25 12:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to the /close not-planned command in the triage bot's comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 08 '25 12:11 k8s-ci-robot