cluster-api-provider-vsphere
Control plane VMs in vSphere cluster sometimes land in wrong DC due to storage policy
/kind bug
We're using CAPV to deploy a workload cluster across two data centers (DC1 and DC2) within a stretched vSphere cluster.
Control plane nodes are assigned to failure domains correctly via the vspherecluster resource and placed in the appropriate VM groups (DC1 or DC2).
Worker nodes behave as expected because they use separate vspheretemplates with storage policies scoped to their respective DCs.
Control plane nodes, however, share a single vspheretemplate. This template uses a storage policy that targets all datastores across both DCs.
Occasionally, a control plane VM that should be running in DC1 is placed in the DC1 VM group, but the actual storage is provisioned in a datastore located in DC2. As a result, the entire VM ends up running in the wrong DC.
We believe this occurs because vSphere places the VM based on where the datastore is actually provisioned, which is currently not restricted tightly enough due to the shared storage policy in the control plane template.
Expected Behavior: Control plane VMs should be entirely placed—including compute and storage—in the same data center as their assigned failure domain.
Actual Behavior: Control plane VMs occasionally land in the wrong physical data center due to storage being provisioned from the opposite DC.
Environment:
- Cluster-api-provider-vsphere version: 1.13.0
- Kubernetes version (use kubectl version): v1.31.5
- OS (e.g. from /etc/os-release): Ubuntu 22.04.5 LTS
Did you consider setting the datastore in the failure domains for the control plane instead? (It should be part of VSphereFailureDomain .spec.topology.datastore.)
I know it's a different approach, but would it result in a valid scenario, or are there issues with it?
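The suggestion above would look roughly like the sketch below. All names, paths, and region/zone values are made up for illustration; the point is only the .spec.topology.datastore field, which scopes a failure domain to a specific datastore rather than relying on a storage policy:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereFailureDomain
metadata:
  name: dc1                               # hypothetical failure domain name
spec:
  region:
    name: stretched-cluster               # hypothetical
    type: ComputeCluster
    tagCategory: k8s-region               # hypothetical
  zone:
    name: dc1                             # hypothetical
    type: HostGroup
    tagCategory: k8s-zone                 # hypothetical
  topology:
    datacenter: DC                        # hypothetical
    computeCluster: /DC/host/Cluster01    # hypothetical path
    hosts:
      hostGroupName: dc1-hosts            # hypothetical
      vmGroupName: dc1-vms                # hypothetical
    datastore: /DC/datastore/dc1-ds       # pins storage for this FD to a DC1 datastore
```

With this, a control plane VM assigned to the dc1 failure domain would get both its compute placement (VM group) and its storage (datastore) from DC1, instead of letting a DC-spanning storage policy pick the datastore.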
Thanks for your reply. That does work, but we're required to use a storage policy, and that doesn't work with failure domains.
So the solution here would be to be able to set a storage policy via the failure domain?
I guess the storage policy cannot be the same for all KCP VMs (independent of the failure domain)? If it could, you could already set it in the VSphereMachineTemplate.
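For context, setting a single storage policy for all control plane VMs is already possible today on the template, along the lines of the sketch below (all names and values are illustrative; the relevant field is storagePolicyName). This only helps if one policy is acceptable across every failure domain:

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: controlplane-template             # hypothetical name
spec:
  template:
    spec:
      template: ubuntu-2204-kube-v1.31.5  # hypothetical VM template
      datacenter: DC                      # hypothetical
      storagePolicyName: cp-policy        # one policy shared by all control plane VMs
```

The issue in this thread is exactly that such a shared policy matches datastores in both DCs, so a per-failure-domain policy (which the API currently does not allow) would be needed instead.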
Yes, it would work if I were able to specify a storage policy in the VSphereFailureDomain, but that's not allowed.
If I use a storage policy that returns all datastores in both DCs, I sometimes end up with a VM in DC01 whose storage is in DC02.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Sounds reasonable to me to allow this setting via failure domains.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.