cluster-api-provider-vsphere VM disk placement isn't distributed across datastores when a datastore cluster is used

/kind bug

What steps did you take and what happened:

Environment setup:

A datastore cluster with more than one datastore
A storage policy that targets the datastore cluster
Multiple machines created using the storage policy

What did you expect to happen:

Expected the machines' disks to be distributed across all the datastores.

What actually happened:

A single datastore is repeatedly targeted for the disks.

Environment:

Cluster-api-provider-vsphere version:
Kubernetes version: (use kubectl version):
OS (e.g. from /etc/os-release):

Aug 09 '21 13:08 gab-satchi

/label triage/needs-information

Aug 16 '21 18:08 gab-satchi

@gab-satchi: The label(s) /label triage/needs-information cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda

In response to this:

/label triage/needs-information

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Aug 16 '21 18:08 k8s-ci-robot

/label needs-information

Aug 16 '21 19:08 gab-satchi

@gab-satchi: The label(s) /label needs-information cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda

In response to this:

/label needs-information

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Aug 16 '21 19:08 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Nov 14 '21 19:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Dec 14 '21 20:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Jan 13 '22 20:01 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jan 13 '22 20:01 k8s-ci-robot

/reopen /remove-lifecycle rotten /lifecycle frozen

Jan 28 '22 23:01 srm09

@srm09: Reopened this issue.

In response to this:

/reopen /remove-lifecycle rotten /lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jan 28 '22 23:01 k8s-ci-robot

/milestone Next

Jan 28 '22 23:01 srm09

/help

Jan 31 '22 00:01 srm09

@srm09: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jan 31 '22 00:01 k8s-ci-robot

/remove-lifecycle frozen /lifecycle active

Feb 16 '23 21:02 srm09

/assign

Feb 23 '23 02:02 srm09

/unassign

Feb 23 '23 22:02 srm09

/assign I would like to work on it.

May 11 '23 08:05 zhanggbj

Some investigation into this issue:

Briefly, CAPV checks whether the user specified a particular datastore or a storage policy (our case). If a storage policy is used, CAPV randomly picks one of the compatible datastores and then creates the VM.
Based on my observation, if CAPV chooses sharedVmfs-1, vSphere first creates a disk folder on sharedVmfs-1 but eventually moves all the disk files and the folder back to sharedVmfs-0. The end result matches the report, with everything located on sharedVmfs-0, but CAPV did in fact send the right request: there is an intermediate state on sharedVmfs-1 before everything is moved to sharedVmfs-0.

So this is not a simple bug; it involves several pieces of work:

There's a known bug: when a DatastoreCluster is used, CAPV also treats the DatastoreCluster itself as a compatible datastore, which leads to unexpected behavior. This can be fixed by PR #1937.
Instead of having CAPV choose a datastore randomly, we should delegate the choice to StorageResourceManager so that the DatastoreCluster is used natively. There is a proposal in #1938.
The distribution itself may need more investigation; it may be related to Storage DRS and some anti-affinity rules.
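
To make the first item concrete, here is a minimal sketch of the idea: drop the datastore cluster from the candidate list before picking one at random. It is not CAPV's actual code and not the change in PR #1937. The objectRef type and the filterDatastores, pickDatastore and pickRandomDatastore helpers are hypothetical names made up for illustration. The only vSphere fact it assumes is that a plain datastore has the managed object type "Datastore" while a datastore cluster has the type "StoragePod".

```go
package main

import (
	"errors"
	"math/rand"
)

// objectRef is a hypothetical stand-in for a vSphere managed object
// reference; it is not a type from CAPV or govmomi.
type objectRef struct {
	Type  string // "Datastore", or "StoragePod" for a datastore cluster
	Value string // e.g. "datastore-12"
}

// filterDatastores keeps only plain datastores, so the datastore cluster
// itself is never treated as a placement target.
func filterDatastores(candidates []objectRef) []objectRef {
	var out []objectRef
	for _, c := range candidates {
		if c.Type == "Datastore" {
			out = append(out, c)
		}
	}
	return out
}

// pickDatastore filters the candidates and then selects one using choose,
// which must return an index in [0, n). Injecting choose keeps the
// selection testable.
func pickDatastore(candidates []objectRef, choose func(n int) int) (objectRef, error) {
	ds := filterDatastores(candidates)
	if len(ds) == 0 {
		return objectRef{}, errors.New("no compatible datastore found")
	}
	return ds[choose(len(ds))], nil
}

// pickRandomDatastore mirrors the random selection described above.
func pickRandomDatastore(candidates []objectRef) (objectRef, error) {
	return pickDatastore(candidates, rand.Intn)
}
```

Calling pickRandomDatastore with a list that contains one StoragePod and two Datastore entries only ever returns one of the two datastores. Note that this addresses the first item only: as observed above, the disks can still end up on a single datastore after vSphere moves them.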

Jul 11 '23 07:07 zhanggbj
