
Proposal for Native Support of Datastore Clusters in CAPV for Optimal Storage Placement

zhanggbj opened this issue 1 year ago · 4 comments

/kind feature

Describe the solution you'd like

Background

Currently, when using a Storage Policy and a Datastore Cluster to clone a VM, CAPV fetches the compatible datastores and randomly selects one of them; see the code logic here. However, there is a bug that leads to unexpected behavior: CAPV also treats a Datastore Cluster itself as a compatible datastore. This has been reported in #1914 and #1853, and can be quickly addressed by fixing the compatibility check in PR #1937. A more significant concern, however, is that CAPV is not fully leveraging the capabilities of Datastore Clusters. They are designed to manage storage through features such as Storage DRS, which can provide placement recommendations that account for various constraints, including space usage, affinity rules, and datastore maintenance mode. The current random selection approach neither ensures optimal datastore placement nor uses Datastore Clusters effectively.
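
For context, a minimal sketch of the kind of check the compatibility fix involves (variable names are illustrative; the surrounding CAPV plumbing is assumed):

	// compatHubs is assumed to hold the PBM-compatible placement hubs
	// ([]pbmtypes.PbmPlacementHub, pbmtypes = github.com/vmware/govmomi/pbm/types).
	// A Datastore Cluster is reported as a hub of type "StoragePod" and must
	// not be treated as a plain datastore.
	var datastores []pbmtypes.PbmPlacementHub
	for _, hub := range compatHubs {
		if hub.HubType != "Datastore" {
			continue // skip StoragePod (Datastore Cluster) hubs
		}
		datastores = append(datastores, hub)
	}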

Proposal

This feature request aims to enhance CAPV with native support for Datastore Clusters, using the underlying govmomi object StorageResourceManager. This enhancement will enable CAPV to leverage the placement recommendations produced by Storage DRS based on the specified constraints and objectives, and to clone the VM onto one of the recommended datastores. It will ensure that CAPV fully utilizes the capabilities of Datastore Clusters, enabling users to take full advantage of optimized storage placement while maintaining compatibility with storage policies.

Benefits

Improved Datastore Placement: by leveraging Datastore Clusters and Storage DRS placement recommendations, CAPV can ensure optimal placement of VM disks based on various constraints, resulting in better storage utilization and performance.

Implementation Details

  • Get the Datastore Cluster based on the Storage Policy (see the pbm sketch after this list)

  • Leverage the StorageResourceManager object to retrieve placement recommendations from the Datastore Cluster (see the snippet that follows the sketch)
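
For the first step, a minimal sketch of resolving the Datastore Cluster from a Storage Policy via govmomi's pbm client; the policy name and the candidate hub list are assumptions for illustration:

	// pbm = github.com/vmware/govmomi/pbm, pbmtypes = github.com/vmware/govmomi/pbm/types
	pbmClient, err := pbm.NewClient(ctx, ctx.Session.Client.Client)
	if err != nil {
		return err
	}
	// Resolve the profile ID of the named storage policy (name assumed as input).
	profileID, err := pbmClient.ProfileIDByName(ctx, storagePolicyName)
	if err != nil {
		return err
	}
	// Ask PBM which placement hubs satisfy the policy; hubs is assumed to list
	// the candidate datastores and datastore clusters as PbmPlacementHub values.
	req := []pbmtypes.BasePbmPlacementRequirement{
		&pbmtypes.PbmPlacementCapabilityProfileRequirement{
			ProfileId: pbmtypes.PbmProfileId{UniqueId: profileID},
		},
	}
	res, err := pbmClient.CheckRequirements(ctx, hubs, nil, req)
	if err != nil {
		return err
	}
	// A Datastore Cluster surfaces as a hub of type "StoragePod".
	for _, hub := range res.CompatibleDatastores() {
		if hub.HubType == "StoragePod" {
			// Use this hub as the Storage DRS placement target.
		}
	}

With the Datastore Cluster in hand, the second step retrieves the Storage DRS recommendations: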

	// Look up the target Datastore Cluster (StoragePod); the name here is an
	// example from a test environment.
	datastoreCluster, err := ctx.Session.Finder.DatastoreCluster(ctx, "DatastoreClusterZhg")
	if err != nil {
		return errors.Wrapf(err, "unable to get datastore cluster for %q", ctx)
	}
	storagePodRef := types.NewReference(datastoreCluster.Reference())

	// Build the pod selection spec pointing Storage DRS at the cluster.
	podSelectionSpec := types.StorageDrsPodSelectionSpec{
		StoragePod: storagePodRef,
	}

	folderRef := folder.Reference()
	vmRef := tpl.Reference()

	// Build the placement spec for a clone operation.
	storagePlacementSpec := types.StoragePlacementSpec{
		Folder:           &folderRef,
		Vm:               &vmRef,
		CloneName:        ctx.VSphereVM.Name,
		CloneSpec:        &spec,
		PodSelectionSpec: podSelectionSpec,
		Type:             string(types.StoragePlacementSpecPlacementTypeClone),
	}

	// Ask Storage DRS for placement recommendations.
	storageResourceManager := object.NewStorageResourceManager(ctx.Session.Client.Client)
	result, err := storageResourceManager.RecommendDatastores(ctx, storagePlacementSpec)
	if err != nil {
		return errors.Wrap(err, "failed to get recommended datastores from storage resource manager")
	}

	// Storage DRS returns zero or more recommendations.
	recommendations := result.Recommendations
	if len(recommendations) == 0 {
		return errors.New("no datastore-cluster recommendations")
	}

	// Take the destination of the first recommendation.
	datastoreRef = &recommendations[0].Action[0].(*types.StoragePlacementAction).Destination
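
From here, a hedged sketch of how the recommendation could be consumed; wiring the recommended datastore into the clone's relocate spec is an assumption about the surrounding clone flow:

	// Point the clone at the recommended datastore before issuing the clone task.
	spec.Location.Datastore = datastoreRef

	// Alternatively, Storage DRS can execute the recommendation itself by key:
	// task, err := storageResourceManager.ApplyStorageDrsRecommendation(
	// 	ctx, []string{recommendations[0].Key})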

Anything else you would like to add:

Discussion of DatastoreCluster and StorageResourceManager support in govmomi:

Environment:

  • Cluster-api-provider-vsphere version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

zhanggbj commented Jun 07 '23

@zhanggbj Q: is this already done?

sbueringer commented Aug 21 '23

I took a quick try with StorageResourceManager in CAPV. There is a problem with full clone mode (VirtualMachineRelocateDiskMoveOptionsMoveAllDiskBackingsAndConsolidate) that needs more investigation; please find more details in https://github.com/vmware/govmomi/issues/3138.

Clone mode: when I call RecommendDatastores with VirtualMachineCloneSpec.Location.DiskMoveType set to VirtualMachineRelocateDiskMoveOptionsMoveAllDiskBackingsAndAllowSharing or VirtualMachineRelocateDiskMoveOptionsCreateNewChildDiskBacking, it works well. However, with VirtualMachineRelocateDiskMoveOptionsMoveAllDiskBackingsAndConsolidate I get the following error: "err: ServerFaultCode: A specified parameter was not correct: diskMoveType". This deviates from my expectations, since in our scenario this move type corresponds to vSphere's full clone mode.
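
For reference, a minimal sketch of how the disk move type is set on the clone spec in this experiment (the rest of the clone spec is assumed):

	// types = github.com/vmware/govmomi/vim25/types
	// The consolidate option below is the one that triggers the error; the
	// other two constants mentioned work with RecommendDatastores.
	spec := types.VirtualMachineCloneSpec{
		Location: types.VirtualMachineRelocateSpec{
			DiskMoveType: string(types.VirtualMachineRelocateDiskMoveOptionsMoveAllDiskBackingsAndConsolidate),
		},
	}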

CC @sbueringer

zhanggbj commented Sep 04 '23

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Jan 27 '24

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented Feb 26 '24

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot commented Mar 27 '24

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot commented Mar 27 '24