
Relocation of volumes between datastores

Open farodin91 opened this issue 4 years ago • 22 comments

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Is there a way to migrate CSI vSphere volumes between datastores? We are not using vSAN or Tanzu.

One idea would be to use CSI cloning.

Any ideas that could help? Is there a way to do it manually?

farodin91 avatar Jan 04 '21 09:01 farodin91

You can use the standard disk vMotion procedure (Migrate -> Change Storage only, and enable Configure per disk) as you would normally do with a mounted disk of a virtual machine. The tricky part is identifying the actual disk so you can then migrate it ...

You can use a kubectl plugin named vtopology, which maps the PV name to a disk ID, or you can go the hard way and use PowerCLI to query vSphere ...

The output of vtopology looks like this:

```
=== Storage Policy (SPBM) information for PV pvc-e74383ca-fad3-479d-b91e-b283d9e872a0 ===

    Kubernetes VM/Node :  k8s-plus-wrk05-bt.lab.up
    Hard Disk Name     :  Hard disk 30
    Policy Name        :  silver
    Policy Compliance  :  compliant
```

You can then Storage vMotion that particular disk (e.g. Hard disk 30) to another datastore.
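If you only have kubectl at hand, here is a minimal sketch of the same mapping (the namespace, PVC, and PV names below are just placeholders): the CSI volume handle of a vSphere CSI PV is the CNS/FCD volume ID, which you can match against the vtopology output or search for in the vSphere UI before doing the per-disk Storage vMotion.

```sh
# Find the PV bound to a given PVC (namespace and PVC name are placeholders)
kubectl -n my-namespace get pvc my-pvc -o jsonpath='{.spec.volumeName}'

# Read the CSI volume handle of that PV; for the vSphere CSI driver this is
# the CNS/FCD volume ID used to identify the disk on the vSphere side.
kubectl get pv pvc-e74383ca-fad3-479d-b91e-b283d9e872a0 \
  -o jsonpath='{.spec.csi.volumeHandle}'
```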

Hope this helps. It would also be good to have some official documentation on actions like these.

achontzo avatar Jan 13 '21 13:01 achontzo

I would like to have this feature, if it is indeed what I need. We have two StorageClasses whose datastores are on two different NFS stores.

My plan is to migrate between the two StorageClasses.
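For context, a rough sketch of what such a pair of StorageClasses can look like, assuming the `datastoreurl` parameter documented for the vSphere CSI driver; the class names and datastore URLs are placeholders:

```sh
# Hypothetical example: two StorageClasses pinned to different datastores
# via the vSphere CSI "datastoreurl" parameter (URLs are placeholders).
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-old
provisioner: csi.vsphere.vmware.com
parameters:
  datastoreurl: "ds:///vmfs/volumes/11111111-aaaaaaaa/"
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-new
provisioner: csi.vsphere.vmware.com
parameters:
  datastoreurl: "ds:///vmfs/volumes/22222222-bbbbbbbb/"
EOF
```

Note that a StorageClass only controls where new volumes get provisioned; existing PVs keep their volume handle, so the data itself still has to be moved on the vSphere side.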

farodin91 avatar Jan 13 '21 13:01 farodin91

@farodin91 I'm trying to understand why you want to relocate the volumes from one datastore to another. Is it because you want to decommission the datastore, or is it because you want to balance the capacity between the two datastores?

SandeepPissay avatar Apr 06 '21 22:04 SandeepPissay

Is it because you want to decommission the datastore, or is it because you want to balance the capacity between the two datastores?

I want to decommission datastores.

farodin91 avatar Apr 07 '21 06:04 farodin91

Is it because you want to decommission the datastore, or is it because you want to balance the capacity between the two datastores?

I want to decommission datastores.

Did my solution work out for you?

achontzo avatar Apr 07 '21 07:04 achontzo

@achontzo We tried out vMotion and it worked, but with an artifact on the datastore: before, the FCD was in a directory called fcd; now it's in a folder named after the originating VM. On the k8s side, we had to manually patch the storageclass.

We also started trying out the CnsRelocate command, but we got a weird error saying that relocate is disabled.

farodin91 avatar Apr 09 '21 18:04 farodin91

We have heard the following storage vMotion requirements for CNS volumes:

  1. Capacity load balancing between storage (could be mixed datastore types like VMFS, NFS, vSAN, vVol).
  2. Datastore maintenance mode support, so that all the CNS volumes can be storage vMotioned out of a datastore that will be decommissioned, prepared for a firmware upgrade, etc.
  3. Storage vMotion volumes from one datastore to another that could be in a different datacenter.

@farodin91 Could you validate if this captures your requirements?

SandeepPissay avatar May 05 '21 23:05 SandeepPissay

@SandeepPissay For our case, mainly 2 and 3 would best match our requirements.

farodin91 avatar May 19 '21 05:05 farodin91

@SandeepPissay For our case, mainly 2 and 3 would best match our requirements.

@farodin91 regarding requirement (3), do you have separate vCenters managing the datacenters or a single vCenter? I'm wondering if we are looking at cross vCenter vMotion.

SandeepPissay avatar May 19 '21 06:05 SandeepPissay

We have just a single vCenter in this case.

farodin91 avatar May 19 '21 06:05 farodin91

@achontzo We tried out vMotion and it worked, but with an artifact on the datastore: before, the FCD was in a directory called fcd; now it's in a folder named after the originating VM. On the k8s side, we had to manually patch the storageclass.

We actually have the same issue, and it bit us pretty hard...

Storage DRS vMotioned FCDs on our datastore cluster, and the VMDKs afterwards ended up directly in a VM's folder instead of the fcd folder.

We are also using Cluster API, and whenever the VM the VMDK was attached to gets killed (because of an upgrade, for example, which provisions new VMs and kills the old ones), the affected PVs are broken and cannot be used as CNS disks anymore.

There needs to be a warning sign somewhere: "Don't use Storage vMotion/DRS with CNS volumes, or they will break".

marratj avatar Jul 05 '21 09:07 marratj

For what it's worth, I am curious @marratj what you mean by the PVs getting broken? The PVs are FCDs under the covers, and vCenter maintains the link to the VMDK even after it is moved by Storage vMotion (sVM).

We had tested sVM with TKGI and CSI back in 2019 and had no issues moving PVs across nodes during upgrades. We found that the old PV VMDKs would be in the old VM folder even after that VM was deleted. Perhaps CAPV is doing something odd on VM delete with its attached volumes? The BOSH CPI just does a mass detach of all volumes (BOSH or foreign, like a K8s PV) prior to VM deletion.

svrc avatar Jul 27 '21 05:07 svrc

@SandeepPissay speaking from what I've seen in the past and other situations from our BOSH experience with TAS and TKGI:

  • different datastores on different compute clusters or different datacenters (ideally some day with different vCenters)
  • "shared nothing" cases, i.e. vSAN / Nutanix, where the compute clusters can't see each other's datastores and thus the sVM data transfer happens over the network rather than shared storage
  • the need to handle VM deletion after sVM, i.e. move the sVM'd VMDK back to a predictable folder on the datastore (e.g. "fcd") so it is in a known location rather than a stale VM folder

svrc avatar Jul 27 '21 05:07 svrc

@svrc "broken" means that the CSI driver cannot mount the volume anymore.

```
(*types.LocalizedMethodFault)(0xc0009abba0)({
  DynamicData: (types.DynamicData) {
  },
  Fault: (*types.NotFound)(0xc0009abbc0)({
    VimFault: (types.VimFault) {
      MethodFault: (types.MethodFault) {
        FaultCause: (*types.LocalizedMethodFault)(nil),
        FaultMessage: ([]types.LocalizableMessage) nil
      }
    }
  }),
  LocalizedMessage: (string) (len=50) "The object or item referred to could not be found."
}). opId: "72b115b
```

The thing is that SDRS moves the VMDK files out of the original fcd folder where they were created into the VM-specific folder on the new datastore they are migrated to; new datastore, new folder, even a new VMDK name (e.g. a disk gets renamed from fcd/839395e8712e46f285d309818e0eb22f.vmdk to vmname/vmname_2.vmdk during the migration to the new datastore).

We were already in contact with VMware support about this, and they confirmed that Storage DRS breaks the CNS/FCD relationship in a way that the CSI driver can no longer find the volume; the only option for now is to keep SDRS disabled.
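For anyone debugging this, one way to check whether vCenter can still resolve the FCD behind a PV is to look it up by ID; this is a sketch assuming a reasonably recent govc build that includes the FCD (disk.*) and CNS (volume.*) subcommands, with placeholder names:

```sh
# Volume ID as seen by Kubernetes (PV name is a placeholder)
ID=$(kubectl get pv pvc-e74383ca-fad3-479d-b91e-b283d9e872a0 \
  -o jsonpath='{.spec.csi.volumeHandle}')

# Ask vCenter about the First Class Disk / CNS volume with that ID.
# A "not found" style error after an SDRS move indicates the FCD/CNS
# metadata no longer matches the relocated VMDK.
govc disk.ls -ds my-datastore "$ID"
govc volume.ls "$ID"
```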

marratj avatar Aug 03 '21 07:08 marratj

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 01 '21 08:11 k8s-triage-robot

/remove-lifecycle stale

tgelter avatar Nov 01 '21 14:11 tgelter

@marratj did you recover the disks that were moved by DRS? How?

McAndersDK avatar Dec 16 '21 19:12 McAndersDK

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 16 '22 19:03 k8s-triage-robot

/remove-lifecycle stale

tgelter avatar Mar 16 '22 20:03 tgelter

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 14 '22 21:06 k8s-triage-robot

/remove-lifecycle stale

neuromantik33 avatar Jun 19 '22 20:06 neuromantik33

I've recently been dealing with this as well and was able to get around it after some troubleshooting. This may not solve others' issues, but I wanted to share. In my case, I was getting errors when vCenter tried to detach volumes, and it was due to a snapshot of the backing VM that was associated with the mount. As soon as I deleted the snapshot, all my errors went away, the mount detached/attached as intended, and Kubernetes was happy again.
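In case it helps others hitting the same symptom, a rough sketch of checking for and removing such a snapshot with govc (the VM and snapshot names are placeholders, and the govc snapshot subcommands are assumed to be available in your build):

```sh
# See which node VM the stuck volume is attached to
kubectl get volumeattachment

# List the snapshots on that node VM and remove the offending one
govc snapshot.tree -vm k8s-worker-05
govc snapshot.remove -vm k8s-worker-05 "stale-snapshot-name"
```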

De1taE1even avatar Aug 24 '22 18:08 De1taE1even

This missing feature renders any realistic vSphere CSI use case broken. You can't even migrate data to a new datastore when the old one gets decommissioned, and on virtualized infrastructure that's a daily operation. @svrc's question is very valid: it's unclear why the CSI driver isn't able to find migrated FCDs after they have been moved. Wasn't the whole point of FCDs to make VMDKs identifiable by moref/moid/uuid just like any other ManagedObject in the vSphere API? Why are display-name (!) paths used to identify the relevant objects for the CSI driver (node VMs, FCDs)? I would be really interested in the design decision behind that.

omniproc avatar Oct 25 '22 07:10 omniproc

Does https://github.com/vmware-samples/cloud-native-storage-self-service-manager fix this problem? I have the feeling that's the case.

gn-smals avatar Nov 29 '22 09:11 gn-smals

Yes, the CNS Self Service Manager is available to help relocate volumes from one datastore to another. Refer to:

  • https://github.com/vmware-samples/cloud-native-storage-self-service-manager/releases/tag/v0.1.0
  • https://github.com/vmware-samples/cloud-native-storage-self-service-manager/blob/main/docs/book/features/storage_vmotion.md

divyenpatel avatar Nov 30 '22 20:11 divyenpatel

Actually, this tool leads to the exact same issue, with FCDs landing in the wrong folder on the new datastore: https://github.com/vmware-samples/cloud-native-storage-self-service-manager/issues/19

hc2p avatar Jul 05 '23 11:07 hc2p

FYI, at least in vSphere 8.x (and maybe in some 7.x update patch too) you can perform an FCD migration right from the UI, in the CNS volumes view of the vSphere cluster. The mentioned cns-self-service tool is not worth your time...

omniproc avatar Dec 30 '23 12:12 omniproc