
Need guidance regarding datastore space exhaustion avoidance

Open tgelter opened this issue 4 years ago • 13 comments

/kind feature

What happened: A datastore filled up when large PVC-backed disks were heavily used. Thick-provisioned FCDs and/or Storage DRS might have prevented this issue from occurring.

What you expected to happen: The disks backing PVC volumes which are managed by the vSphere CSI Driver should be thick-provisioned, Storage DRS should be supported, and/or the driver should provide a gate to prevent storage over-subscription (but not necessarily over-provisioning).

How to reproduce it (as minimally and precisely as possible):

  • Create a datastore cluster with 2x10TB datastores
  • Configure a StorageClass & VM disk placement policy that allows PVC volumes to be created on the datastore cluster (roughly the shape sketched after this list)
  • Create some 4Ti PVCs and fill the disk space on those volumes from the Pods that mount the filesystems
  • Observe that the affected VMs are paused, with vSphere asking for space to be freed up before they can be resumed
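
For reference, a rough sketch of the StorageClass and PVC shapes involved (the names and the storage policy below are placeholders, not our exact manifests):

```yaml
# Placeholder StorageClass pointing at an SPBM / VM storage policy that targets
# the datastore cluster; the policy name is illustrative, not our real one.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kafka-datastore-cluster
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "k8s-datastore-cluster-policy"
---
# One of the 4Ti volumes from the steps above; volumes are thin-provisioned
# by default, which is what makes over-subscription possible.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kafka-data-0
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: kafka-datastore-cluster
  resources:
    requests:
      storage: 4Ti
```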

Anything else we need to know?: We run 3x-replicated Kafka clusters across 3 StorageClasses, each backed by a distinct datastore cluster. In one case, all three 4Ti disks landed on a single datastore (the other was left unused), which, combined with the lack of thick-provisioning and Storage DRS support, caused impact to Kafka.

Environment:

  • csi-vsphere version: v2.2.0
  • vsphere-cloud-controller-manager version: CPI v1.19
  • Kubernetes version: 1.19
  • vSphere version: 7.0U2
  • OS (e.g. from /etc/os-release): Flatcar Container Linux by Kinvolk 2605.12.0 (Oklo)
  • Kernel (e.g. uname -a): 5.4.92-flatcar
  • Install tools: Home-grown Python (vanilla Kubernetes + vSphere CSI Driver + vSphere CPI)
  • Others: Please ask

tgelter avatar Aug 04 '21 17:08 tgelter

@tgelter we are working on supporting thick provisioned volumes. We do not know when this will be out in a vSphere release.

One question - what do you mean by over subscription?

SandeepPissay avatar Aug 19 '21 00:08 SandeepPissay

https://docs.vmware.com/en/VMware-vSphere/6.0/com.vmware.vsphere.storage.doc/GUID-D18A4449-6C05-49C1-BE5D-3AAE29F0A681.html Basically, with thin-provisioned volumes it's possible to provision more storage to virtual machines than the datastore actually has available; for example, three 4Ti thin volumes committed against a single 10TB datastore. Once consumption reaches 100% of the available capacity, the VMs are paused. If there's more info I can provide, please let me know!

tgelter avatar Aug 19 '21 16:08 tgelter

@SandeepPissay, we're nearly running into this issue again and don't know how to avoid it in the future, short of tediously modifying Datastore StorageType tags as volumes are being created in the environment. Can we please get an update on support for thick-provisioning vSphere CSI-managed volumes?

tgelter avatar Oct 19 '21 20:10 tgelter

@tgelter I understand your situation; many VMware customers have asked for this feature, and it is high on our priority list. We are working on enhancing our storage stack in vSphere to support it, but I do not know in which vSphere release it will land. In general, we do not share such information publicly. You may want to get in touch with the VMware accounts team; they may be able to get you that info.

SandeepPissay avatar Oct 19 '21 22:10 SandeepPissay

Thanks @SandeepPissay, our VMware team has submitted SR #21270088710 for thick-provisioned PVC volume support (and SR #21269099510 for storage vMotion support) to hopefully help get this onto the near-term roadmap. We appreciate all the great work being done by the team, but filled datastores are particularly impactful, so I'm sure you can appreciate our concern.

tgelter avatar Oct 26 '21 16:10 tgelter

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 24 '22 17:01 k8s-triage-robot

/remove-lifecycle stale

tgelter avatar Jan 24 '22 17:01 tgelter

Hello @SandeepPissay et al., we saw another instance of this issue yesterday, despite our efforts to avoid it (free-space monitoring, a storage placement policy targeting a storage pod with several large datastores, educating the users of the PVCs, etc.). Until we have support for Storage DRS and thick-provisioned volumes, there's not much more we can do to avoid impacting our users. Note that the impact can be severe: VMs are paused, data cannot be written, and recovery is slow because blocks have to be re-replicated across Ethernet networks. Can we please get an updated estimate of when VMware expects to address the concerns outlined in this issue?

tgelter avatar Mar 15 '22 15:03 tgelter

@tgelter We are looking at the vSphere 8.0 timeframe for supporting thick volume provisioning. In parallel, we are working on tooling to move (svMotion) volumes from one datastore to another. Please get in touch with the engineering and PM teams via the VMware accounts team for details.

SandeepPissay avatar Mar 15 '22 17:03 SandeepPissay

Thanks again @SandeepPissay. We're in touch with VMware through the regular support channels as well, and we'll keep an eye out in both places for news about this issue. Enjoy your day!

tgelter avatar Mar 15 '22 17:03 tgelter

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 13 '22 18:06 k8s-triage-robot

/remove-lifecycle stale

tgelter avatar Jul 11 '22 19:07 tgelter

@SandeepPissay any news on this tool for moving volumes between datastore clusters? According to our rep it is close to release, but until it ships we are unable to get access to it. Our use case is migrating to new datastores in a new datastore cluster. It seems that the association in CNS keeps the old FCD path on 7u3.

braunsonm avatar Sep 16 '22 19:09 braunsonm

/lifecycle frozen

tgelter avatar Nov 29 '22 22:11 tgelter

@tgelter we added support for thick volume provisioning in vSphere 8.0 for VMFS datastores. See https://core.vmware.com/resource/whats-new-vsphere-8-core-storage#sec21636-sub5. With this, you should be able to define an SPBM policy with the FullyInitialized capability and use that policy in a Kubernetes StorageClass. Note that existing volumes need to be offline if you want to convert them from thin to thick.
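
For example, a StorageClass along these lines (a minimal sketch; "thick-fully-initialized" is a placeholder for an SPBM policy you would first create in vCenter with the FullyInitialized rule):

```yaml
# Placeholder StorageClass referencing an SPBM policy that carries the
# FullyInitialized capability, so newly provisioned volumes are thick.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: thick-provisioned
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "thick-fully-initialized"
```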

For moving CNS volumes from one datastore to another, we released CNS Manager - https://github.com/vmware-samples/cloud-native-storage-self-service-manager

SandeepPissay avatar Nov 30 '22 20:11 SandeepPissay

I was unaware of the new SPBM policy option. This looks like great news! I'll check with our virtualization team to make sure they are aware of it and start testing with them to determine whether the combination of tools alleviates our concerns. Thanks very much!

tgelter avatar Nov 30 '22 22:11 tgelter

@tgelter do you think we can close this issue now since you have a solution to create thick volumes and also move them between datastores?

SandeepPissay avatar Nov 30 '22 22:11 SandeepPissay

@SandeepPissay, I checked with our internal teams on this. The feeling is that we won't have room in our roadmap to test this until Q1 next year, so go ahead and close this issue. Thanks very much to all involved in making these very important improvements to the enterprise vSphere CSI experience!

tgelter avatar Dec 01 '22 18:12 tgelter