vsphere-csi-driver
When a VMware node goes down, Kubernetes is unaware and the PV isn't released
Migrated from https://github.com/kubernetes/cloud-provider-vsphere/issues/185
Initially filed by: @d4larso
Is this a BUG REPORT or FEATURE REQUEST?: /kind bug
What happened: If a VMware node with a PV crashes, is shut down, or is halted, Kubernetes is unaware and doesn't release the pod's claim (PVC) on the PV.
What you expected to happen: The PV should become "unattached" from the node so it can be picked up for use by a Pod on another node.
How to reproduce it (as minimally and precisely as possible): Shut down or halt a VMware VM and note that the associated PV can't be used by other nodes until the claim on the PV is deleted (a scripted version of this step is sketched after the environment details below).
Anything else we need to know?:
Environment:
vsphere-cloud-controller-manager version: vSphere Client version 6.7.0.30000
OS (e.g. from /etc/os-release): Ubuntu 16.04.6 LTS
Kernel (e.g. uname -a): Linux us03479-vsp1-m01 4.4.0-145-generic #171-Ubuntu SMP Tue Mar 26 12:43:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Others:
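For reference, the "shut down a VM" step can also be scripted against vCenter. This is a minimal, hypothetical sketch using govmomi; the vCenter URL, credentials, and VM inventory path are placeholders and not taken from this issue, and the same effect can of course be achieved from the vSphere UI.

```go
// Hypothetical repro helper: power off the worker VM through vCenter so the
// node backing a Pod with an attached PV goes away.
package main

import (
	"context"
	"log"
	"net/url"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/find"
)

func main() {
	ctx := context.Background()

	// Placeholder vCenter endpoint and credentials.
	u, err := url.Parse("https://administrator%40vsphere.local:password@vcenter.example.com/sdk")
	if err != nil {
		log.Fatal(err)
	}

	// Connect to vCenter (true = skip certificate verification).
	client, err := govmomi.NewClient(ctx, u, true)
	if err != nil {
		log.Fatal(err)
	}

	// Locate the VM that backs the Kubernetes node holding the PV
	// (absolute inventory path, placeholder names).
	finder := find.NewFinder(client.Client, true)
	vm, err := finder.VirtualMachine(ctx, "/DC/vm/k8s-worker-1")
	if err != nil {
		log.Fatal(err)
	}

	// Hard power-off, equivalent to halting the node.
	task, err := vm.PowerOff(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if err := task.Wait(ctx); err != nil {
		log.Fatal(err)
	}
	log.Println("VM powered off; the PV stays attached until the claim/attachment is cleaned up")
}
```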
/assign
Currently looking at this issue...
I recreated this issue in my environment on Kubernetes 1.14.1, and it comes down to a bug in k/k where no release of the attachment is triggered when the node goes down. This has been observed with other storage providers as well. According to the k/k issue, people are experiencing this on 1.13.x too.
Addressing this is dependent on this issue getting resolved: https://github.com/kubernetes/kubernetes/issues/51835
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
@dvonthenen Unfortunately this is less of a bug than it is a limitation of the current design. The AttachDetachController is being very careful to wait for a confirmed unmount before detaching a volume from a node. That confirmation cannot come if the kubelet is unavailable.
In the special case where the VM is currently powered off, it is safe to say the volume is not mounted. However, determining power state of a VM is provider-specific, and not exposed in a common API.
Take a look at https://github.com/kubernetes/enhancements/pull/1116
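To illustrate the provider-specific part: with govmomi the power state can be read directly from the VM object. A minimal sketch, assuming a connected *govmomi.Client as in the repro snippet above; the function name and imports are mine, not part of any existing Kubernetes API.

```go
// Hypothetical helper: report whether the VM backing a node is powered off,
// which is the one case where detaching without kubelet confirmation is safe.
// Imports: context, github.com/vmware/govmomi, github.com/vmware/govmomi/find,
// github.com/vmware/govmomi/vim25/types.
func isPoweredOff(ctx context.Context, client *govmomi.Client, vmPath string) (bool, error) {
	finder := find.NewFinder(client.Client, true)
	vm, err := finder.VirtualMachine(ctx, vmPath)
	if err != nil {
		return false, err
	}
	state, err := vm.PowerState(ctx)
	if err != nil {
		return false, err
	}
	return state == types.VirtualMachinePowerStatePoweredOff, nil
}
```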
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
@yastij I'm not sure if you are still looking at this issue upstream or if someone else is. If you shouldn't be assigned to this issue, could you help me find the correct person? That would be awesome and I would appreciate it.
/assign @yastij
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen /remove-lifecycle rotten /lifecycle frozen
@yastij: Reopened this issue.
In response to this:
/reopen /remove-lifecycle rotten /lifecycle frozen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The workaround for this is to cordon+drain the node on failure after X minutes. This can be automated by a simple controller loop or script that watches node conditions; a rough sketch follows.
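For anyone who wants to script that, here is a rough, hypothetical sketch of such a loop using client-go. The 10-minute threshold stands in for the "X minutes" above, the kubeconfig path is assumed, and it only cordons; draining/eviction is left to kubectl drain or the eviction API.

```go
// Hypothetical sketch: poll node conditions and cordon nodes that have been
// NotReady/Unknown for longer than a threshold, so their pods and volumes can
// be drained and rescheduled elsewhere.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const notReadyThreshold = 10 * time.Minute // the "X minutes" from the comment above

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	for {
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			fmt.Println("list nodes:", err)
			time.Sleep(30 * time.Second)
			continue
		}
		for i := range nodes.Items {
			node := &nodes.Items[i]
			for _, cond := range node.Status.Conditions {
				// Only act on a Ready condition that is not True (False or Unknown).
				if cond.Type != corev1.NodeReady || cond.Status == corev1.ConditionTrue {
					continue
				}
				if time.Since(cond.LastTransitionTime.Time) < notReadyThreshold {
					continue
				}
				if !node.Spec.Unschedulable {
					node.Spec.Unschedulable = true // cordon
					if _, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
						fmt.Println("cordon", node.Name, ":", err)
					} else {
						fmt.Println("cordoned", node.Name, "- drain it to release volume attachments")
					}
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```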
We are also facing this issue
/assign