
When a VMware node goes down, Kubernetes is unaware and the PV isn't released

Open dvonthenen opened this issue 5 years ago • 16 comments

Migrated from https://github.com/kubernetes/cloud-provider-vsphere/issues/185

Initially filed by: @d4larso

Is this a BUG REPORT or FEATURE REQUEST?: /kind bug

What happened: If a VMware node with an attached PV crashes, is shut down, or is halted, Kubernetes is unaware and doesn't release the pod's claim (PVC) on the PV.

What you expected to happen: The PV should become detached from the node so it can be picked up for use by a Pod on another node.

How to reproduce it (as minimally and precisely as possible): Shut down or halt a VMware VM and note that the associated PV can't be used by other nodes until the claim on the PV is deleted.

Anything else we need to know?:

Environment:

vsphere-cloud-controller-manager version: vSphere Client version 6.7.0.30000
OS (e.g. from /etc/os-release): Ubuntu 16.04.6 LTS
Kernel (e.g. uname -a): Linux us03479-vsp1-m01 4.4.0-145-generic #171-Ubuntu SMP Tue Mar 26 12:43:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Others:

dvonthenen avatar May 09 '19 20:05 dvonthenen

/assign

dvonthenen avatar May 09 '19 20:05 dvonthenen

Currently looking at this issue...

dvonthenen avatar May 09 '19 20:05 dvonthenen

I recreated this issue in my environment on 1.14.1, and it comes down to a bug in k/k where the volume is never released when the node goes down. This has been observed with other storage providers. According to the k/k issue, people are experiencing this on 1.13.X as well.

Addressing this is dependent on this issue getting resolved: https://github.com/kubernetes/kubernetes/issues/51835

dvonthenen avatar May 10 '19 19:05 dvonthenen

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Aug 08 '19 20:08 fejta-bot

/remove-lifecycle stale

d4larso avatar Aug 13 '19 17:08 d4larso

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Nov 11 '19 17:11 fejta-bot

@dvonthenen Unfortunately this is less of a bug than it is a limitation of the current design. The AttachDetachController is being very careful to wait for a confirmed unmount before detaching a volume from a node. That confirmation cannot come if the kubelet is unavailable.

In the special case where the VM is currently powered off, it is safe to say the volume is not mounted. However, determining power state of a VM is provider-specific, and not exposed in a common API.
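For CSI volumes, one way to observe what the controller is waiting on is to list the VolumeAttachment objects for the affected node; they stay in place while the detach is blocked. Below is a minimal illustrative sketch using a recent client-go (the kubeconfig path and the node-name argument are placeholder assumptions, not part of any fix discussed here):

```go
// Illustrative only: list VolumeAttachment objects and report which volumes
// are still recorded as attached to a given node.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	nodeName := os.Args[1] // placeholder: name of the powered-off node

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// VolumeAttachments track CSI attach state; they linger while the
	// attach/detach controller waits for an unmount confirmation that an
	// unreachable kubelet can never send.
	vas, err := client.StorageV1().VolumeAttachments().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, va := range vas.Items {
		if va.Spec.NodeName != nodeName {
			continue
		}
		pv := "<unknown source>"
		if va.Spec.Source.PersistentVolumeName != nil {
			pv = *va.Spec.Source.PersistentVolumeName
		}
		fmt.Printf("%s  pv=%s  attached=%v\n", va.Name, pv, va.Status.Attached)
	}
}
```

On a cluster exhibiting this issue, the attachment for the downed node keeps showing up until something breaks the deadlock described above.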

Take a look at https://github.com/kubernetes/enhancements/pull/1116

misterikkit avatar Nov 14 '19 23:11 misterikkit

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot avatar Dec 15 '19 00:12 fejta-bot

@yastij I'm not sure if you are still looking at this issue upstream or if there is someone else. If you shouldn't be assigned this issue, could you help me find the correct person? That would be awesome and I would appreciate it.

/assign @yastij

dvonthenen avatar Dec 15 '19 01:12 dvonthenen

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

fejta-bot avatar Jan 14 '20 01:01 fejta-bot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 14 '20 01:01 k8s-ci-robot

/reopen
/remove-lifecycle rotten
/lifecycle frozen

yastij avatar Jan 14 '20 09:01 yastij

@yastij: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 14 '20 09:01 k8s-ci-robot

The workaround for this is to cordon+drain the node on failure after X minutes. This can be automated by a simple controller script loop that watches for node conditions.
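A minimal sketch of such a controller loop using client-go is below. The 10-minute threshold, poll interval, and force-delete behavior are illustrative assumptions only; a real implementation should honor PodDisruptionBudgets and skip DaemonSet and mirror pods before deleting anything.

```go
// Illustrative sketch: watch node conditions and cordon + force-drain a node
// that has been NotReady for longer than a threshold, so its volumes can be
// detached and reused elsewhere.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const (
	notReadyThreshold = 10 * time.Minute // the "X minutes" from the comment above (arbitrary)
	pollInterval      = 1 * time.Minute
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	for {
		nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err == nil {
			for i := range nodes.Items {
				handleNode(client, &nodes.Items[i])
			}
		}
		time.Sleep(pollInterval)
	}
}

func handleNode(client kubernetes.Interface, node *corev1.Node) {
	for _, cond := range node.Status.Conditions {
		if cond.Type != corev1.NodeReady || cond.Status == corev1.ConditionTrue {
			continue
		}
		if time.Since(cond.LastTransitionTime.Time) < notReadyThreshold {
			return
		}
		ctx := context.TODO()

		// Cordon: mark the node unschedulable so no new pods land on it.
		if !node.Spec.Unschedulable {
			node.Spec.Unschedulable = true
			if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
				fmt.Println("cordon failed:", err)
				return
			}
		}

		// "Drain": force-delete pods bound to the dead node so their claims
		// can eventually be detached and reattached on another node.
		pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
			FieldSelector: "spec.nodeName=" + node.Name,
		})
		if err != nil {
			return
		}
		grace := int64(0)
		for _, pod := range pods.Items {
			_ = client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name,
				metav1.DeleteOptions{GracePeriodSeconds: &grace})
		}
		return
	}
}
```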

svrc avatar Feb 12 '20 18:02 svrc

We are also facing this issue

vasicvuk avatar Jun 12 '20 06:06 vasicvuk

/assign

briantopping avatar Mar 07 '23 19:03 briantopping