
Volume lifecycle does not correspond to actual k8s behavior

Open YuikoTakada opened this issue 3 years ago • 8 comments

Current CSI volume lifecycle is not designed for the case when Node is unreachable.

When a Node is shut down or in a non-recoverable state such as hardware failure or a broken OS, the Node Plugin cannot issue NodeUnpublishVolume / NodeUnstageVolume. In this case, we want to move the volume to the CREATED status (the volume is detached from the node, and pods are evicted to another node and running). But in the current CSI volume lifecycle, there is no transition from PUBLISHED / VOL_READY / NODE_READY to CREATED. As a result, k8s doesn't follow the CSI spec: the status moves from PUBLISHED to CREATED directly, without going through the VOL_READY and/or NODE_READY statuses.

We need to update the CSI volume lifecycle to account for the case when a Node is unreachable.
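To make the gap concrete, here is a minimal sketch of the lifecycle as a state machine, for a plugin with the STAGE_UNSTAGE_VOLUME capability. The state and RPC names come from the CSI spec's lifecycle diagram; the transition table itself is my reading of that diagram, not normative text:

```python
# Transitions defined by the CSI spec's dynamic-provisioning lifecycle
# (plugin with STAGE_UNSTAGE_VOLUME). Illustrative sketch only.
TRANSITIONS = {
    ("CREATED", "ControllerPublishVolume"): "NODE_READY",
    ("NODE_READY", "NodeStageVolume"): "VOL_READY",
    ("VOL_READY", "NodePublishVolume"): "PUBLISHED",
    ("PUBLISHED", "NodeUnpublishVolume"): "VOL_READY",
    ("VOL_READY", "NodeUnstageVolume"): "NODE_READY",
    ("NODE_READY", "ControllerUnpublishVolume"): "CREATED",
}

def next_state(state, rpc):
    """Return the next lifecycle state, or None if the spec defines no such transition."""
    return TRANSITIONS.get((state, rpc))

# The problem described above: there is no spec-sanctioned edge that
# takes a PUBLISHED volume straight back to CREATED when the node is
# unreachable and the Node* teardown RPCs can never run.
assert next_state("PUBLISHED", "ControllerUnpublishVolume") is None
```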

YuikoTakada avatar Jun 15 '22 03:06 YuikoTakada

At first glance, this seems like a breaking change to the spec. As this (CSI) is not a k8s project, it is not bound by the referenced KEP.

That said, others in the CSI community have thought long and hard about similar problems. So perhaps there's a way to amend the spec in a way that doesn't break existing drivers (or the integration w/ other platforms and said drivers).

Community feedback here is welcome.

On Tue, Jun 14, 2022 at 11:46 PM Yuiko Mouri @.***> wrote:

Since the non-graceful node shutdown feature has been implemented in k8s as alpha, the CSI volume lifecycle does not correspond to actual k8s behavior. See:

https://kubernetes.io/blog/2022/05/20/kubernetes-1-24-non-graceful-node-shutdown-alpha/

https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown

With non-graceful node shutdown, the status can be moved from PUBLISHED to CREATED without going through the VOL_READY and/or NODE_READY statuses. The KEP describes it as follows:

  • Once pods are selected and forcefully deleted, the attachdetach reconciler should check the out-of-service taint on the node. If the taint is present, the attachdetach reconciler will not wait for 6 minutes to do force detach. Instead it will force detach right away and allow volumeAttachment to be deleted.
  • This would trigger the deletion of the volumeAttachment objects. For CSI drivers, this would allow ControllerUnpublishVolume to happen without NodeUnpublishVolume and/or NodeUnstageVolume being called first. Note that there is no additional code changes required for this step. This happens automatically after the Proposed change in the previous step to force detach right away.
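The reconciler decision in the first bullet can be sketched roughly as follows. The taint key `node.kubernetes.io/out-of-service` and the 6-minute wait are from the KEP; the function and parameter names are illustrative, not the actual attachdetach controller API:

```python
# Sketch of the KEP's force-detach decision (assumed names, not real
# kube-controller-manager code).

MAX_WAIT_FOR_UNMOUNT = 6 * 60  # seconds the controller normally waits

def should_force_detach_now(node_taints, seconds_since_node_unreachable):
    # KEP behavior: the out-of-service taint short-circuits the wait
    # entirely, so detach (and VolumeAttachment deletion) happens at once.
    if "node.kubernetes.io/out-of-service" in node_taints:
        return True
    # Pre-existing behavior: force detach only after the 6-minute timeout.
    return seconds_since_node_unreachable >= MAX_WAIT_FOR_UNMOUNT
```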



jdef avatar Jun 15 '22 04:06 jdef

@jdef Thank you for your comments.

I understand your point.

At first glance, this seems like a breaking change to the spec. As this (CSI) is not a k8s project, it is not bound by the referenced KEP.

I want to find a good solution which doesn't break existing drivers. #477 also seems to be trying to solve this problem. WDYT?

YuikoTakada avatar Jun 16 '22 00:06 YuikoTakada

This would trigger the deletion of the volumeAttachment objects. For CSI drivers, this would allow ControllerUnpublishVolume to happen without NodeUnpublishVolume and/or NodeUnstageVolume being called first. Note that there is no additional code changes required for this step. This happens automatically after the Proposed change in the previous step to force detach right away.

Note that this "force detach" behavior is not introduced by Non-Graceful Node Shutdown feature. Kubernetes already supports this behavior without Non-Graceful Node Shutdown. See "Test 2" in the PR description section below. By forcefully deleting the Pods on the shutdown node manually, volumes will be force-detached after a 6 minute wait by the Attach Detach Controller.

https://github.com/kubernetes/kubernetes/pull/108486

xing-yang avatar Jun 21 '22 15:06 xing-yang

Yes, Kubernetes already breaks the CSI spec and can call ControllerUnpublish without NodeUnpublish / NodeUnstage succeeding if Kubernetes thinks the node is broken - it can't really call NodeUnstage/NodeUnpublish in that case, or get their results.

The last official attempt to fix this in CSI is https://github.com/container-storage-interface/spec/pull/477.

jsafrane avatar Jun 23 '22 12:06 jsafrane

Yes, Kubernetes already breaks the CSI spec and can call ControllerUnpublish without NodeUnpublish / NodeUnstage succeeding if Kubernetes thinks the node is broken - it can't really call NodeUnstage/NodeUnpublish in that case, or get their results.

Implementor of Nomad's CSI support here! 👋 For what it's worth, we originally implemented the spec as written, and it turned out to cause our users a lot of grief. As of Nomad 1.3.0 (shipped in May of this year), we're doing something similar to what k8s has done, where we make a best-effort attempt to NodeUnpublish / NodeUnstage before ControllerUnpublish.

We drive this from the "client node" (our equivalent of the kubelet), so if the client node is merely disconnected and not dead, we can rely on the node unpublish/unstage having happened by the time we try to GC the claim from the control plane side. The control plane ends up retrying the NodeUnstage/NodeUnpublish anyways, but proceeds on to ControllerUnpublish if it can't reach the node.
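The control-plane side of that best-effort flow might look roughly like this. All names here are assumptions for illustration, not Nomad's actual code:

```python
# Best-effort teardown: try the node-side RPCs, but proceed to
# ControllerUnpublish even when the node is unreachable. (Hypothetical
# sketch; not Nomad's real implementation.)

class NodeUnreachable(Exception):
    """Raised when the client node cannot be contacted."""

def release_volume(node_client, controller_client, vol_id, node_id):
    try:
        node_client.node_unpublish(vol_id)
        node_client.node_unstage(vol_id)
    except NodeUnreachable:
        # The node may be dead rather than merely disconnected. Detaching
        # at the controller anyway deviates from the spec's strict RPC
        # ordering, but avoids a volume stuck attached to a dead node.
        pass
    controller_client.controller_unpublish(vol_id, node_id)
```

If the node is merely disconnected, the node-side agent will already have done the unpublish/unstage by the time this runs, so the deviation only matters for truly dead nodes.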

tgross avatar Sep 12 '22 13:09 tgross

Thanks @tgross.

Are there any concerns from plugin providers that may be relying on CSI-as-written vs. the best-effort described herein?

Also Mesos is another CO w/ CSI integration - anyone on that side of the house have input to add here?

jdef avatar Sep 12 '22 14:09 jdef

Are there any concerns from plugin providers that may be relying on CSI-as-written vs. the best-effort described herein?

IIRC all the plugins I've tested that support ControllerPublish implement that step as "make the API call to the storage provider to attach the volume as a device on the target host" and then implement NodeStage as "call mount(1) to mount that device". So when unpublishing, if the unmount is missed, the API call will try to detach a mounted device.

I know that, for example, the AWS EBS provider just merrily returns OK to that API call and then the device doesn't actually get detached until it's unmounted. (Or the user can "force detach" via the API out-of-band.) So in that case the provider is graceful and has behavior that's eventually correct, so long as that node unpublish happens eventually. If the node unpublish never happens (say the CO has crashed unrecoverably but the host is still live), I think you end up with a hung volume. But arguably that's the right behavior. I just don't know how prevalent that graceful treatment is across the ecosystem.
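A toy model of that "graceful, eventually correct" provider behavior, purely for illustration (this is not the real EBS API):

```python
# Hypothetical model of a provider that accepts a detach request while
# the device is still mounted, and only completes the detach once the
# node-side unmount happens.

class Volume:
    def __init__(self):
        self.mounted = True           # node never ran NodeUnpublish/Unstage
        self.detach_requested = False

    def controller_detach(self):
        self.detach_requested = True  # the API merrily returns OK
        return "OK"

    def node_unmount(self):
        self.mounted = False

    @property
    def detached(self):
        # Eventually correct: the detach completes only after the unmount.
        # If the unmount never happens, the volume hangs here forever.
        return self.detach_requested and not self.mounted
```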

tgross avatar Sep 14 '22 13:09 tgross

@tgross Thank you for sharing this information. I've updated the description so that it applies to COs as a whole; this issue is not specific to Kubernetes.

YuikoTakada avatar Sep 20 '22 11:09 YuikoTakada