
PersistentVolumes stuck after node consolidation / termination

jmdeal opened this issue 1 year ago • 13 comments

Description

Observed Behavior: After Karpenter performs a scale-down event, PersistentVolumes may fail to detach from the deleted nodes. As a result, pods that depend on these PersistentVolumes are left in a Pending state until the Attach/Detach controller forcibly detaches the PV after 6 minutes.
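One way to observe this (a hedged illustration; the node name below is hypothetical) is to list the VolumeAttachment objects that still reference the terminated node:

# VolumeAttachments that still reference the terminated node keep dependent pods Pending
kubectl get volumeattachments | grep ip-10-0-0-1.ec2.internal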

This is related to the following upstream kubernetes issue:

  • https://github.com/kubernetes/kubernetes/issues/115148

While this is an upstream issue, it has had real impact on Karpenter users and could potentially be addressed as part of Karpenter's node drain logic.

Expected Behavior: All persistent volumes should be detached from the node as part of the termination process.

Additional References:

  • https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1665
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

jmdeal avatar Jan 17 '24 22:01 jmdeal

On the AWS side this has been fixed via this PR to the aws-ebs-csi-driver, at least for graceful terminations: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1736. However, this doesn't address the issue for CSI drivers at large. Curious if any of the Azure folks have run into similar issues (cc @tallaxes @Bryce-Soghigian @jackfrancis)?

jmdeal avatar Jan 17 '24 23:01 jmdeal

/remove-label needs-triage

jmdeal avatar Jan 17 '24 23:01 jmdeal

@jmdeal: The label(s) /remove-label needs-triage cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/remove-label needs-triage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jan 17 '24 23:01 k8s-ci-robot

We've got some updates to add here, at least for AWS users. This problem can be largely mitigated by ensuring your nodes are configured for graceful node shutdown. Currently this isn't enabled by default on AL2; the following UserData should be sufficient to enable it:

#!/bin/bash
# Give systemd-logind enough delay on shutdown for kubelet to drain the node
echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
systemctl restart systemd-logind

# Enable kubelet graceful node shutdown by setting the shutdown grace periods
echo "$(jq ".shutdownGracePeriod=\"45s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq ".shutdownGracePeriodCriticalPods=\"15s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json

Note: This is specific to the AL2 EKS optimized AMI; enabling graceful termination will differ depending on your AMI. There may still be some delay in EBS detachment, but this should solve the 6-minute detach delay caused by the k8s side of things.
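A quick way to sanity-check that graceful shutdown is actually active on a node (a hedged check; exact output varies by OS and kubelet version) is to confirm kubelet registered a shutdown inhibitor with systemd-logind:

# kubelet takes a systemd inhibitor lock while graceful node shutdown is enabled
systemd-inhibit --list | grep -i kubelet

# confirm the logind delay was applied
grep InhibitDelayMaxSec /etc/systemd/logind.conf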

jmdeal avatar Feb 13 '24 18:02 jmdeal

@jmdeal I tried using the userdata on AL2023 and the 6 minute hang still happens 😢

levanlongktmt avatar Mar 04 '24 04:03 levanlongktmt

Any suggestions on the correct userdata for EKS Bottlerocket AMIs? Can't use what's suggested above because Bottlerocket doesn't have the systemd-logind service.

johnjeffers avatar Apr 09 '24 17:04 johnjeffers

The AWS provider has supported configuring graceful shutdown for Bottlerocket since v0.31.0 (https://github.com/aws/karpenter-provider-aws/pull/4571), and as far as I can tell Bottlerocket should have systemd-logind (https://github.com/bottlerocket-os/bottlerocket/pull/3308). I'm not sure if any additional configuration for systemd-logind is needed for Bottlerocket; I'm going to wager no based on what I've briefly read.
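For anyone configuring this by hand, a minimal sketch of what the Bottlerocket userData might look like, assuming Bottlerocket's settings.kubernetes.shutdown-grace-period keys (verify them against your Bottlerocket version):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: bottlerocket
spec:
  amiFamily: Bottlerocket
  userData: |
    # hedged example: Bottlerocket's TOML equivalent of the kubelet shutdown grace periods
    [settings.kubernetes]
    shutdown-grace-period = "45s"
    shutdown-grace-period-for-critical-pods = "15s"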

jmdeal avatar Apr 09 '24 23:04 jmdeal

@levanlongktmt apologies for missing your question, you likely need to handle the kubelet configuration through the nodeadm config rather than bash. This should work:

    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"

    --//
    Content-Type: application/node.eks.awsContent-Type: application/node.eks.aws

    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      kubelet:
        config:
          shutdownGracePeriod: 45s
          shutdownGracePeriodCriticalPods: 15s
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    --//--

jmdeal avatar Apr 10 '24 00:04 jmdeal

@jmdeal Thanks for sharing the example userdata! However, it contains a typo (a duplicated Content-Type header) in the configuration. I can confirm that the following userdata works perfectly.

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023
  (...)
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"

    --//
    Content-Type: application/node.eks.aws

    apiVersion: node.eks.aws/v1alpha1
    kind: NodeConfig
    spec:
      kubelet:
        config:
          shutdownGracePeriod: 45s
          shutdownGracePeriodCriticalPods: 15s
    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    --//

A Karpenter node is configured as expected:

❯ kubectl debug -it node/ip-172-30-20-187.ap-northeast-1.compute.internal --image=alpine

/ # chroot /host
[root@ip-172-30-20-187 /]# cat /etc/systemd/logind.conf | grep InhibitDelayMaxSec
#InhibitDelayMaxSec=5
InhibitDelayMaxSec=45
[root@ip-172-30-20-187 /]# cat /etc/kubernetes/kubelet/config.json | grep shutdownGracePeriod
    "shutdownGracePeriod": "45s",
    "shutdownGracePeriodCriticalPods": "15s",

toVersus avatar Apr 11 '24 02:04 toVersus

After some discussion with the EBS CSI Driver team, I think the real solve here is for the client-side drain behavior to actually wait on the volume detachment since that happens asynchronously. Realistically, it seems like this would also be a problem for Cluster Autoscaler if they don't have logic to implement this as well, so we should think about how to solve this generally across SIG Autoscaling if we can.

We're receptive to someone taking this change on and implementing it. The change would be to wait after we drain here: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/node/termination/controller.go#L86. Realistically, for this to work, we need to make sure to avoid draining the EBS CSI driver DaemonSet, which means that EBS shouldn't tolerate the Karpenter taint so that it sticks around for the entire lifetime of the node. Either that, or it gets drained as part of the termination flow and keeps its pre-stop logic like it has today.
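As a rough shell illustration of what "waiting on detach" means (a sketch only; the actual change would live in the Go termination controller, and the node name is hypothetical):

# Poll until no VolumeAttachment still references the drained node
NODE=ip-10-0-0-1.ec2.internal
while kubectl get volumeattachments -o json \
  | jq -e --arg node "$NODE" '[.items[] | select(.spec.nodeName == $node)] | length > 0' > /dev/null; do
  echo "waiting for volumes to detach from $NODE..."
  sleep 2
done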

jonathan-innis avatar Apr 11 '24 20:04 jonathan-innis

@jonathan-innis That's interesting, and the opposite of what I thought. So the EBS CSI node pod should not tolerate all taints (which is the current default) in order to get drained, which runs the preStop hook as described here.
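For reference, a hedged sketch of what that would look like with the aws-ebs-csi-driver Helm chart, assuming its node.tolerateAllTaints value (check your chart version):

# values.yaml for the aws-ebs-csi-driver chart (hedged example)
node:
  tolerateAllTaints: false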

youwalther65 avatar Apr 12 '24 08:04 youwalther65

I have been thinking about how to determine if a volume has been unmounted. In the current implementation, we wait for the evicted Pods on the node to go into the Succeeded / Failed phase, but this is not sufficient.

A similar discussion can be seen upstream regarding the PVC deletion protection feature at https://github.com/kubernetes/kubernetes/pull/123320#discussion_r1512323689, where it's pointed out that even if a Pod's phase becomes Succeeded / Failed, that does not guarantee its volumes have been detached. There is therefore an attempt to add a new Pod condition called PodTerminated to indicate that the volume has been unmounted, and KEP-4569: Conditions for terminated pod was just created.

Since I think this feature would be useful for Karpenter / Cluster Autoscaler as well, how about mentioning it as a use case?

toVersus avatar Apr 14 '24 15:04 toVersus

@jonathan-innis @jmdeal do you have any plan for this issue?

levanlongktmt avatar May 22 '24 05:05 levanlongktmt

Closing, solved by #1294.

jmdeal avatar Aug 15 '24 22:08 jmdeal