
Pods stuck terminating with migrated volumes

Open shanecunningham opened this issue 4 years ago • 22 comments

Hello, I'm having an issue deleting pods that have migrated PVs. After installing aws-ebs-csi-driver (0.7.1) and enabling CSIMigrationAWS, new PVs/pods created from a CSI storageclass work fine: they can be restarted and terminated normally. I can also see that all PVs were migrated to CSI and show as attached via volumeattachment, and resizes on those migrated PVs work fine. However, when I delete a pod with a migrated PV, it hangs in Terminating status. The only workaround I've found is to terminate the instance and let the ASG spin up a new one; the pod then gets rescheduled and can mount the volume. This is a cluster created on EC2 with kubeadm/v1.18.6. This seems to be the primary error message looping in the worker node kubelet logs:

Jan 05 16:08:06 ip-<nodeip> kubelet[10196]: E0105 16:08:06.435390   10196 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-<volid> podName: nodeName:}" failed. No retries permitted until 2021-01-05 16:08:07.435350263 +0000 UTC m=+1793425.618460239 (durationBeforeRetry 1s). Error: "GetDeviceMountRefs check failed for volume \"pvc-<pvcid>\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-<volid>\") on node \"ip-<nodeip>.<regionid>.compute.internal\" : The device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-<pvcid>/globalmount\" is still mounted by other references [/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/<regionid>/vol-<volid> /var/lib/kubelet/pods/c8a81ef8-ea76-4937-8b29-6750dd5c6574/volumes/kubernetes.io~aws-ebs/pvc-<pvcid>]"

node ebs-plugin logs

I0105 16:08:05.828673       1 node.go:337] NodeUnpublishVolume: unmounting /var/lib/kubelet/pods/c8a81ef8-ea76-4937-8b29-6750dd5c6574/volumes/kubernetes.io~csi/pvc-<pvcid>/mount
I0105 16:08:05.831799       1 mount_linux.go:202] Unmounting /var/lib/kubelet/pods/c8a81ef8-ea76-4937-8b29-6750dd5c6574/volumes/kubernetes.io~csi/pvc-<pvcid>/mount

No other useful logs follow that unmount, so it seems like it hangs there.

The volume is not attached to the node (confirmed through the AWS console), but it still has mount points:

root@ip-10-nodeip:~# lsblk
NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0  40G  0 disk
└─nvme0n1p1 259:1    0  40G  0 part /

root@ip-10-nodeip:~# grep pvc /proc/mounts
/dev/nvme1n1 /var/lib/kubelet/pods/c8a81ef8-ea76-4937-8b29-6750dd5c6574/volumes/kubernetes.io~aws-ebs/pvc-<pvcid> ext4 rw,relatime 0 0
/dev/nvme1n1 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-<pvcid>/globalmount ext4 rw,relatime 0 0
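(For completeness, the control-plane side can be cross-checked like this; a sketch using the same redacted IDs as above:)

# does the external-attacher still hold a VolumeAttachment for this volume?
kubectl get volumeattachment | grep vol-<volid>

# does the PV carry the pv.kubernetes.io/migrated-to annotation?
kubectl get pv pvc-<pvcid> -o jsonpath='{.metadata.annotations}'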

shanecunningham avatar Jan 06 '21 01:01 shanecunningham

When you say new pods work fine, do you mean pods that explicitly specify CSI, i.e. PVs that say "spec.csi" and are not going through migration?

And the ones that don't work are the ones that say "spec.awsebs" and are going through migration?

Could you share the PV volume spec (you can redact region/volid info if you wish)? IIRC there are a bunch of accepted formats, like aws://vol-id, and I want to check which format you are using so I can use the same one to reproduce the issue.
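(If it's easier than pasting the whole spec, something like this should show just the ID format; a sketch, assuming kubectl access to the cluster:)

# print only the in-tree volumeID so we can see its format (aws://<az>/vol-... vs plain vol-...)
kubectl get pv pvc-<pvcid> -o jsonpath='{.spec.awsElasticBlockStore.volumeID}'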

It's just a suspicion I have, because the code derives the mount paths and such from the volume spec, like this one:

/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/<regionid>/vol-<volid>

Besides that, just looking at the logs ~~off the top of my head~~, the fact that the driver is trying to unmount /var/lib/kubelet/pods/c8a81ef8-ea76-4937-8b29-6750dd5c6574/volumes/kubernetes.io~csi/pvc-<pvcid>/mount while kubelet says the volume is mounted at /var/lib/kubelet/pods/c8a81ef8-ea76-4937-8b29-6750dd5c6574/volumes/kubernetes.io~aws-ebs/pvc-<pvcid> (at first glance similar, but NOT the same path) is very suspect.

wongma7 avatar Jan 07 '21 23:01 wongma7

When you say new pods work fine, do you mean pods that explicitly specify CSI, i.e. PVs that say "spec.csi" and are not going through migration?

Correct, new pods that are using a PVC created from a CSI storageclass that have not gone through migration can be created/deleted fine.

And the ones that don't work are the ones that say "spec.awsebs" and are going through migration?

Could you share the PV volume spec (you can redact region/volid info if you wish)? IIRC there are a bunch of accepted formats, like aws://vol-id, and I want to check which format you are using so I can use the same one to reproduce the issue.

This is from a StatefulSet; when the pod is deleted, it hangs in Terminating status.

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubernetes.io/createdby: aws-ebs-dynamic-provisioner
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/migrated-to: ebs.csi.aws.com
    pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
  creationTimestamp: "2020-03-25T17:27:47Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-attacher/ebs-csi-aws-com
  labels:
    failure-domain.beta.kubernetes.io/region: <region>
    failure-domain.beta.kubernetes.io/zone: <region>a
  name: pvc-<pvcid>
  resourceVersion: "151990241"
  selfLink: /api/v1/persistentvolumes/pvc-<pvcid>
  uid: 17c467eb-cbcc-4339-87c9-3de48e09d93f
spec:
  accessModes:
  - ReadWriteOnce
  awsElasticBlockStore:
    fsType: ext4
    volumeID: aws://<region>a/vol-<volid>
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: alertmanager-monitoring-prometheus-oper-alertmanager-db-alertmanager-monitoring-prometheus-oper-alertmanager-0
    namespace: monitoring
    resourceVersion: "1623752"
    uid: <pvcid>
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - <region>a
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - <region>
  persistentVolumeReclaimPolicy: Delete
  storageClassName: gp2
  volumeMode: Filesystem

shanecunningham avatar Jan 08 '21 02:01 shanecunningham

@appprod0 did you drain the node before applying the CSIMigrationAWS feature gate and restarting kubelet? We had to do that for the migration flag to work, otherwise we got the exact same error as you. I don't think you can just turn on the feature gate, delete the existing pod(s), and expect it to work. Hence why the new node from your ASG works.
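Roughly the per-node sequence we used (a sketch for a systemd/kubeadm setup; the exact kubelet config location and drain flags depend on your environment and kubectl version):

# move workloads (and their in-tree mounts) off the node first
kubectl drain ip-<nodeip> --ignore-daemonsets --delete-emptydir-data   # --delete-local-data on older kubectl

# enable CSIMigrationAWS in the kubelet feature gates (kubelet config or extra args), then restart kubelet
sudo systemctl restart kubelet

# let workloads come back once the node is Ready again
kubectl uncordon ip-<nodeip>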

mitchellhuang avatar Mar 23 '21 05:03 mitchellhuang

@wongma7 ,

I think this is reproducible using the following test:

In-tree Volumes [Driver: aws] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is gracefully deleted while kubelet is down [Disruptive][Slow][LinuxOnly] [Serial] [Suite:k8s]

Taking the following steps:

  1. Setup a cluster on AWS
  2. Enable csi migration
  3. Run the in-tree e2e test (see the sketch after the log below)
  4. At this point the test fails with errors similar to those above. The Pod is stuck in a Terminating state, the PVC has a deletion timestamp, and we see the following continually in the node's journalctl:
May 12 14:05:09 ip-10-0-165-52 hyperkube[51601]: I0512 14:05:09.714532   51601 reconciler.go:312] "operationExecutor.UnmountDevice started for volume \"pvc-8a72162b-db8b-4499-b9e4-623fe22241f5\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0cdb1a8eaed31e57a\") on node \"ip-10-0-165-52.ap-northeast-1.compute.internal\" "
May 12 14:05:09 ip-10-0-165-52 hyperkube[51601]: E0512 14:05:09.716749   51601 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-0cdb1a8eaed31e57a podName: nodeName:}" failed. No retries permitted until 2021-05-12 14:07:11.716722698 +0000 UTC m=+23113.217236926 (durationBeforeRetry 2m2s). Error: "GetDeviceMountRefs check failed for volume \"pvc-8a72162b-db8b-4499-b9e4-623fe22241f5\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0cdb1a8eaed31e57a\") on node \"ip-10-0-165-52.ap-northeast-1.compute.internal\" : the device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8a72162b-db8b-4499-b9e4-623fe22241f5/globalmount\" is still mounted by other references [/var/lib/kubelet/pods/bcc53e0c-3c2d-453a-aa13-7e81f67c1c6e/volumes/kubernetes.io~csi/pvc-8a72162b-db8b-4499-b9e4-623fe22241f5/mount]"
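(For reference, this is roughly how we invoke just that test; the e2e.test binary is built from the kubernetes repo and the flags below are approximate:)

# run only the disruptive subPath test against an existing AWS cluster
./e2e.test --provider=aws --kubeconfig="$KUBECONFIG" \
  --ginkgo.focus='subPath should unmount if pod is gracefully deleted while kubelet is down'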

I've also confirmed that this is the same path that's on the node:

# mount | grep pvc-8a
/dev/nvme1n1 on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8a72162b-db8b-4499-b9e4-623fe22241f5/globalmount type ext4 (rw,relatime,seclabel)
/dev/nvme1n1 on /var/lib/kubelet/pods/bcc53e0c-3c2d-453a-aa13-7e81f67c1c6e/volumes/kubernetes.io~csi/pvc-8a72162b-db8b-4499-b9e4-623fe22241f5/mount type ext4 (rw,relatime,seclabel)

In this case we're seeing the path as kubernetes.io~csi, but the PV's spec shows:

PV:
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/provisioned-by: ebs.csi.aws.com
    finalizers:
    - kubernetes.io/pv-protection
    - external-attacher/ebs-csi-aws-com
    labels:
      topology.kubernetes.io/region: ap-northeast-1
      topology.kubernetes.io/zone: ap-northeast-1c
    name: pvc-8a72162b-db8b-4499-b9e4-623fe22241f5
    [...]
  spec:
    accessModes:
    - ReadWriteOnce
    awsElasticBlockStore:
      fsType: ext4
      volumeID: vol-0cdb1a8eaed31e57a

So here the volume is reported as awsElasticBlockStore in the PV's spec, but it was provisioned by the CSI driver. Here's an excerpt from the external-provisioner:

I0512 07:40:40.719261       1 controller.go:839] successfully created PV pvc-8a72162b-db8b-4499-b9e4-623fe22241f5 for PVC aws4hs29 and csi volume name vol-0cdb1a8eaed31e57a
I0512 07:40:40.719358       1 controller.go:1439] provision "e2e-provisioning-5083/aws4hs29" class "e2e-provisioning-5083n5vpr": volume "pvc-8a72162b-db8b-4499-b9e4-623fe22241f5" provisioned
I0512 07:40:40.719383       1 controller.go:1456] provision "e2e-provisioning-5083/aws4hs29" class "e2e-provisioning-5083n5vpr": succeeded
I0512 07:40:40.738914       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"e2e-provisioning-5083", Name:"aws4hs29", UID:"8a72162b-db8b-4499-b9e4-623fe22241f5", APIVersion:"v1", ResourceVersion:"156954", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-8a72162b-db8b-4499-b9e4-623fe22241f5

This failure is consistently reproducible when using the aforementioned test. I've added it here, as it appears to be very similar (if not the same). Would you have any thoughts on this one?

huffmanca avatar May 12 '21 17:05 huffmanca

In the original issue the paths were different, and it looks like that can happen if the feature gate is enabled without draining.

In this one the paths are the same, and actually I think the path /var/lib/kubelet/pods/bcc53e0c-3c2d-453a-aa13-7e81f67c1c6e/volumes/kubernetes.io~csi/pvc-8a72162b-db8b-4499-b9e4-623fe22241f5/mount looks correct; even in the migration scenario the path should be kubernetes.io~csi.

However, I'm not that familiar with this test case: what is responsible for unmounting the path if kubelet is down?

wongma7 avatar May 12 '21 19:05 wongma7

Just looking through the test I don't see any explicit unmounts. It seems that the primary unmounting is expected in test/e2e/storage/utils/utils.go in TestVolumeUnmountsFromDeletedPodWithForceOption. Perhaps simply from the Pod deletion while the kubelet is down?

huffmanca avatar May 13 '21 20:05 huffmanca

OK, I checked that test: it stops kubelet, deletes the pod, then starts kubelet. So I misunderstood; I thought there was an expectation that the volume gets unmounted while kubelet is down, but it's actually testing that the volume gets unmounted after kubelet comes back up. So yes, if we are failing that, it's a real bug. I think it has to do with kubelet's volume reconstruction logic (i.e. reconstructing its cache of mounted volumes from the volume paths on disk), so I will have to reproduce it and check the kubelet logs to be sure what's going on.
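(For anyone reproducing by hand, the test essentially does the following on the node hosting the pod; a sketch, assuming a systemd-managed kubelet and a placeholder <pod-name>:)

sudo systemctl stop kubelet                  # simulate kubelet being down
kubectl delete pod <pod-name> --wait=false   # gracefully delete the pod while kubelet is down
sudo systemctl start kubelet                 # kubelet should reconstruct the mounts and clean them up
findmnt | grep pvc-                          # the globalmount and pod mount should eventually disappear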

wongma7 avatar May 14 '21 00:05 wongma7

It started happening to me after cluster upgrade from 1.19.x to 1.20.x

zerkms avatar May 27 '21 09:05 zerkms

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 25 '21 10:08 k8s-triage-robot

/remove-lifecycle stale

zerkms avatar Aug 26 '21 01:08 zerkms

It started happening to me after cluster upgrade from 1.19.x to 1.20.x

I believe I'm seeing the same issue after upgrading from 1.19.6 to 1.20.9 too, although I do not believe the volumes in question have been migrated. As far as I can tell, the kubelet asks the CSI driver to unmount the volume, the driver logs that it did, but the volume is still mounted. If I manually unmount the volume from the Linux host, kubelet will finish Terminating the pod; or, if I restart kubelet, it asks the CSI driver to unmount the volume again, and that appears to always work the second time.
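(The manual workaround looks roughly like this on the node; a sketch, with the globalmount path taken from the kubelet GetDeviceMountRefs error:)

# unmount the stale global mount by hand so kubelet can finish terminating the pod
sudo umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount

# or just restart kubelet so it asks the driver to unmount again (works on the second attempt for us)
sudo systemctl restart kubelet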

I did notice a general block device error for that volume in the system logs around that time too.

Sep 10 14:59:24 kernel: [805556.565787] EXT4-fs warning (device nvme1n1): htree_dirblock_to_tree:994: inode #2: lblock 0: comm ls: error -5 reading directory block
Sep 10 14:59:24  kernel: EXT4-fs warning (device nvme1n1): htree_dirblock_to_tree:994: inode #2: lblock 0: comm ls: error -5 reading directory block

gregsidelinger avatar Sep 10 '21 17:09 gregsidelinger

@gregsidelinger This issue might also be relevant to you https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1027

I'm able to replicate this on all volumes, not just migrated or ephemeral ones. There's a fix in v1.2.1 that is supposed to resolve this, but after using that version I can still reproduce pods stuck terminating.

shanecunningham avatar Sep 10 '21 22:09 shanecunningham

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 09 '21 23:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 08 '22 23:01 k8s-triage-robot

/remove-lifecycle rotten

I have not seen this recently, is it just luck or in 1.20 it's not a problem anymore?

zerkms avatar Jan 09 '22 20:01 zerkms

I see there was some fix in aws-ebs-csi driver in 1.4.0 version - https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1082. @zerkms Have you tried to upgrade it, because I also see the issue with the terminating Pods, but I haven't upgraded the ebs driver yet.

yhrunyk avatar Jan 10 '22 15:01 yhrunyk

@yhrunyk I have not upgraded yet, and I don't see the problem on 1.21 (it was a typo in my previous message, this one is right)

zerkms avatar Jan 10 '22 22:01 zerkms

Seeing the same issue on our cluster. We are on EKS v1.20 and using the AWS EBS CSI driver 1.5.0.

RicardsRikmanis avatar Mar 23 '22 11:03 RicardsRikmanis

Same issue with EKS v1.21; the volume has been terminating for hours. I'm not sure which EBS CSI driver version we're using or how to check, since we don't have it installed as an addon. I ended up deleting the pod that used the volume, and then the PVC delete worked.
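(One way to check the driver version when it isn't installed as an EKS addon is to look at the image tags of the running driver pods; a sketch, assuming a kube-system install with the chart's default app.kubernetes.io/name label:)

# the driver version is the tag on the aws-ebs-csi-driver image of the controller/node pods
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'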

caleb15 avatar May 16 '22 23:05 caleb15

I'm seeing this same thing on EKS 1.20 without much luck resolving. Has anyone on this issue solved this yet?

gkrizek avatar Jun 17 '22 04:06 gkrizek

@wongma7 Have you revisited this at all? I'm having a hard time determining whether it's kubelet or some EBS CSI component. Our most recent error message:

Jun 17 02:07:14 ip-172-18-53-38.hostname.internal kubelet[9067]: E0617 02:07:14.950549    9067 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-00e809exxxxxxxxc podName: nodeName:}" failed. No retries permitted until 2022-06-17 02:07:15.45051353 +0000 UTC m=+1259465.346839830 (durationBeforeRetry 500ms). Error: "GetDeviceMountRefs check failed for volume \"pvc-b121768a-dea1-4e27-8502-b04597eae840\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-00e809exxxxxxxx\") on node \"ip-1-2-3-4.us-west-2.compute.internal\" : the device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-b121768a-dea1-4e27-8502-b04597eae840/globalmount\" is still mounted by other references [/var/lib/kubelet/pods/18985772-87b1-4b0e-87fb-b66e83c25619/volumes/kubernetes.io~csi/pvc-b121768a-dea1-4e27-8502-b04597eae840/mount]"
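(To narrow down which side is failing, it can help to compare kubelet's view with the node plugin's; a sketch, assuming the default app=ebs-csi-node label and ebs-plugin container name from the driver manifests:)

# did the node plugin receive NodeUnstageVolume / NodeUnpublishVolume for this volume, and did they return?
kubectl logs -n kube-system -l app=ebs-csi-node -c ebs-plugin --since=15m | grep -i -e unstage -e unpublish

# on the node itself: what still references the globalmount path?
findmnt | grep pvc-b121768a-dea1-4e27-8502-b04597eae840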

gkrizek avatar Jun 17 '22 12:06 gkrizek

In my case, the pod is stuck in Init state after migrating from the in-tree plugin to EBS CSI. I followed this doc - https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/install.md#migrating-from-in-tree-ebs-plugin - and performed the same steps, but got these logs:

Sep 05 10:16:42 ip-10-222-101-237.fcinternal.net kubelet[8233]: E0905 10:16:42.916187 8233 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-02378debf01dbaee9 podName: nodeName:}" failed. No retries permitted until 2022-09-05 10:18:44.916161378 +0000 UTC m=+7250.080135420 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume "pvc-3eac9d49-b875-4258-849b-7ca94f78014c" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-02378debf01dbaee9") pod "elasticsearch-ha-master-sts-0" (UID: "839f19da-c301-47c0-9970-909ddfad92e4") "
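(This error is about attach rather than unmount; a sketch of the first things I would check, assuming kubectl access. <node-name> is the node the pod is scheduled to.)

# is there a VolumeAttachment for the volume, and has the attacher marked it attached?
kubectl get volumeattachment | grep vol-02378debf01dbaee9

# is ebs.csi.aws.com registered on the node the pod is scheduled to?
kubectl get csinode <node-name> -o yaml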

Has anyone found a solution?

aashishgoyal246 avatar Sep 05 '22 10:09 aashishgoyal246

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 04 '22 10:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 03 '23 11:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Feb 02 '23 12:02 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 02 '23 12:02 k8s-ci-robot