aws-ebs-csi-driver

Persistent Volumes with reclaimPolicy set to delete are not getting deleted

harshaisgud opened this issue 2 years ago • 11 comments

/kind bug

What happened?
Persistent Volumes with reclaimPolicy set to Delete are not getting deleted.

What you expected to happen?
Persistent Volumes with reclaimPolicy set to Delete should get deleted.

How to reproduce it (as minimally and precisely as possible)?

  1) Create a RunnerSet, which is based on a StatefulSet (a minimal manifest sketch follows the environment details below).
  2) Observe that the PVs are not getting reclaimed.

Anything else we need to know?
Errors from the csi-provisioner container:

E0213 14:24:47.649540 1 controller.go:1481] delete "pvc-fb37fa53-9249-44ed-9d69-19e00767e3ae": volume deletion failed: persistentvolume pvc-fb37fa53-9249-44ed-9d69-19e00767e3ae is still attached to node ip-10-10-2-245.eu-central-1.compute.internal

Environment

  • Kubernetes version (use kubectl version): 1.23
  • Driver version: v1.15.0-eksbuild.1
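
For reference, a minimal sketch of this kind of setup. The names, image, and sizes below are placeholders, not the actual RunnerSet manifest; only the Delete reclaim policy on the StorageClass and the volumeClaimTemplates are the relevant parts.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-delete                    # placeholder name
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete                 # PVs from this class should be deleted along with their PVC
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-runner                # stands in for the RunnerSet-managed StatefulSet
spec:
  serviceName: example-runner
  replicas: 1
  selector:
    matchLabels:
      app: example-runner
  template:
    metadata:
      labels:
        app: example-runner
    spec:
      containers:
        - name: runner
          image: busybox              # placeholder image
          command: ["sleep", "infinity"]
          volumeMounts:
            - name: work
              mountPath: /work
  volumeClaimTemplates:
    - metadata:
        name: work
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-delete
        resources:
          requests:
            storage: 1Gi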

harshaisgud · Feb 13 '23

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · May 14 '23

Same issue here 👍

NilsGriebner · May 31 '23

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Jun 30 '23

Same issue here 👍

harshal-shah · Aug 22 '23

/remove-lifecycle rotten

AndrewSirenko · Sep 27 '23

Seeing the same with Kubernetes v1.27

h4ck3rk3y · Dec 11 '23

I had to add the following to get the creation + deletion to work

ec2:CreateVolume // creation
ec2:CreateTags // creation
ec2:AttachVolume // creation
ec2:DetachVolume // deletion
ec2:DeleteVolume // deletion

You are probably missing the last two actions. (A sketch of an equivalent IAM policy statement follows.)
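
A minimal sketch of an IAM policy statement granting those actions, assuming it is attached to whatever identity the EBS CSI controller uses (for example an IRSA role); the broad Resource is for illustration only and should normally be scoped down:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EbsCsiVolumeLifecycle",
      "Effect": "Allow",
      "Action": [
        "ec2:CreateVolume",
        "ec2:CreateTags",
        "ec2:AttachVolume",
        "ec2:DetachVolume",
        "ec2:DeleteVolume"
      ],
      "Resource": "*"
    }
  ]
}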

h4ck3rk3y · Dec 14 '23

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Mar 13 '24

/remove-lifecycle stale

AndrewSirenko · Mar 18 '24

I had to add the following to get the creation + deletion to work

ec2:CreateVolume // creation
ec2:CreateTags // creation
ec2:AttachVolume // creation
ec2:DetachVolume // deletion
ec2:DeleteVolume // deletion

You are probably missing the last two actions

It does not work with "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy" either.

Driver image: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/aws-ebs-csi-driver:v1.26.1
Kubernetes version: v1.28.6-eks-508b6b3
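
If the managed policy is attached but detach/delete still fails, it is worth confirming the controller is actually assuming that role. A sketch of the usual checks, assuming the standard EKS add-on layout (controller in kube-system using the ebs-csi-controller-sa service account); adjust names to your install:

kubectl -n kube-system get serviceaccount ebs-csi-controller-sa -o yaml       # look for the eks.amazonaws.com/role-arn annotation
kubectl -n kube-system get pods -l app=ebs-csi-controller                     # controller pods must have restarted after any role change
aws iam list-attached-role-policies --role-name <role-name-from-annotation>   # AmazonEBSCSIDriverPolicy should appear here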

sydorovdmytro · Apr 02 '24

We are experiencing the same issue; it appears to be a race condition.

Here are the logs of the csi-provisioner:

I0517 11:22:47.641553       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-6a2a97fe-ded6-48c9-9f57-9fa7624c72e7", UID:"5aaae63a-4cca-4043-aea8-bcb53d578704", APIVersion:"v1", ResourceVersion:"143899969", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' persistentvolume pvc-6a2a97fe-ded6-48c9-9f57-9fa7624c72e7 is still attached to node ip-10-156-74-114.eu-central-1.compute.internal

and the logs of the ebs-plugin:

I0517 11:22:52.094812       1 controller.go:471] "ControllerUnpublishVolume: detaching" volumeID="vol-0a570aeae7332b2e7" nodeID="i-075b7bf75732e92f7"
I0517 11:22:53.578965       1 cloud.go:862] "Waiting for volume state" volumeID="vol-0a570aeae7332b2e7" actual="detaching" desired="detached"
I0517 11:22:54.699412       1 cloud.go:862] "Waiting for volume state" volumeID="vol-0a570aeae7332b2e7" actual="detaching" desired="detached"
I0517 11:22:57.610889       1 cloud.go:862] "Waiting for volume state" volumeID="vol-0a570aeae7332b2e7" actual="detaching" desired="detached"
I0517 11:23:01.976813       1 controller.go:479] "ControllerUnpublishVolume: detached" volumeID="vol-0a570aeae7332b2e7" nodeID="i-075b7bf75732e92f7"

where pvc-6a2a97fe-ded6-48c9-9f57-9fa7624c72e7 corresponds to vol-0a570aeae7332b2e7. The deletion attempt fails before detaching has finished.

Kubernetes version: 1.26.1
EBS driver version: 1.26.1
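
A sketch of how the PV can be correlated with the EBS volume and the detach watched while this happens, assuming kubectl and the AWS CLI point at the same cluster and account (identifiers taken from the logs above):

kubectl get pv pvc-6a2a97fe-ded6-48c9-9f57-9fa7624c72e7 -o jsonpath='{.spec.csi.volumeHandle}'        # maps the PV to vol-0a570aeae7332b2e7
kubectl get volumeattachment | grep pvc-6a2a97fe-ded6-48c9-9f57-9fa7624c72e7                          # a remaining VolumeAttachment means detach has not finished
aws ec2 describe-volumes --volume-ids vol-0a570aeae7332b2e7 --query 'Volumes[0].Attachments[].State'  # goes from "detaching" to empty before deletion can succeed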

andreaskapfer · May 17 '24

/close

The original issue appears to be due to a bug in the GitHub Actions Kubernetes runner: https://github.com/actions/actions-runner-controller/issues/2266

Volumes must not be attached to any node prior to deletion, which means they cannot be in use by any pod. If you delete a volume before it is detached, you will receive "is still attached to node" events until the corresponding pod(s) are deleted and the volume finishes detaching.

Note also that after a pod terminates, you may see this event for a short period if the volume is deleted immediately, while the volume is still detaching. This is because volumes are not detached until after the pod terminates, and detaching typically takes EBS several seconds to perform. This is expected, and deletion should proceed shortly after the detach succeeds.

If you are able to reproduce this error in a state where no pod is using the volume and the volume stays stuck in the deleting state with an "is still attached to node" event for an extended period of time, please open a new issue with the reproduction steps (strongly preferred if possible) or relevant log entries; one way to collect them is sketched below.
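
A sketch of commands for collecting that evidence, assuming the standard controller Deployment name ebs-csi-controller in kube-system and a placeholder <pvc-name>; adjust to your install:

kubectl describe pvc <pvc-name>                                              # the Used By field shows any pod still mounting the claim
kubectl get volumeattachment                                                 # entries referencing the PV mean it is still attached to a node
kubectl -n kube-system logs deployment/ebs-csi-controller -c csi-provisioner --tail=200
kubectl -n kube-system logs deployment/ebs-csi-controller -c ebs-plugin --tail=200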

ConnorJC3 · Aug 06 '24

@ConnorJC3: Closing this issue.

In response to this:

/close


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · Aug 06 '24