
Failing to delete PV after attach failure

Open verult opened this issue 7 years ago • 12 comments

As I was playing around with topology I ran into an issue: I provisioned a volume and tried to attach it to a pod; the attach failed, which I expected. Then I deleted the pod successfully. The strange part is that when I deleted the PVC, the PV went into the Terminating state but never got deleted.

The PV still has an external-attacher/<driver name> finalizer preventing its deletion. The associated VolumeAttachment object still exists, and it looks like this:

apiVersion: storage.k8s.io/v1beta1
kind: VolumeAttachment
metadata:
  creationTimestamp: 2018-08-16T01:22:00Z
  finalizers:
  - external-attacher/com-google-csi-gcepd
  name: csi-d4935b83fe8fb00a069d523e71d54c504366d49defd4c6cff33fe41f9cfa4b04
  namespace: ""
  resourceVersion: "460172"
  selfLink: /apis/storage.k8s.io/v1beta1/volumeattachments/csi-d4935b83fe8fb00a069d523e71d54c504366d49defd4c6cff33fe41f9cfa4b04
  uid: c6230d88-a0f2-11e8-a4ee-42010a800002
spec:
  attacher: com.google.csi.gcepd
  nodeName: kubernetes-minion-group-11bk
  source:
    persistentVolumeName: pvc-c39b0473-a0f2-11e8-a4ee-42010a800002
status:
  attachError:
    message: PersistentVolume "pvc-c39b0473-a0f2-11e8-a4ee-42010a800002" is marked for deletion
    time: 2018-08-16T01:24:57Z
  attached: false

The code here says the PV's finalizer won't be removed if the VolumeAttachment object still contains the PV name, so maybe that's not updated properly? https://github.com/kubernetes-csi/external-attacher/blob/a4398f84c4bd39f620052982789f565387070cb3/pkg/controller/csi_handler.go#L391
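If it helps to see the shape of that check, here is a minimal sketch of what that line comes down to (not the actual attacher code; the function name is made up for illustration):

```go
package sketch

import (
	storagev1 "k8s.io/api/storage/v1"
)

// canRemovePVFinalizer illustrates the gate described above: the PV's
// external-attacher finalizer may only be removed once no VolumeAttachment
// still names the PV in its source. The helper name is hypothetical.
func canRemovePVFinalizer(pvName string, vas []storagev1.VolumeAttachment) bool {
	for _, va := range vas {
		src := va.Spec.Source.PersistentVolumeName
		if src != nil && *src == pvName {
			// A VolumeAttachment still references this PV, so the
			// finalizer stays and the PV remains Terminating.
			return false
		}
	}
	return true
}
```

So as long as the VolumeAttachment above sticks around, the PV cannot finish deleting.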

@jsafrane @saad-ali

verult avatar Aug 16 '18 01:08 verult

Hmm, that's odd. The VolumeAttachment is marked for deletion (finalizers: [external-attacher/com-google-csi-gcepd]). The question is why the attacher did not remove the finalizer from the VolumeAttachment. This could mean that Detach() did not succeed and the attacher is not sure what state the volume is in. Can you check the attacher logs?
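Roughly, the chain the attacher follows looks like this (just a sketch to illustrate the ordering, not the real code; `detacher` and `syncDeletedVA` are made-up names):

```go
package sketch

import (
	"context"
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
)

// detacher stands in for the CSI ControllerUnpublishVolume call; the
// interface and function below are hypothetical, for illustration only.
type detacher interface {
	Detach(ctx context.Context, volumeID, nodeID string) error
}

// syncDeletedVA sketches the chain described above: once the VolumeAttachment
// carries a deletion timestamp, the attacher needs a successful Detach()
// before it may drop the VA finalizer. If Detach keeps failing, the finalizer
// stays, the VA stays, and therefore the PV finalizer stays too.
func syncDeletedVA(ctx context.Context, d detacher, va *storagev1.VolumeAttachment, volumeID string) (removeFinalizer bool, err error) {
	if va.DeletionTimestamp == nil {
		return false, nil // VA not being deleted; nothing to detach yet
	}
	if err := d.Detach(ctx, volumeID, va.Spec.NodeName); err != nil {
		// Detach failed: the error is recorded in va.Status and the
		// finalizer is kept, so the object cannot disappear.
		return false, fmt.Errorf("failed to detach: %w", err)
	}
	return true, nil
}
```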

jsafrane avatar Aug 16 '18 07:08 jsafrane

@verult, can you please check again if it's still reproducible? I really need the logs.

jsafrane avatar Sep 20 '18 16:09 jsafrane

On it

verult avatar Sep 20 '18 20:09 verult

This seems to trigger consistently when I induce an attach failure.

I0920 22:56:04.761145       1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-1611b9f2-bd28-11e8-b491-42010a800002" (UniqueName: "kubernetes.io/csi/com.google.csi.gcepd^us-central1-b/pvc-1611b9f2-bd28-11e8-b491-42010a800002") from node "kubernetes-minion-group-zvl5"
E0920 22:56:04.767041       1 csi_attacher.go:174] kubernetes.io/csi: attachment for us-central1-b/pvc-1611b9f2-bd28-11e8-b491-42010a800002 failed: rpc error: code = Unknown desc = googleapi: Error 404: The resource 'projects/cxing-gke-dev/zones/us-central1-b/disks/pvc-1611b9f2-bd28-11e8-b491-42010a800002' was not found, notFound
E0920 22:56:04.767182       1 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/csi/com.google.csi.gcepd^us-central1-b/pvc-1611b9f2-bd28-11e8-b491-42010a800002\"" failed. No retries permitted until 2018-09-20 22:56:36.767135147 +0000 UTC m=+6704.714301709 (durationBeforeRetry 32s). Error: "AttachVolume.Attach failed for volume \"pvc-1611b9f2-bd28-11e8-b491-42010a800002\" (UniqueName: \"kubernetes.io/csi/com.google.csi.gcepd^us-central1-b/pvc-1611b9f2-bd28-11e8-b491-42010a800002\") from node \"kubernetes-minion-group-zvl5\" : rpc error: code = Unknown desc = googleapi: Error 404: The resource 'projects/cxing-gke-dev/zones/us-central1-b/disks/pvc-1611b9f2-bd28-11e8-b491-42010a800002' was not found, notFound"
I0920 22:56:04.767207       1 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"web-server", UID:"473ee807-bd28-11e8-b491-42010a800002", APIVersion:"v1", ResourceVersion:"17718", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' AttachVolume.Attach failed for volume "pvc-1611b9f2-bd28-11e8-b491-42010a800002" : rpc error: code = Unknown desc = googleapi: Error 404: The resource 'projects/cxing-gke-dev/zones/us-central1-b/disks/pvc-1611b9f2-bd28-11e8-b491-42010a800002' was not found, notFound
...
I0920 22:56:29.036905       1 pv_controller.go:601] volume "pvc-1611b9f2-bd28-11e8-b491-42010a800002" is released and reclaim policy "Delete" will be executed
I0920 22:56:29.049355       1 pv_controller.go:824] volume "pvc-1611b9f2-bd28-11e8-b491-42010a800002" entered phase "Released"
I0920 22:56:29.054342       1 pv_controller.go:1294] isVolumeReleased[pvc-1611b9f2-bd28-11e8-b491-42010a800002]: volume is released
I0920 22:56:29.421258       1 pv_controller.go:1294] isVolumeReleased[pvc-1611b9f2-bd28-11e8-b491-42010a800002]: volume is released
I0920 22:56:29.432568       1 pv_controller.go:1294] isVolumeReleased[pvc-1611b9f2-bd28-11e8-b491-42010a800002]: volume is released
...

and the "volume is released" message just gets printed repeatedly.

verult avatar Sep 20 '18 23:09 verult

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot avatar Apr 25 '19 17:04 fejta-bot

/remove-lifecycle stale
/lifecycle frozen

We can manually try to repro the issue by making attach fail with mock driver or something /help

msau42 avatar Apr 27 '19 01:04 msau42

@msau42: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/remove-lifecycle stale
/lifecycle frozen

We can manually try to repro the issue by making attach fail with mock driver or something /help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 27 '19 01:04 k8s-ci-robot

Ran into a similar issue here. When we do a helm delete, it deletes the pods and also runs hooks that delete the PVCs. We found that some PVs get stuck in the Terminating status.

kubectl describe pv pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007
Name:            pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: cinder.csi.openstack.org
Finalizers:      [external-attacher/cinder-csi-openstack-org]
StorageClass:    cinder-az-nova
Status:          Terminating (lasts 6d11h)
Claim:           mrnorton/datadir-mongocnf-cmgo-replica-1
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        1Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            cinder.csi.openstack.org
    VolumeHandle:      b12c4cc9-9c60-44b8-81ea-f09dd2b70c7c
    ReadOnly:          false
    VolumeAttributes:  storage.kubernetes.io/csiProvisionerIdentity=1579423079658-8081-cinder.csi.openstack.org
Events:          <none>

The VolumeAttachment for this PV still exists, with a detach error.

[root@bcmt-sandbox3-3c2w2e-1912-1-control-01 ~]# kubectl get volumeattachment |grep pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007
csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc   cinder.csi.openstack.org   pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007   bcmt-sandbox3-3c2w2e-1912-1-worker-02   true   7d6h

[root@bcmt-sandbox3-3c2w2e-1912-1-control-01 ~]# kubectl describe volumeattachment csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc
Name:         csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc
Namespace:
Labels:
Annotations:  csi.alpha.kubernetes.io/node-id: 95fdffb9-035e-41b8-bb88-263c77e94f7a
API Version:  storage.k8s.io/v1
Kind:         VolumeAttachment
Metadata:
  Creation Timestamp:             2020-02-05T20:11:21Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2020-02-05T21:00:30Z
  Finalizers:
    external-attacher/cinder-csi-openstack-org
  Resource Version:  31925915
  Self Link:         /apis/storage.k8s.io/v1/volumeattachments/csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc
  UID:               b8429920-5985-4b95-a4cc-7337a8d87ddb
Spec:
  Attacher:   cinder.csi.openstack.org
  Node Name:  bcmt-sandbox3-3c2w2e-1912-1-worker-02
  Source:
    Persistent Volume Name:  pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007
Status:
  Attached:  true
  Attachment Metadata:
    Device Path:  /dev/vdi
  Detach Error:
    Message:  rpc error: code = NotFound desc = ControllerUnpublishVolume Volume not found
    Time:     2020-02-13T02:14:20Z
Events:

From the csi-attacher log, the detach failed at 21:00:35.739 on 02-05. From the csi-provisioner log, the PV was deleted at 21:00:35.722 on 02-05. That means the PV was deleted before the detach had succeeded.

I0205 21:00:30.785723 1 csi_handler.go:140] Starting detach operation for "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc"
I0205 21:00:30.785886 1 csi_handler.go:147] Detaching "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc"
I0205 21:00:35.719744 1 csi_handler.go:480] CSIHandler: processing PV "pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007": VA "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc" found
I0205 21:00:35.729002 1 csi_handler.go:480] CSIHandler: processing PV "pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007": VA "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc" found
I0205 21:00:35.739302 1 csi_handler.go:428] Saving detach error to "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc"
I0205 21:00:35.756415 1 controller.go:141] Ignoring VolumeAttachment "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc" change
I0205 21:00:35.756663 1 csi_handler.go:439] Saved detach error to "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc"
I0205 21:00:35.756689 1 csi_handler.go:99] Error processing "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc": failed to detach: rpc error: code = NotFound desc = ControllerUnpublishVolume Volume not found
I0205 21:00:36.756962 1 controller.go:175] Started VA processing "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc"

I0205 21:00:35.722061 1 controller.go:1385] delete "pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007": persistentvolume deleted
I0205 21:00:35.722111 1 controller.go:1387] delete "pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007": succeeded

I guess that when we did the helm delete, it first ran "kubectl delete pod", which triggered the volume detach, but detaching the volume takes quite some time. The post-delete job then ran "kubectl delete pvc", which triggered deletion of the volume. So the volume was deleted while the detach had still not returned success.
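One possible workaround on our side would be to have the post-delete hook wait until no VolumeAttachment references the PV before deleting the PVC, i.e. wait for the detach to actually finish. A rough client-go sketch (the helper name and polling interval are just my assumption, not anything the chart currently does):

```go
package sketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDetach polls until no VolumeAttachment still references the given
// PV, i.e. until the detach has completed and it is safe to delete the PVC.
func waitForDetach(ctx context.Context, cs kubernetes.Interface, pvName string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		vas, err := cs.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
		if err != nil {
			return false, err
		}
		for _, va := range vas.Items {
			if va.Spec.Source.PersistentVolumeName != nil &&
				*va.Spec.Source.PersistentVolumeName == pvName {
				return false, nil // still attached or detaching
			}
		}
		return true, nil // no VA left; safe to delete the PVC now
	})
}
```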

celiawa avatar Feb 13 '20 06:02 celiawa

@msau42, any plans to fix this issue?

celiawa avatar Feb 20 '20 02:02 celiawa

I think the problem is in the cinder driver where it cannot handle the case where detach is called on a volume that has been deleted.
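The CSI spec expects ControllerUnpublishVolume to be idempotent, so a volume that no longer exists should be treated as already detached. A rough sketch of that shape, using a made-up cloudClient interface rather than the cinder driver's real API:

```go
package sketch

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// cloudClient stands in for the driver's cloud API; DetachDisk and
// IsNotFound are hypothetical names used only for this sketch.
type cloudClient interface {
	DetachDisk(ctx context.Context, volumeID, nodeID string) error
	IsNotFound(err error) bool
}

// controllerUnpublish sketches the behaviour the CSI spec asks for: if the
// backing volume is gone, treat the detach as already done and return OK
// instead of NotFound, so the external-attacher can finish and drop its
// finalizers on the VolumeAttachment and the PV.
func controllerUnpublish(ctx context.Context, cloud cloudClient, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error) {
	err := cloud.DetachDisk(ctx, req.GetVolumeId(), req.GetNodeId())
	if err != nil && cloud.IsNotFound(err) {
		// Volume no longer exists: report success (idempotent detach).
		return &csi.ControllerUnpublishVolumeResponse{}, nil
	}
	if err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	return &csi.ControllerUnpublishVolumeResponse{}, nil
}
```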

msau42 avatar Feb 20 '20 02:02 msau42

Thanks @msau42, I'll open a ticket against the cinder driver.

celiawa avatar Feb 20 '20 09:02 celiawa

Any update on this???

sunil-barhate avatar Aug 02 '21 05:08 sunil-barhate

This looks like a CSI-driver-specific issue (Cinder) based on the previous comments. Closing this issue on that basis.

humblec avatar Nov 10 '22 14:11 humblec