Failing to delete PV after attach failure
As I was playing around with topology I ran into an issue: I provisioned a volume and tried to attach it to a pod, and the attach failed, which I expected. Then I deleted the pod successfully. The strange part is that when I deleted the PVC, the PV went into the Terminating state but never got deleted.
The PV still has an external-attacher/<driver name> finalizer preventing its deletion. The associated VolumeAttachment object still exists, and it looks like this:
apiVersion: storage.k8s.io/v1beta1
kind: VolumeAttachment
metadata:
  creationTimestamp: 2018-08-16T01:22:00Z
  finalizers:
  - external-attacher/com-google-csi-gcepd
  name: csi-d4935b83fe8fb00a069d523e71d54c504366d49defd4c6cff33fe41f9cfa4b04
  namespace: ""
  resourceVersion: "460172"
  selfLink: /apis/storage.k8s.io/v1beta1/volumeattachments/csi-d4935b83fe8fb00a069d523e71d54c504366d49defd4c6cff33fe41f9cfa4b04
  uid: c6230d88-a0f2-11e8-a4ee-42010a800002
spec:
  attacher: com.google.csi.gcepd
  nodeName: kubernetes-minion-group-11bk
  source:
    persistentVolumeName: pvc-c39b0473-a0f2-11e8-a4ee-42010a800002
status:
  attachError:
    message: PersistentVolume "pvc-c39b0473-a0f2-11e8-a4ee-42010a800002" is marked for deletion
    time: 2018-08-16T01:24:57Z
  attached: false
The code here says the PV's finalizer won't be removed if the VolumeAttachment object still contains the PV name, so maybe that's not updated properly? https://github.com/kubernetes-csi/external-attacher/blob/a4398f84c4bd39f620052982789f565387070cb3/pkg/controller/csi_handler.go#L391
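For anyone skimming, here is a simplified paraphrase of that check (not the actual external-attacher source; the helper name is made up for illustration): the attacher only removes its finalizer from a PV once no VolumeAttachment references that PV by name, so a VA that is stuck with an error keeps the PV in Terminating.

package attacher

import storagev1beta1 "k8s.io/api/storage/v1beta1"

// pvStillReferenced is an illustrative stand-in for the linked check in
// csi_handler.go: the external-attacher keeps its finalizer on the PV as
// long as any VolumeAttachment still points at the PV by name.
func pvStillReferenced(pvName string, vas []storagev1beta1.VolumeAttachment) bool {
	for _, va := range vas {
		src := va.Spec.Source.PersistentVolumeName
		if src != nil && *src == pvName {
			// Even a VA that only carries an attach/detach error counts,
			// which is why the PV above never leaves Terminating.
			return true
		}
	}
	return false
}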
@jsafrane @saad-ali
Hmm, that's odd. VolumeAttachment is marked for deletion (finalizers: [external-attacher/com-google-csi-gcepd]). Question is why the attacher did not remove the finalizer in VolumeAttachment. This could mean that Detach() did not succeed and the attacher is not sure in what state the volume is. Can you check the attacher logs?
@verult, can you please check again if it's still reproducible? I really need the logs.
On it
This seems to trigger consistently when I induce an attach failure.
I0920 22:56:04.761145 1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-1611b9f2-bd28-11e8-b491-42010a800002" (UniqueName: "kubernetes.io/csi/com.google.csi.gcepd^us-central1-b/pvc-1611b9f2-bd28-11e8-b491-42010a800002") from node "kubernetes-minion-group-zvl5"
E0920 22:56:04.767041 1 csi_attacher.go:174] kubernetes.io/csi: attachment for us-central1-b/pvc-1611b9f2-bd28-11e8-b491-42010a800002 failed: rpc error: code = Unknown desc = googleapi: Error 404: The resource 'projects/cxing-gke-dev/zones/us-central1-b/disks/pvc-1611b9f2-bd28-11e8-b491-42010a800002' was not found, notFound
E0920 22:56:04.767182 1 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/csi/com.google.csi.gcepd^us-central1-b/pvc-1611b9f2-bd28-11e8-b491-42010a800002\"" failed. No retries permitted until 2018-09-20 22:56:36.767135147 +0000 UTC m=+6704.714301709 (durationBeforeRetry 32s). Error: "AttachVolume.Attach failed for volume \"pvc-1611b9f2-bd28-11e8-b491-42010a800002\" (UniqueName: \"kubernetes.io/csi/com.google.csi.gcepd^us-central1-b/pvc-1611b9f2-bd28-11e8-b491-42010a800002\") from node \"kubernetes-minion-group-zvl5\" : rpc error: code = Unknown desc = googleapi: Error 404: The resource 'projects/cxing-gke-dev/zones/us-central1-b/disks/pvc-1611b9f2-bd28-11e8-b491-42010a800002' was not found, notFound"
I0920 22:56:04.767207 1 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"web-server", UID:"473ee807-bd28-11e8-b491-42010a800002", APIVersion:"v1", ResourceVersion:"17718", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' AttachVolume.Attach failed for volume "pvc-1611b9f2-bd28-11e8-b491-42010a800002" : rpc error: code = Unknown desc = googleapi: Error 404: The resource 'projects/cxing-gke-dev/zones/us-central1-b/disks/pvc-1611b9f2-bd28-11e8-b491-42010a800002' was not found, notFound
...
I0920 22:56:29.036905 1 pv_controller.go:601] volume "pvc-1611b9f2-bd28-11e8-b491-42010a800002" is released and reclaim policy "Delete" will be executed
I0920 22:56:29.049355 1 pv_controller.go:824] volume "pvc-1611b9f2-bd28-11e8-b491-42010a800002" entered phase "Released"
I0920 22:56:29.054342 1 pv_controller.go:1294] isVolumeReleased[pvc-1611b9f2-bd28-11e8-b491-42010a800002]: volume is released
I0920 22:56:29.421258 1 pv_controller.go:1294] isVolumeReleased[pvc-1611b9f2-bd28-11e8-b491-42010a800002]: volume is released
I0920 22:56:29.432568 1 pv_controller.go:1294] isVolumeReleased[pvc-1611b9f2-bd28-11e8-b491-42010a800002]: volume is released
...
and the "volume is released" message just gets printed repeatedly.
/remove-lifecycle stale /lifecycle frozen
We can manually try to repro the issue by making attach fail with mock driver or something /help
Ran into a similar issue here. When we do helm delete, it deletes the pods and also runs hooks that delete the PVCs. We found that some PVs got stuck in Terminating status.
kubectl describe pv pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007
Name:            pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: cinder.csi.openstack.org
Finalizers:      [external-attacher/cinder-csi-openstack-org]
StorageClass:    cinder-az-nova
Status:          Terminating (lasts 6d11h)
Claim:           mrnorton/datadir-mongocnf-cmgo-replica-1
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        1Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            cinder.csi.openstack.org
    VolumeHandle:      b12c4cc9-9c60-44b8-81ea-f09dd2b70c7c
    ReadOnly:          false
    VolumeAttributes:  storage.kubernetes.io/csiProvisionerIdentity=1579423079658-8081-cinder.csi.openstack.org
Events:          <none>
The VolumeAttachment for this PV still exists, with a detach error.
[root@bcmt-sandbox3-3c2w2e-1912-1-control-01 ~]# kubectl get volumeattachment | grep pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007
csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc   cinder.csi.openstack.org   pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007   bcmt-sandbox3-3c2w2e-1912-1-worker-02   true   7d6h
[root@bcmt-sandbox3-3c2w2e-1912-1-control-01 ~]# kubectl describe volumeattachment csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc
Name: csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc
Namespace:
Labels:
From the csi-attacher log, the detach failed at 21:00:35.739. From the csi-provisioner log, the PV was deleted at 21:00:35.722. That means the PV was deleted before the detach ever succeeded.
I0205 21:00:30.785723 1 csi_handler.go:140] Starting detach operation for "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc"
I0205 21:00:30.785886 1 csi_handler.go:147] Detaching "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc"
I0205 21:00:35.719744 1 csi_handler.go:480] CSIHandler: processing PV "pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007": VA "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc" found
I0205 21:00:35.729002 1 csi_handler.go:480] CSIHandler: processing PV "pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007": VA "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc" found
I0205 21:00:35.739302 1 csi_handler.go:428] Saving detach error to "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc"
I0205 21:00:35.756415 1 controller.go:141] Ignoring VolumeAttachment "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc" change
I0205 21:00:35.756663 1 csi_handler.go:439] Saved detach error to "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc"
I0205 21:00:35.756689 1 csi_handler.go:99] Error processing "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc": failed to detach: rpc error: code = NotFound desc = ControllerUnpublishVolume Volume not found
I0205 21:00:36.756962 1 controller.go:175] Started VA processing "csi-bfe5b6f39e81fcf053d9542200844c6fe147276d7ad19438c37c92acc6eaf4cc"
I0205 21:00:35.722061 1 controller.go:1385] delete "pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007": persistentvolume deleted
I0205 21:00:35.722111 1 controller.go:1387] delete "pvc-31d6cb9c-1ebd-421b-a13f-b5abf23ea007": succeeded
I guess when we do helm delete, it first runs "kubectl delete pod", which triggers the volume detach, and detaching the volume takes quite a while. The post-delete hook then runs "kubectl delete pvc", which triggers deletion of the volume. So the volume gets deleted while the detach still has not received a success response.
@msau42, any plan to fix this issue?
I think the problem is in the Cinder driver: it cannot handle the case where detach is called on a volume that has already been deleted.
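For reference, here is a minimal sketch of an idempotent ControllerUnpublishVolume, assuming a hypothetical cloudAPI backend interface (illustrative only, not the Cinder driver's actual code): if the backend reports that the volume is already gone, the RPC returns success instead of NotFound, so the external-attacher can finish the detach and remove its finalizers.

package driver

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// cloudAPI is a hypothetical stand-in for the real storage backend client.
type cloudAPI interface {
	VolumeExists(volumeID string) (bool, error)
	DetachVolume(volumeID, nodeID string) error
}

type controller struct {
	cloud cloudAPI
}

func (c *controller) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error) {
	volID := req.GetVolumeId()

	exists, err := c.cloud.VolumeExists(volID)
	if err != nil {
		return nil, status.Errorf(codes.Internal, "looking up volume %s: %v", volID, err)
	}
	if !exists {
		// The volume was already deleted, so there is nothing left to detach.
		// Returning success (rather than NotFound) lets the external-attacher
		// remove the VolumeAttachment finalizer, which in turn unblocks the
		// PV finalizer and the Terminating PV.
		return &csi.ControllerUnpublishVolumeResponse{}, nil
	}

	if err := c.cloud.DetachVolume(volID, req.GetNodeId()); err != nil {
		return nil, status.Errorf(codes.Internal, "detaching volume %s from node %s: %v", volID, req.GetNodeId(), err)
	}
	return &csi.ControllerUnpublishVolumeResponse{}, nil
}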
Thanks @msau42, I'll open a ticket against the Cinder driver.
Any update on this???
This looks like a CSI-driver-specific issue (Cinder), per the previous comment. Closing this issue based on that.