gluster-csi-driver
Single PVC going into Pending state after continuous creation and deletion
A PVC goes into Pending state after creating and deleting a single PVC in a loop.
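For reference, a rough reproduction sketch (the claim name and StorageClass are taken from the describe output below; the 1Gi size, access mode, and sleep interval are assumptions, since the actual manifest is not included in the report):

# assumed reproduction loop: create and delete the same claim until it sticks in Pending
while true; do
  cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gcs-pvc1
spec:
  storageClassName: glusterfs-csi
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF
  sleep 10
  kubectl delete pvc gcs-pvc1
  sleep 10
done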
[vagrant@kube1 ~]$ kubectl describe pvc
Name: gcs-pvc1
Namespace: default
StorageClass: glusterfs-csi
Status: Pending
Volume:
Labels: <none>
Annotations: volume.beta.kubernetes.io/storage-provisioner: org.gluster.glusterfs
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 8s (x3 over 20s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "org.gluster.glusterfs" or manually created by system administrator
Mounted By: <none>
[vagrant@kube1 ~]$
[vagrant@kube1 ~]$ kubectl describe pvc
Name: gcs-pvc1
Namespace: default
StorageClass: glusterfs-csi
Status: Pending
Volume:
Labels: <none>
Annotations: control-plane.alpha.kubernetes.io/leader:
{"holderIdentity":"4c17a0c4-d2a6-11e8-86f7-0a580ae94209","leaseDurationSeconds":15,"acquireTime":"2018-10-18T08:23:09Z","renewTime":"2018-...
volume.beta.kubernetes.io/storage-provisioner: org.gluster.glusterfs
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 19m (x25 over 22m) persistentvolume-controller waiting for a volume to be created, either by external provisioner "org.gluster.glusterfs" or manually created by system administrator
Normal ExternalProvisioning 16m (x19 over 17m) persistentvolume-controller waiting for a volume to be created, either by external provisioner "org.gluster.glusterfs" or manually created by system administrator
Normal ExternalProvisioning 4m5s (x221 over 14m) persistentvolume-controller waiting for a volume to be created, either by external provisioner "org.gluster.glusterfs" or manually created by system administrator
Warning ProvisioningFailed 3m42s (x10 over 15m) org.gluster.glusterfs csi-provisioner-glusterfsplugin-0 4c17a0c4-d2a6-11e8-86f7-0a580ae94209 Failed to provision volume with StorageClass "glusterfs-csi": rpc error: code = Internal desc = error in fetching volume details context deadline exceeded
Normal Provisioning 92s (x13 over 15m) org.gluster.glusterfs csi-provisioner-glusterfsplugin-0 4c17a0c4-d2a6-11e8-86f7-0a580ae94209 External provisioner is provisioning volume for claim "default/gcs-pvc1"
Warning ProvisioningFailed 92s (x3 over 8m17s) org.gluster.glusterfs csi-provisioner-glusterfsplugin-0 4c17a0c4-d2a6-11e8-86f7-0a580ae94209 Failed to provision volume with StorageClass "glusterfs-csi": rpc error: code = Internal desc = error in fetching volume details Get http://glusterd2-client.gcs:24007/v1/volumes/pvc-36fc6b59d2b011e8/status: dial tcp 10.233.60.197:24007: connect: connection refused
Mounted By: <none>
[vagrant@kube1 ~]$
Debugging info needed:
- Status of the GCS pods: kubectl get -n gcs all
- Logs from the CSI driver and gluster pods (see the example commands below)
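For example (a sketch; the pod names are taken from the kubectl get -n gcs all output later in this thread, and --all-containers is used because the CSI pods run two containers whose names are not shown here):

# status of the GCS pods
kubectl get -n gcs all
# logs from both containers of the CSI provisioner pod
kubectl logs -n gcs csi-provisioner-glusterfsplugin-0 --all-containers=true
# logs from the glusterd2 pods
kubectl logs -n gcs gluster-kube1-0
kubectl logs -n gcs gluster-kube2-0
kubectl logs -n gcs gluster-kube3-0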
From the PVC describe output, this looks like an issue with glusterd2 and the etcd cluster.
Failed to provision volume with StorageClass "glusterfs-csi": rpc error: code = Internal desc = error in fetching volume details context deadline exceeded
This is the error message from glusterd2 when it is not able to connect to etcd.
fetching volume details Get http://glusterd2-client.gcs:24007/v1/volumes/pvc-36fc6b59d2b011e8/status: dial tcp 10.233.60.197:24007: connect: connection refused
This is the error message seen when the CSI driver is not able to connect to the glusterd2 pod.
As per my understanding, this is not an issue with the CSI driver; rather, it is an issue in the glusterd2 cluster.
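One way to tell the two failure modes apart at the time of the error (a sketch; the service names and ports are taken from the kubectl get -n gcs all output below, and it assumes curl is available inside the gluster container):

# does the glusterd2 REST endpoint answer at all? (CSI driver -> glusterd2 path)
kubectl exec -n gcs gluster-kube1-0 -- curl -s http://glusterd2-client.gcs:24007/v1/volumes
# is etcd healthy? (glusterd2 -> etcd path)
kubectl exec -n gcs gluster-kube1-0 -- curl -s http://etcd-client.gcs:2379/health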
@aravindavk #80 could possibly be related to the same issue. I have a question: why are we even querying etcd every time we have successfully created the volume? We already create a volinfo in one of the transaction steps during volume-create, so the additional etcd queries could be avoided, I guess.
[vagrant@kube1 ~]$ kubectl get -n gcs all
NAME READY STATUS RESTARTS AGE
pod/csi-attacher-glusterfsplugin-0 2/2 Running 0 12h
pod/csi-nodeplugin-glusterfsplugin-4dkdj 2/2 Running 0 12h
pod/csi-nodeplugin-glusterfsplugin-998k2 2/2 Running 0 12h
pod/csi-nodeplugin-glusterfsplugin-gsv64 2/2 Running 0 12h
pod/csi-provisioner-glusterfsplugin-0 2/2 Running 0 12h
pod/etcd-7xhlbpk8t5 1/1 Running 0 12h
pod/etcd-d6tgccsrkv 1/1 Running 0 12h
pod/etcd-operator-7cb5bd459b-zm4xn 1/1 Running 0 12h
pod/etcd-vpnmtxxpmz 1/1 Running 0 12h
pod/gluster-kube1-0 1/1 Running 1 12h
pod/gluster-kube2-0 1/1 Running 0 12h
pod/gluster-kube3-0 1/1 Running 0 12h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/etcd ClusterIP None <none> 2379/TCP,2380/TCP 12h
service/etcd-client ClusterIP 10.233.45.50 <none> 2379/TCP 12h
service/glusterd2 ClusterIP None <none> 24007/TCP,24008/TCP 12h
service/glusterd2-client ClusterIP 10.233.44.221 <none> 24007/TCP 12h
service/glusterd2-client-nodeport NodePort 10.233.41.38 <none> 24007:31007/TCP 12h
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/csi-nodeplugin-glusterfsplugin 3 3 3 3 3 <none> 12h
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deployment.apps/etcd-operator 1 1 1 1 12h
NAME DESIRED CURRENT READY AGE
replicaset.apps/etcd-operator-7cb5bd459b 1 1 1 12h
NAME DESIRED CURRENT AGE
statefulset.apps/csi-attacher-glusterfsplugin 1 1 12h
statefulset.apps/csi-provisioner-glusterfsplugin 1 1 12h
statefulset.apps/gluster-kube1 1 1 12h
statefulset.apps/gluster-kube2 1 1 12h
statefulset.apps/gluster-kube3 1 1 12h
[vagrant@kube1 ~]$
From the logs provided:
E1022 08:48:31.420202 1 controller.go:1138] Deletion of volume "pvc-032ca506d5d211e8" failed: rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused
E1022 08:48:31.420399 1 goroutinemap.go:150] Operation for "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]" failed. No retries permitted until 2018-10-22 08:50:33.420330939 +0000 UTC m=+45493.134694769 (durationBeforeRetry 2m2s). Error: "rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused"
I1022 08:48:46.388560 1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:48:46.388682 1 controller.go:1174] Error scheduling operaion "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]": Failed to create operation with name "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]". An operation with that name failed at 2018-10-22 08:48:31.420330939 +0000 UTC m=+45371.134694769. No retries permitted until 2018-10-22 08:50:33.420330939 +0000 UTC m=+45493.134694769 (2m2s). Last error: "rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused".
I1022 08:49:01.397761 1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:49:01.397838 1 controller.go:1174] Error scheduling operaion "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]": Failed to create operation with name "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]". An operation with that name failed at 2018-10-22 08:48:31.420330939 +0000 UTC m=+45371.134694769. No retries permitted until 2018-10-22 08:50:33.420330939 +0000 UTC m=+45493.134694769 (2m2s). Last error: "rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused".
I1022 08:49:16.415974 1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:49:16.416053 1 controller.go:1174] Error scheduling operaion "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]": Failed to create operation with name "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]". An operation with that name failed at 2018-10-22 08:48:31.420330939 +0000 UTC m=+45371.134694769. No retries permitted until 2018-10-22 08:50:33.420330939 +0000 UTC m=+45493.134694769 (2m2s). Last error: "rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused".
I1022 08:49:31.505858 1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:49:31.505957 1 controller.go:1174] Error scheduling operaion "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]": Failed to create operation with name "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]". An operation with that name failed at 2018-10-22 08:48:31.420330939 +0000 UTC m=+45371.134694769. No retries permitted until 2018-10-22 08:50:33.420330939 +0000 UTC m=+45493.134694769 (2m2s). Last error: "rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused".
This is a connection issue between glusterd2 and the CSI provisioner.
I feel this is not an issue with the CSI driver. @JohnStrunk, any thoughts on this one?
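Since the kubectl get -n gcs all output above shows gluster-kube1-0 with one restart, one thing worth checking (a sketch) is whether glusterd2 crashed or was restarted around the time of the failures:

# logs from the previous instance of the restarted glusterd2 container
kubectl logs -n gcs gluster-kube1-0 --previous
# last termination reason / restart timestamps
kubectl describe pod -n gcs gluster-kube1-0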
First thing I see is:
I1022 07:47:58.161467 1 controller.go:900] volume "pvc-c9630f0cd5ce11e8" for claim "default/gcs-pvc1" created
I1022 07:47:58.185548 1 controller.go:917] volume "pvc-c9630f0cd5ce11e8" for claim "default/gcs-pvc1" saved
I1022 07:47:58.185606 1 controller.go:953] volume "pvc-c9630f0cd5ce11e8" provisioned for claim "default/gcs-pvc1"
I1022 07:47:58.295398 1 leaderelection.go:198] stopped trying to renew lease to provision for pvc default/gcs-pvc1, task succeeded
I1022 07:48:16.056529 1 controller.go:1167] scheduleOperation[delete-pvc-c9630f0cd5ce11e8[cb109aab-d5ce-11e8-ab51-5254008c51c3]]
I1022 07:48:16.255186 1 controller.go:1167] scheduleOperation[delete-pvc-c9630f0cd5ce11e8[cb109aab-d5ce-11e8-ab51-5254008c51c3]]
I1022 07:48:17.227000 1 controller.go:1143] volume "pvc-c9630f0cd5ce11e8" deleted
I1022 07:48:17.257527 1 controller.go:1154] volume "pvc-c9630f0cd5ce11e8" deleted from database
I1022 07:48:17.258480 1 controller.go:1167] scheduleOperation[delete-pvc-c9630f0cd5ce11e8[cb109aab-d5ce-11e8-ab51-5254008c51c3]]
I1022 07:48:17.296047 1 controller.go:1167] scheduleOperation[lock-provision-default/gcs-pvc1[d60b807c-d5ce-11e8-a89d-5254003f14bf]]
I1022 07:48:17.328763 1 controller.go:1143] volume "pvc-c9630f0cd5ce11e8" deleted
I1022 07:48:17.366335 1 controller.go:1150] failed to delete volume "pvc-c9630f0cd5ce11e8" from database: persistentvolumes "pvc-c9630f0cd5ce11e8" not found
I'm not sure why it failed to delete it from the "database" or even what database this is...
Seems like the real problems start here:
I1022 08:31:24.292024 1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
I1022 08:31:31.306224 1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
I1022 08:31:31.306300 1 controller.go:1167] scheduleOperation[delete-pvc-f683aab6d5d111e8[bcb42c53-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:31:42.653042 1 leaderelection.go:273] Failed to update lock: Operation cannot be fulfilled on persistentvolumeclaims "gcs-pvc1": the object has been modified; please apply your changes to the latest version and try again
E1022 08:31:42.679760 1 leaderelection.go:273] Failed to update lock: Operation cannot be fulfilled on persistentvolumeclaims "gcs-pvc1": the object has been modified; please apply your changes to the latest version and try again
I1022 08:31:42.722991 1 controller.go:1154] volume "pvc-f683aab6d5d111e8" deleted from database
I1022 08:31:42.727233 1 controller.go:1167] scheduleOperation[delete-pvc-f683aab6d5d111e8[bcb42c53-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:31:42.745796 1 leaderelection.go:273] Failed to update lock: Operation cannot be fulfilled on persistentvolumeclaims "gcs-pvc1": the object has been modified; please apply your changes to the latest version and try again
I1022 08:31:42.773592 1 leaderelection.go:156] attempting to acquire leader lease...
I1022 08:31:42.774356 1 controller.go:999] claim default/gcs-pvc1 was deleted
I1022 08:31:43.527429 1 controller.go:1143] volume "pvc-f683aab6d5d111e8" deleted
I1022 08:31:45.249844 1 leaderelection.go:163] stopped trying to acquire lease to provision for pvc default/gcs-pvc1, task succeeded
I1022 08:31:45.249974 1 controller.go:994] claim default/gcs-pvc1 was modified to not ask for this provisioner
I1022 08:31:45.762912 1 leaderelection.go:163] stopped trying to acquire lease to provision for pvc default/gcs-pvc1, task succeeded
I1022 08:31:45.763047 1 controller.go:994] claim default/gcs-pvc1 was modified to not ask for this provisioner
I1022 08:31:45.885358 1 leaderelection.go:163] stopped trying to acquire lease to provision for pvc default/gcs-pvc1, task succeeded
I1022 08:31:45.885477 1 controller.go:994] claim default/gcs-pvc1 was modified to not ask for this provisioner
I1022 08:31:46.306409 1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:31:52.624897 1 controller.go:1138] Deletion of volume "pvc-032ca506d5d211e8" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E1022 08:31:52.626702 1 goroutinemap.go:150] Operation for "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]" failed. No retries permitted until 2018-10-22 08:31:53.125386042 +0000 UTC m=+44372.839749882 (durationBeforeRetry 500ms). Error: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
W1022 08:31:52.734114 1 controller.go:354] CreateVolume timeout: 10s has expired, operation will be retried
I1022 08:32:01.306706 1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
W1022 08:32:07.734855 1 controller.go:354] CreateVolume timeout: 10s has expired, operation will be retried
I1022 08:32:16.306854 1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:32:22.135344 1 controller.go:895] Failed to provision volume for claim "default/gcs-pvc1" with StorageClass "glusterfs-csi": rpc error: code = Internal desc = error in fetching volume details context deadline exceeded
E1022 08:32:22.137797 1 goroutinemap.go:150] Operation for "provision-default/gcs-pvc1[b4d70d4e-d5d2-11e8-a89d-5254003f14bf]" failed. No retries permitted until 2018-10-22 08:32:22.637746324 +0000 UTC m=+44402.352110180 (durationBeforeRetry 500ms). Error: "rpc error: code = Internal desc = error in fetching volume details context deadline exceeded"
I1022 08:32:23.251369 1 leaderelection.go:156] attempting to acquire leader lease...
I1022 08:32:23.252056 1 controller.go:999] claim default/gcs-pvc1 was deleted
I1022 08:32:31.307239 1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:32:33.337181 1 controller.go:895] Failed to provision volume for claim "default/gcs-pvc1" with StorageClass "glusterfs-csi": rpc error: code = Internal desc = error in fetching volume details context deadline exceeded
From my reading, the operation times out (10s) and gets scheduled for a retry (which is ok); then, later, we start getting connection refused. The thing to remember is that the CSI driver is talking to the GD2 service IP, so as long as any GD2 pod is up, it should be able to connect... connection refused would indicate they are all down.
I suspect it's a GD2 or etcd problem causing the failure.
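A quick way to check that theory the next time it happens (a sketch, using the service name from the kubectl get output above) is to look at the endpoints behind the glusterd2-client service and the readiness of the gluster pods; if the endpoints list is empty, nothing is backing the service and connection refused is expected.

# no addresses listed here means no ready GD2 pod is backing the service
kubectl get endpoints -n gcs glusterd2-client
kubectl get pods -n gcs -o wide | grep gluster-kube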