[BUG] "failed to cleanup service csi-attacher: Foreground deletion of service csi-attacher timed out"
Describe the bug
After upgrading to v1.1.1, the longhorn-driver-deployer pod continually crashes and never finishes.
To Reproduce
Sadly, I don't know of a specific way to reproduce this. I was having issues with my cluster when I upgraded, which didn't show up until after I started the upgrade: my Calico configuration was using IP addresses from the wrong interface, which caused some really odd issues throughout the cluster, and I ended up force-killing some of the pods. I suspect that contributed to getting into this state, but I have no idea how to resolve the issue at this point.
Expected behavior
Obviously the driver deployer should not crash and should complete as expected =] Since I don't actually know what it does, I'm not sure what that means, really.
Log
2021/05/03 18:27:36 proto: duplicate proto type registered: VersionResponse
W0503 18:27:36.570867 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2021-05-03T18:27:36Z" level=debug msg="Deploying CSI driver"
time="2021-05-03T18:27:36Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-03T18:27:37Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-03T18:27:38Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-05-03T18:27:39Z" level=info msg="Proc found: kubelet"
time="2021-05-03T18:27:39Z" level=info msg="Try to find arg [--root-dir] in cmdline: [/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock ]"
time="2021-05-03T18:27:39Z" level=warning msg="Cmdline of proc kubelet found: \"/usr/bin/kubelet\x00--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf\x00--kubeconfig=/etc/kubernetes/kubelet.conf\x00--config=/var/lib/kubelet/config.yaml\x00--container-runtime=remote\x00--container-runtime-endpoint=/run/containerd/containerd.sock\x00\". But arg \"--root-dir\" not found. Hence default value will be used: \"/var/lib/kubelet\""
time="2021-05-03T18:27:39Z" level=info msg="Detected root dir path: /var/lib/kubelet"
time="2021-05-03T18:27:39Z" level=info msg="Upgrading Longhorn related components for CSI v1.1.0"
time="2021-05-03T18:27:39Z" level=debug msg="Deleting existing CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Deleted CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Waiting for foreground deletion of CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Deleted CSI Driver driver.longhorn.io in foreground"
time="2021-05-03T18:27:39Z" level=debug msg="Creating CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Created CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Waiting for foreground deletion of service csi-attacher"
time="2021-05-03T18:29:40Z" level=fatal msg="Error deploying driver: failed to start CSI driver: failed to deploy service csi-attacher: failed to cleanup service csi-attacher: Foreground deletion of service csi-attacher timed out"
Environment:
- Longhorn version: v1.1.1 (coming from v1.1.1-rc1)
- Installation method: kubectl
- Kubernetes distro: kubeadm, v1.21.0
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 4 (plus the 3 management nodes, which are also worker nodes)
- Node config
- OS type and version: Ubuntu 20.04.2 LTS
- CPU per node: varies
- Memory per node: varies
- Disk type (e.g. SSD/NVMe): SSD/NVMe, some magnetic in RAID 0 (which work well, btw)
- Network bandwidth between the nodes: 10GbE
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: ~15
Additional context
Not sure what else would be useful; I am available in the Slack channel for discussion if that would help. I have things finally stable now, but would really like to have this fixed, as I don't know what issues it will cause.
From the log, it seems like it can't reach the Service csi-attacher.
- Are you able to use kubectl to check that the Service csi-attacher exists?
- Moreover, is the longhorn-manager Pod able to access the Service csi-attacher? For example, log in to one of the longhorn-manager Pods and run nslookup csi-attacher. I'm worried that your Kubernetes cluster network configuration is not correct.
$ k -n longhorn-system get pods -l app=csi-attacher
NAME READY STATUS RESTARTS AGE
csi-attacher-869cccc7c9-9mn7l 1/1 Running 0 46h
csi-attacher-869cccc7c9-hgfnx 1/1 Running 0 46h
csi-attacher-869cccc7c9-zx6lr 1/1 Running 0 46h
$ kubectl -n longhorn-system exec -it longhorn-manager-7x9z2 -- bash
root@longhorn-manager-7x9z2:/# nslookup csi-attacher
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: csi-attacher.longhorn-system.svc.cluster.local
Address: 10.109.227.49
root@longhorn-manager-7x9z2:/# curl http://csi-attacher:12345
curl: (7) Failed to connect to csi-attacher port 12345: Connection refused
.. of course, the service names that port "dummy", so I'm guessing that doesn't actually mean anything?
More to the point on the DNS, though, the failing pod can also access it (if I catch it before it fails):
$ kubectl -n longhorn-system exec -it longhorn-driver-deployer-6c945db7f6-mrkgq -- bash
root@longhorn-driver-deployer-6c945db7f6-mrkgq:/# nslookup csi-attacher
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: csi-attacher.longhorn-system.svc.cluster.local
Address: 10.109.227.49
Could you please get the YAML manifest of the Service csi-attacher? I'd like to check whether the deletionTimestamp metadata is set or not.
Ref to:
- https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion
- https://github.com/longhorn/longhorn-manager/blob/v1.1.1/csi/deployment_util.go#L261-L263
Once the "deletion in progress" state is set, the garbage collector deletes the object's dependents. Once the garbage collector has deleted all "blocking" dependents (objects with ownerReference.blockOwnerDeletion=true), it deletes the owner object.
I'm worried that a dependent object can't be deleted within 120 secs, so the timeout is triggered.
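For reference, the relevant metadata can be checked with plain kubectl (using the longhorn-system namespace as in the commands above):
$ kubectl -n longhorn-system get service csi-attacher -o yaml
$ kubectl -n longhorn-system get service csi-attacher -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'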
apiVersion: v1
kind: Service
metadata:
annotations:
driver.longhorn.io/kubernetes-version: v1.20.5
driver.longhorn.io/version: v1.1.1
creationTimestamp: "2021-05-02T09:27:02Z"
deletionGracePeriodSeconds: 0
deletionTimestamp: "2021-05-02T10:55:18Z"
finalizers:
- foregroundDeletion
labels:
app: csi-attacher
longhorn.io/managed-by: longhorn-manager
managedFields:
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:driver.longhorn.io/kubernetes-version: {}
f:driver.longhorn.io/version: {}
f:labels:
.: {}
f:app: {}
f:longhorn.io/managed-by: {}
f:spec:
f:ports:
.: {}
k:{"port":12345,"protocol":"TCP"}:
.: {}
f:name: {}
f:port: {}
f:protocol: {}
f:targetPort: {}
f:selector:
.: {}
f:app: {}
f:sessionAffinity: {}
f:type: {}
manager: longhorn-manager
operation: Update
time: "2021-05-02T09:27:01Z"
name: csi-attacher
namespace: longhorn-system
resourceVersion: "108540171"
uid: f5ad7a41-8e0f-4e7f-a3d8-7b2aeb8d043b
spec:
clusterIP: 10.109.227.49
clusterIPs:
- 10.109.227.49
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- name: dummy
port: 12345
protocol: TCP
targetPort: 12345
selector:
app: csi-attacher
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
I removed the deletionTimestamp and finalizers; it seems to have made it further, so we'll see if it finishes this time =]
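(For anyone hitting the same thing: one way to clear a stuck foregroundDeletion finalizer is a JSON patch along these lines; this is a sketch, not necessarily the exact steps used here. Removing the finalizer is what actually lets the pending deletion complete, since the API server manages deletionTimestamp itself, and the deployer recreates the Service on its next run.)
$ kubectl -n longhorn-system patch service csi-attacher --type=json -p '[{"op":"remove","path":"/metadata/finalizers"}]'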
It didn't finish and is now back doing the same as it originally did; unfortunately I didn't catch the logs before the container restarted, so I'll need to try it again when I can keep my terminal open until it dies.
richard@nebrask:~$ k -n longhorn-system logs -f longhorn-driver-deployer-6c945db7f6-gtf2g
2021/05/06 16:09:09 proto: duplicate proto type registered: VersionResponse
W0506 16:09:09.454949 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2021-05-06T16:09:09Z" level=debug msg="Deploying CSI driver"
time="2021-05-06T16:09:09Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-06T16:09:10Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-06T16:09:11Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-05-06T16:09:12Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-05-06T16:09:13Z" level=info msg="Proc found: kubelet"
time="2021-05-06T16:09:13Z" level=info msg="Try to find arg [--root-dir] in cmdline: [/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock ]"
time="2021-05-06T16:09:13Z" level=warning msg="Cmdline of proc kubelet found: \"/usr/bin/kubelet\x00--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf\x00--kubeconfig=/etc/kubernetes/kubelet.conf\x00--config=/var/lib/kubelet/config.yaml\x00--container-runtime=remote\x00--container-runtime-endpoint=/run/containerd/containerd.sock\x00\". But arg \"--root-dir\" not found. Hence default value will be used: \"/var/lib/kubelet\""
time="2021-05-06T16:09:13Z" level=info msg="Detected root dir path: /var/lib/kubelet"
time="2021-05-06T16:09:13Z" level=info msg="Upgrading Longhorn related components for CSI v1.1.0"
time="2021-05-06T16:09:13Z" level=debug msg="Deleting existing CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Deleted CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Waiting for foreground deletion of CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Deleted CSI Driver driver.longhorn.io in foreground"
time="2021-05-06T16:09:13Z" level=debug msg="Creating CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Created CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Deleting existing service csi-attacher"
time="2021-05-06T16:09:13Z" level=debug msg="Deleted service csi-attacher"
time="2021-05-06T16:09:13Z" level=debug msg="Waiting for foreground deletion of service csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Deleted service csi-attacher in foreground"
time="2021-05-06T16:09:53Z" level=debug msg="Creating service csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Created service csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Deleting existing deployment csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Deleted deployment csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Waiting for foreground deletion of deployment csi-attacher"
time="2021-05-06T16:10:05Z" level=debug msg="Deleted deployment csi-attacher in foreground"
time="2021-05-06T16:10:05Z" level=debug msg="Creating deployment csi-attacher"
time="2021-05-06T16:10:05Z" level=debug msg="Created deployment csi-attacher"
time="2021-05-06T16:10:05Z" level=debug msg="Deleting existing service csi-provisioner"
time="2021-05-06T16:10:05Z" level=debug msg="Deleted service csi-provisioner"
time="2021-05-06T16:10:05Z" level=debug msg="Waiting for foreground deletion of service csi-provisioner"
time="2021-05-06T16:12:06Z" level=fatal msg="Error deploying driver: failed to start CSI driver: failed to deploy service csi-provisioner: failed to cleanup service csi-provisioner: Foreground deletion of service csi-provisioner timed out"
I removed the deletionTimestamp and finalizers; it seems to have made it further, so we'll see if it finishes this time =]
It's strange 🤔
You mean even after you removed the deletionTimestamp and finalizers, the error still exists?
Is there another controller in the Kubernetes cluster that would configure the deletionTimestamp and finalizers for the Service csi-provisioner?
I removed those, then let the longhorn-driver-deployer pod run again and it gave that error, presumably putting the deletionTimestamp and finalizers back.
Is there something I can do to "kick" it while it's "Waiting for foreground deletion of service csi-provisioner"? I tried actually deleting the csi-provisioner service, but that didn't help -- and I didn't see it get recreated, so I put it back.
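(In case it helps with debugging that wait: foreground deletion only blocks on dependents with blockOwnerDeletion set to true, so one thing worth checking, assuming jq is available, is whether anything still lists the Service as an owner, for example its EndpointSlices:)
$ kubectl -n longhorn-system get endpointslices -o json | jq '.items[] | select(.metadata.ownerReferences[]?.name == "csi-provisioner") | {name: .metadata.name, ownerReferences: .metadata.ownerReferences}'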
I saw that the service csi-attacher was finally deleted. But the next error is for csi-provisioner; could you please do the same operation as you did on csi-attacher?
I did that and just keep getting the same thing for csi-provisioner
I did an upgrade test on microk8s v1.21.0 (v1.21.0-3+121713cef81e03) with commands:
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1-rc1/deploy/longhorn.yaml
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1/deploy/longhorn.yaml
However, I can't reproduce it. :thinking:
cc @meldafrawi @khushboo-rancher
I did another upgrade test on k3s v1.21.0 (v1.21.0+k3s1) with commands:
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1-rc1/deploy/longhorn.yaml
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1/deploy/longhorn.yaml
Still can't reproduce it.
Hello, I'm chasing a similar problem - please let me know if I should create a separate issue. Happy to generate support bundles etc, just give me guidance.
The issue emerged after upgrading from Longhorn v1.1.2 to v1.2.2. I actually attempted to upgrade from 1.1.2 to 1.2.0 but encountered issues related to backups disappearing, so I rolled back and waited for the 1.2.2 release.
I am running k3s v1.21.4 on Ubuntu 20.04 nodes in a self-hosted environment.
No matter what I do - delete pods or redeploy yaml - I always end up in the same state.
ubuntu-admin@k3s-server-32:~/deployments$ k logs -f longhorn-driver-deployer-b8bcc7845-8g5cq -n longhorn-system
2021/11/16 19:07:15 proto: duplicate proto type registered: VersionResponse
W1116 19:07:15.830501 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2021-11-16T19:07:15Z" level=debug msg="Deploying CSI driver"
time="2021-11-16T19:07:16Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-11-16T19:07:17Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-11-16T19:07:18Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-11-16T19:07:19Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-11-16T19:07:20Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-11-16T19:07:21Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-11-16T19:07:22Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-11-16T19:07:23Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-11-16T19:07:24Z" level=warning msg="Proc not found: kubelet"
time="2021-11-16T19:07:24Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2021-11-16T19:07:25Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2021-11-16T19:07:26Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2021-11-16T19:07:27Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2021-11-16T19:07:28Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2021-11-16T19:07:29Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2021-11-16T19:07:30Z" level=info msg="Proc found: k3s"
time="2021-11-16T19:07:30Z" level=info msg="Detected root dir path: /var/lib/kubelet"
time="2021-11-16T19:07:30Z" level=info msg="Upgrading Longhorn related components for CSI v1.1.0"
time="2021-11-16T19:07:30Z" level=debug msg="Deleting existing CSI Driver driver.longhorn.io"
time="2021-11-16T19:07:30Z" level=debug msg="Deleted CSI Driver driver.longhorn.io"
time="2021-11-16T19:07:30Z" level=debug msg="Waiting for foreground deletion of CSI Driver driver.longhorn.io"
time="2021-11-16T19:07:30Z" level=debug msg="Deleted CSI Driver driver.longhorn.io in foreground"
time="2021-11-16T19:07:30Z" level=debug msg="Creating CSI Driver driver.longhorn.io"
time="2021-11-16T19:07:30Z" level=debug msg="Created CSI Driver driver.longhorn.io"
time="2021-11-16T19:07:30Z" level=debug msg="Deleting existing service csi-attacher"
time="2021-11-16T19:07:30Z" level=debug msg="Deleted service csi-attacher"
time="2021-11-16T19:07:30Z" level=debug msg="Waiting for foreground deletion of service csi-attacher"
time="2021-11-16T19:07:31Z" level=debug msg="Deleted service csi-attacher in foreground"
time="2021-11-16T19:07:31Z" level=debug msg="Creating service csi-attacher"
time="2021-11-16T19:07:31Z" level=debug msg="Created service csi-attacher"
time="2021-11-16T19:07:31Z" level=debug msg="Deleting existing deployment csi-attacher"
time="2021-11-16T19:07:31Z" level=debug msg="Deleted deployment csi-attacher"
time="2021-11-16T19:07:31Z" level=debug msg="Waiting for foreground deletion of deployment csi-attacher"
time="2021-11-16T19:07:44Z" level=debug msg="Deleted deployment csi-attacher in foreground"
time="2021-11-16T19:07:44Z" level=debug msg="Creating deployment csi-attacher"
time="2021-11-16T19:07:44Z" level=debug msg="Created deployment csi-attacher"
time="2021-11-16T19:07:44Z" level=debug msg="Deleting existing service csi-provisioner"
time="2021-11-16T19:07:44Z" level=debug msg="Deleted service csi-provisioner"
time="2021-11-16T19:07:44Z" level=debug msg="Waiting for foreground deletion of service csi-provisioner"
time="2021-11-16T19:07:45Z" level=debug msg="Deleted service csi-provisioner in foreground"
time="2021-11-16T19:07:45Z" level=debug msg="Creating service csi-provisioner"
time="2021-11-16T19:07:46Z" level=debug msg="Created service csi-provisioner"
time="2021-11-16T19:07:46Z" level=debug msg="Deleting existing deployment csi-provisioner"
time="2021-11-16T19:07:46Z" level=debug msg="Deleted deployment csi-provisioner"
time="2021-11-16T19:07:46Z" level=debug msg="Waiting for foreground deletion of deployment csi-provisioner"
time="2021-11-16T19:07:55Z" level=debug msg="Deleted deployment csi-provisioner in foreground"
time="2021-11-16T19:07:55Z" level=debug msg="Creating deployment csi-provisioner"
time="2021-11-16T19:07:55Z" level=debug msg="Created deployment csi-provisioner"
time="2021-11-16T19:07:55Z" level=debug msg="Deleting existing service csi-resizer"
time="2021-11-16T19:07:55Z" level=debug msg="Deleted service csi-resizer"
time="2021-11-16T19:07:55Z" level=debug msg="Waiting for foreground deletion of service csi-resizer"
time="2021-11-16T19:07:56Z" level=debug msg="Deleted service csi-resizer in foreground"
time="2021-11-16T19:07:56Z" level=debug msg="Creating service csi-resizer"
time="2021-11-16T19:07:56Z" level=debug msg="Created service csi-resizer"
time="2021-11-16T19:07:56Z" level=debug msg="Deleting existing deployment csi-resizer"
time="2021-11-16T19:07:56Z" level=debug msg="Deleted deployment csi-resizer"
time="2021-11-16T19:07:56Z" level=debug msg="Waiting for foreground deletion of deployment csi-resizer"
time="2021-11-16T19:08:05Z" level=debug msg="Deleted deployment csi-resizer in foreground"
time="2021-11-16T19:08:05Z" level=debug msg="Creating deployment csi-resizer"
time="2021-11-16T19:08:05Z" level=debug msg="Created deployment csi-resizer"
time="2021-11-16T19:08:05Z" level=debug msg="Deleting existing service csi-snapshotter"
time="2021-11-16T19:08:05Z" level=debug msg="Deleted service csi-snapshotter"
time="2021-11-16T19:08:05Z" level=debug msg="Waiting for foreground deletion of service csi-snapshotter"
time="2021-11-16T19:08:05Z" level=debug msg="Deleted service csi-snapshotter in foreground"
time="2021-11-16T19:08:05Z" level=debug msg="Creating service csi-snapshotter"
time="2021-11-16T19:08:05Z" level=debug msg="Created service csi-snapshotter"
time="2021-11-16T19:08:05Z" level=debug msg="Waiting for foreground deletion of deployment csi-snapshotter"
time="2021-11-16T19:10:07Z" level=fatal msg="Error deploying driver: failed to start CSI driver: failed to deploy deployment csi-snapshotter: failed to cleanup deployment csi-snapshotter: Foreground deletion of deployment csi-snapshotter timed out"
Upgraded my cluster to k3s v1.21.5 and the deployment completed.
So I no longer have an issue, but leaving my comment above for posterity.
Thank you @riazarbi. It looks like a Kubernetes version issue?
@taxilian Did you fix this issue in the end?
Ultimately I had too many issues with Longhorn; there's a lot I like about it, but it wasn't reliable enough for my needs, so I've switched to Rook/Ceph.
@taxilian Appreciate your feedback! It is sad to see you are leaving. Reliability is our top priority and we will continue working toward this goal.
I encountered a similar issue with v1.3.1 on k3s 1.24. It's possible that having FluxCD set to automatically upgrade the Helm chart broke something (I'd originally started with 1.2.3).
Eventually I deleted all of the CSI resources one by one using kubectl until the deployer managed to finish. I ended up having to delete the deployments, replicasets and pods of the csi-attacher, csi-provisioner, csi-resizer and csi-snapshotter individually, since the deletion wasn't cascading for some reason. Then I deleted the associated services and finally the longhorn-csi-plugin daemonset and pods (again, the cascading delete was somehow broken).
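(Roughly, that cleanup looks like the following; this is a sketch assuming the default longhorn-system namespace and that the other CSI components use the same app=<name> label convention seen for csi-attacher earlier in this thread. The driver deployer recreates these objects on its next run.)
$ kubectl -n longhorn-system delete deployment csi-attacher csi-provisioner csi-resizer csi-snapshotter
$ kubectl -n longhorn-system delete replicaset,pod -l 'app in (csi-attacher,csi-provisioner,csi-resizer,csi-snapshotter)'
$ kubectl -n longhorn-system delete service csi-attacher csi-provisioner csi-resizer csi-snapshotter
$ kubectl -n longhorn-system delete daemonset longhorn-csi-plugin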
cc @mantissahz Can we put this ticket in the community meeting tomorrow?