
[BUG] "failed to cleanup service csi-attacher: Foreground deletion of service csi-attacher timed out"

Open taxilian opened this issue 4 years ago • 21 comments

Describe the bug

After upgrading to v1.1.1, the longhorn-driver-deployer pod continually crashes and never finishes.

To Reproduce

Sadly, I don't know of a specific way to reproduce this. I was having issues with my cluster when I upgraded, which didn't show up until after I started the upgrade: my calico configuration was using IP addresses from the wrong interface, which caused some really odd issues throughout the cluster, and I ended up force-killing some of the pods. I suspect that contributed to getting into this state, but I have no idea how to resolve the issue at this point.

Expected behavior

Obviously the driver deployer should not crash and should complete as expected =] Since I don't actually know what it does, I'm not sure what that means, really.

Log

2021/05/03 18:27:36 proto: duplicate proto type registered: VersionResponse
W0503 18:27:36.570867       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2021-05-03T18:27:36Z" level=debug msg="Deploying CSI driver"
time="2021-05-03T18:27:36Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-03T18:27:37Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-03T18:27:38Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-05-03T18:27:39Z" level=info msg="Proc found: kubelet"
time="2021-05-03T18:27:39Z" level=info msg="Try to find arg [--root-dir] in cmdline: [/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock ]"
time="2021-05-03T18:27:39Z" level=warning msg="Cmdline of proc kubelet found: \"/usr/bin/kubelet\x00--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf\x00--kubeconfig=/etc/kubernetes/kubelet.conf\x00--config=/var/lib/kubelet/config.yaml\x00--container-runtime=remote\x00--container-runtime-endpoint=/run/containerd/containerd.sock\x00\". But arg \"--root-dir\" not found. Hence default value will be used: \"/var/lib/kubelet\""
time="2021-05-03T18:27:39Z" level=info msg="Detected root dir path: /var/lib/kubelet"
time="2021-05-03T18:27:39Z" level=info msg="Upgrading Longhorn related components for CSI v1.1.0"
time="2021-05-03T18:27:39Z" level=debug msg="Deleting existing CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Deleted CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Waiting for foreground deletion of CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Deleted CSI Driver driver.longhorn.io in foreground"
time="2021-05-03T18:27:39Z" level=debug msg="Creating CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Created CSI Driver driver.longhorn.io"
time="2021-05-03T18:27:39Z" level=debug msg="Waiting for foreground deletion of service csi-attacher"
time="2021-05-03T18:29:40Z" level=fatal msg="Error deploying driver: failed to start CSI driver: failed to deploy service csi-attacher: failed to cleanup service csi-attacher: Foreground deletion of service csi-attacher timed out"

Environment:

  • Longhorn version: v1.1.1 (coming from v1.1.1-rc1)
  • Installation method: kubectl
  • Kubernetes distro: kubeadm, v1.21.0
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 4 (plus the 3 management nodes, which are also worker nodes)
  • Node config
    • OS type and version: Ubuntu 20.04.2 LTS
    • CPU per node: varies
    • Memory per node: varies
    • Disk type (e.g. SSD/NVMe): SSD/NVMe, some magnetic in RAID 0 (which work well, btw)
    • Network bandwidth between the nodes: 10GbE
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: ~15

Additional context

Not sure what else would be useful; I am available in the slack channel for discussion if that would help. I have things finally stable now, but would really like to have that fixed as I don't know what issues it will cause.

taxilian avatar May 03 '21 18:05 taxilian

From the log, it seems like it can't reach the Service csi-attacher.

  • Are you able to use kubectl to check that the Service csi-attacher exists?
  • Moreover, is the longhorn-manager Pod able to access the Service csi-attacher? For example, log in to one of the longhorn-manager Pods and run nslookup csi-attacher. I'm worried that your Kubernetes cluster network configuration is not correct.

jenting avatar May 04 '21 02:05 jenting

$ k -n longhorn-system get pods -l app=csi-attacher
NAME                            READY   STATUS    RESTARTS   AGE
csi-attacher-869cccc7c9-9mn7l   1/1     Running   0          46h
csi-attacher-869cccc7c9-hgfnx   1/1     Running   0          46h
csi-attacher-869cccc7c9-zx6lr   1/1     Running   0          46h
$ kubectl -n longhorn-system exec -it longhorn-manager-7x9z2 -- bash
root@longhorn-manager-7x9z2:/# nslookup csi-attacher
Server:		10.96.0.10
Address:	10.96.0.10#53

Name:	csi-attacher.longhorn-system.svc.cluster.local
Address: 10.109.227.49

root@longhorn-manager-7x9z2:/# curl http://csi-attacher:12345
curl: (7) Failed to connect to csi-attacher port 12345: Connection refused

.. of course, the service names that port "dummy", so I'm guessing the connection refused doesn't actually mean anything?

More to the point on the DNS, though, the failing pod can also access it (if I catch it before it fails):

$ kubectl -n longhorn-system exec -it longhorn-driver-deployer-6c945db7f6-mrkgq -- bash
root@longhorn-driver-deployer-6c945db7f6-mrkgq:/# nslookup csi-attacher
Server:		10.96.0.10
Address:	10.96.0.10#53

Name:	csi-attacher.longhorn-system.svc.cluster.local
Address: 10.109.227.49

taxilian avatar May 04 '21 18:05 taxilian

Could you please share the YAML manifest of the Service csi-attacher? I'd like to check whether the deletionTimestamp metadata is set. Refer to:

  • https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#foreground-cascading-deletion
  • https://github.com/longhorn/longhorn-manager/blob/v1.1.1/csi/deployment_util.go#L261-L263

Once the "deletion in progress" state is set, the garbage collector deletes the object's dependents. Once the garbage collector has deleted all "blocking" dependents (objects with ownerReference.blockOwnerDeletion=true), it deletes the owner object.

I'm worried that a dependent object can't be deleted within 120 seconds, so the timeout is triggered.
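One way to check this from the command line (a sketch, assuming kubectl access; resource kinds to inspect are a guess since we don't yet know which dependent is blocking):

```shell
# Show whether the Service is stuck mid-delete: a non-empty
# deletionTimestamp plus a foregroundDeletion finalizer means the
# garbage collector is still waiting on blocking dependents.
kubectl -n longhorn-system get service csi-attacher \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

# Look for objects whose ownerReferences point back at the Service;
# dependents with blockOwnerDeletion=true hold up the owner's deletion.
kubectl -n longhorn-system get endpoints,endpointslices \
  -o custom-columns='KIND:.kind,NAME:.metadata.name,OWNERS:.metadata.ownerReferences[*].name'
```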

jenting avatar May 05 '21 02:05 jenting

apiVersion: v1
kind: Service
metadata:
  annotations:
    driver.longhorn.io/kubernetes-version: v1.20.5
    driver.longhorn.io/version: v1.1.1
  creationTimestamp: "2021-05-02T09:27:02Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-05-02T10:55:18Z"
  finalizers:
  - foregroundDeletion
  labels:
    app: csi-attacher
    longhorn.io/managed-by: longhorn-manager
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:driver.longhorn.io/kubernetes-version: {}
          f:driver.longhorn.io/version: {}
        f:labels:
          .: {}
          f:app: {}
          f:longhorn.io/managed-by: {}
      f:spec:
        f:ports:
          .: {}
          k:{"port":12345,"protocol":"TCP"}:
            .: {}
            f:name: {}
            f:port: {}
            f:protocol: {}
            f:targetPort: {}
        f:selector:
          .: {}
          f:app: {}
        f:sessionAffinity: {}
        f:type: {}
    manager: longhorn-manager
    operation: Update
    time: "2021-05-02T09:27:01Z"
  name: csi-attacher
  namespace: longhorn-system
  resourceVersion: "108540171"
  uid: f5ad7a41-8e0f-4e7f-a3d8-7b2aeb8d043b
spec:
  clusterIP: 10.109.227.49
  clusterIPs:
  - 10.109.227.49
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: dummy
    port: 12345
    protocol: TCP
    targetPort: 12345
  selector:
    app: csi-attacher
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

taxilian avatar May 05 '21 04:05 taxilian

I removed the deletionTimestamp and finalizers; it seems to have made it further, so we'll see if it finishes this time =]
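(For anyone else in this state: deletionTimestamp is set by the API server and can't be edited directly, so clearing the finalizer list is what actually lets the pending delete complete. One possible way, not an official procedure:)

```shell
# Remove the foregroundDeletion finalizer so the API server can
# finish the pending delete; once the finalizer list is empty the
# object is removed and the driver deployer can recreate it.
kubectl -n longhorn-system patch service csi-attacher --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```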

taxilian avatar May 05 '21 04:05 taxilian

It didn't finish and is now back doing the same as it originally did; unfortunately I didn't catch the logs before the container restarted, so I'll need to try it again when I can keep my terminal open until it dies.

taxilian avatar May 06 '21 06:05 taxilian

richard@nebrask:~$ k -n longhorn-system logs -f longhorn-driver-deployer-6c945db7f6-gtf2g
2021/05/06 16:09:09 proto: duplicate proto type registered: VersionResponse
W0506 16:09:09.454949       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2021-05-06T16:09:09Z" level=debug msg="Deploying CSI driver"
time="2021-05-06T16:09:09Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-06T16:09:10Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-05-06T16:09:11Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-05-06T16:09:12Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-05-06T16:09:13Z" level=info msg="Proc found: kubelet"
time="2021-05-06T16:09:13Z" level=info msg="Try to find arg [--root-dir] in cmdline: [/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock ]"
time="2021-05-06T16:09:13Z" level=warning msg="Cmdline of proc kubelet found: \"/usr/bin/kubelet\x00--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf\x00--kubeconfig=/etc/kubernetes/kubelet.conf\x00--config=/var/lib/kubelet/config.yaml\x00--container-runtime=remote\x00--container-runtime-endpoint=/run/containerd/containerd.sock\x00\". But arg \"--root-dir\" not found. Hence default value will be used: \"/var/lib/kubelet\""
time="2021-05-06T16:09:13Z" level=info msg="Detected root dir path: /var/lib/kubelet"
time="2021-05-06T16:09:13Z" level=info msg="Upgrading Longhorn related components for CSI v1.1.0"
time="2021-05-06T16:09:13Z" level=debug msg="Deleting existing CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Deleted CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Waiting for foreground deletion of CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Deleted CSI Driver driver.longhorn.io in foreground"
time="2021-05-06T16:09:13Z" level=debug msg="Creating CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Created CSI Driver driver.longhorn.io"
time="2021-05-06T16:09:13Z" level=debug msg="Deleting existing service csi-attacher"
time="2021-05-06T16:09:13Z" level=debug msg="Deleted service csi-attacher"
time="2021-05-06T16:09:13Z" level=debug msg="Waiting for foreground deletion of service csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Deleted service csi-attacher in foreground"
time="2021-05-06T16:09:53Z" level=debug msg="Creating service csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Created service csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Deleting existing deployment csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Deleted deployment csi-attacher"
time="2021-05-06T16:09:53Z" level=debug msg="Waiting for foreground deletion of deployment csi-attacher"
time="2021-05-06T16:10:05Z" level=debug msg="Deleted deployment csi-attacher in foreground"
time="2021-05-06T16:10:05Z" level=debug msg="Creating deployment csi-attacher"
time="2021-05-06T16:10:05Z" level=debug msg="Created deployment csi-attacher"
time="2021-05-06T16:10:05Z" level=debug msg="Deleting existing service csi-provisioner"
time="2021-05-06T16:10:05Z" level=debug msg="Deleted service csi-provisioner"
time="2021-05-06T16:10:05Z" level=debug msg="Waiting for foreground deletion of service csi-provisioner"
time="2021-05-06T16:12:06Z" level=fatal msg="Error deploying driver: failed to start CSI driver: failed to deploy service csi-provisioner: failed to cleanup service csi-provisioner: Foreground deletion of service csi-provisioner timed out"

taxilian avatar May 06 '21 16:05 taxilian

I removed the deletionTimestamp and finalizers; it seems to have made it further, so we'll see if it finishes this time =]

It's strange 🤔 You mean that even after you removed the deletionTimestamp and finalizers, the error still exists?

Is there another controller in the Kubernetes cluster that would set the deletionTimestamp and finalizers for the Service csi-provisioner?

jenting avatar May 07 '21 06:05 jenting

I removed those, then let the longhorn-driver-deployer pod run again and it gave that error, presumably putting the deletionTimestamp and finalizers back.

Is there something I can do to "kick" it while it's "Waiting for foreground deletion of service csi-provisioner"? I tried actually deleting the csi-provisioner service, but that didn't help -- and I didn't see it get recreated, so I put it back.

taxilian avatar May 07 '21 14:05 taxilian


I saw in the log above that the service csi-attacher was finally deleted. But the next error is for csi-provisioner; could you please do the same operation as you did on csi-attacher?

jenting avatar May 10 '21 05:05 jenting

I did that and just keep getting the same thing for csi-provisioner

taxilian avatar May 10 '21 18:05 taxilian

I did an upgrade test on microk8s v1.21.0 (v1.21.0-3+121713cef81e03) with commands:

kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1-rc1/deploy/longhorn.yaml
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1/deploy/longhorn.yaml

However, I can't reproduce it. :thinking:

jenting avatar May 11 '21 06:05 jenting

cc @meldafrawi @khushboo-rancher

innobead avatar May 11 '21 07:05 innobead

I did another upgrade test on k3s v1.21.0 (v1.21.0+k3s1) with commands:

kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1-rc1/deploy/longhorn.yaml
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1/deploy/longhorn.yaml

Still can't reproduce it.

jenting avatar May 11 '21 08:05 jenting

Hello, I'm chasing a similar problem - please let me know if I should create a separate issue. Happy to generate support bundles etc, just give me guidance.

The issue emerged after upgrading from Longhorn v1.1.2 to v1.2.2. I had actually attempted to upgrade from v1.1.2 to v1.2.0 but encountered issues related to backups disappearing, so I rolled back and waited for the v1.2.2 release.

I am running k3s v1.21.4 on Ubuntu 20.04 nodes in a self-hosted environment.

No matter what I do - delete pods or redeploy yaml - I always end up in the same state.

ubuntu-admin@k3s-server-32:~/deployments$ k logs -f longhorn-driver-deployer-b8bcc7845-8g5cq -n longhorn-system
2021/11/16 19:07:15 proto: duplicate proto type registered: VersionResponse
W1116 19:07:15.830501       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2021-11-16T19:07:15Z" level=debug msg="Deploying CSI driver"
time="2021-11-16T19:07:16Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-11-16T19:07:17Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-11-16T19:07:18Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-11-16T19:07:19Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-11-16T19:07:20Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2021-11-16T19:07:21Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-11-16T19:07:22Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-11-16T19:07:23Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2021-11-16T19:07:24Z" level=warning msg="Proc not found: kubelet"
time="2021-11-16T19:07:24Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2021-11-16T19:07:25Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2021-11-16T19:07:26Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Pending"
time="2021-11-16T19:07:27Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2021-11-16T19:07:28Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2021-11-16T19:07:29Z" level=debug msg="proc cmdline detection pod discover-proc-k3s-cmdline in phase: Running"
time="2021-11-16T19:07:30Z" level=info msg="Proc found: k3s"
time="2021-11-16T19:07:30Z" level=info msg="Detected root dir path: /var/lib/kubelet"
time="2021-11-16T19:07:30Z" level=info msg="Upgrading Longhorn related components for CSI v1.1.0"
time="2021-11-16T19:07:30Z" level=debug msg="Deleting existing CSI Driver driver.longhorn.io"
time="2021-11-16T19:07:30Z" level=debug msg="Deleted CSI Driver driver.longhorn.io"
time="2021-11-16T19:07:30Z" level=debug msg="Waiting for foreground deletion of CSI Driver driver.longhorn.io"
time="2021-11-16T19:07:30Z" level=debug msg="Deleted CSI Driver driver.longhorn.io in foreground"
time="2021-11-16T19:07:30Z" level=debug msg="Creating CSI Driver driver.longhorn.io"
time="2021-11-16T19:07:30Z" level=debug msg="Created CSI Driver driver.longhorn.io"
time="2021-11-16T19:07:30Z" level=debug msg="Deleting existing service csi-attacher"
time="2021-11-16T19:07:30Z" level=debug msg="Deleted service csi-attacher"
time="2021-11-16T19:07:30Z" level=debug msg="Waiting for foreground deletion of service csi-attacher"
time="2021-11-16T19:07:31Z" level=debug msg="Deleted service csi-attacher in foreground"
time="2021-11-16T19:07:31Z" level=debug msg="Creating service csi-attacher"
time="2021-11-16T19:07:31Z" level=debug msg="Created service csi-attacher"
time="2021-11-16T19:07:31Z" level=debug msg="Deleting existing deployment csi-attacher"
time="2021-11-16T19:07:31Z" level=debug msg="Deleted deployment csi-attacher"
time="2021-11-16T19:07:31Z" level=debug msg="Waiting for foreground deletion of deployment csi-attacher"
time="2021-11-16T19:07:44Z" level=debug msg="Deleted deployment csi-attacher in foreground"
time="2021-11-16T19:07:44Z" level=debug msg="Creating deployment csi-attacher"
time="2021-11-16T19:07:44Z" level=debug msg="Created deployment csi-attacher"
time="2021-11-16T19:07:44Z" level=debug msg="Deleting existing service csi-provisioner"
time="2021-11-16T19:07:44Z" level=debug msg="Deleted service csi-provisioner"
time="2021-11-16T19:07:44Z" level=debug msg="Waiting for foreground deletion of service csi-provisioner"
time="2021-11-16T19:07:45Z" level=debug msg="Deleted service csi-provisioner in foreground"
time="2021-11-16T19:07:45Z" level=debug msg="Creating service csi-provisioner"
time="2021-11-16T19:07:46Z" level=debug msg="Created service csi-provisioner"
time="2021-11-16T19:07:46Z" level=debug msg="Deleting existing deployment csi-provisioner"
time="2021-11-16T19:07:46Z" level=debug msg="Deleted deployment csi-provisioner"
time="2021-11-16T19:07:46Z" level=debug msg="Waiting for foreground deletion of deployment csi-provisioner"
time="2021-11-16T19:07:55Z" level=debug msg="Deleted deployment csi-provisioner in foreground"
time="2021-11-16T19:07:55Z" level=debug msg="Creating deployment csi-provisioner"
time="2021-11-16T19:07:55Z" level=debug msg="Created deployment csi-provisioner"
time="2021-11-16T19:07:55Z" level=debug msg="Deleting existing service csi-resizer"
time="2021-11-16T19:07:55Z" level=debug msg="Deleted service csi-resizer"
time="2021-11-16T19:07:55Z" level=debug msg="Waiting for foreground deletion of service csi-resizer"
time="2021-11-16T19:07:56Z" level=debug msg="Deleted service csi-resizer in foreground"
time="2021-11-16T19:07:56Z" level=debug msg="Creating service csi-resizer"
time="2021-11-16T19:07:56Z" level=debug msg="Created service csi-resizer"
time="2021-11-16T19:07:56Z" level=debug msg="Deleting existing deployment csi-resizer"
time="2021-11-16T19:07:56Z" level=debug msg="Deleted deployment csi-resizer"
time="2021-11-16T19:07:56Z" level=debug msg="Waiting for foreground deletion of deployment csi-resizer"
time="2021-11-16T19:08:05Z" level=debug msg="Deleted deployment csi-resizer in foreground"
time="2021-11-16T19:08:05Z" level=debug msg="Creating deployment csi-resizer"
time="2021-11-16T19:08:05Z" level=debug msg="Created deployment csi-resizer"
time="2021-11-16T19:08:05Z" level=debug msg="Deleting existing service csi-snapshotter"
time="2021-11-16T19:08:05Z" level=debug msg="Deleted service csi-snapshotter"
time="2021-11-16T19:08:05Z" level=debug msg="Waiting for foreground deletion of service csi-snapshotter"
time="2021-11-16T19:08:05Z" level=debug msg="Deleted service csi-snapshotter in foreground"
time="2021-11-16T19:08:05Z" level=debug msg="Creating service csi-snapshotter"
time="2021-11-16T19:08:05Z" level=debug msg="Created service csi-snapshotter"
time="2021-11-16T19:08:05Z" level=debug msg="Waiting for foreground deletion of deployment csi-snapshotter"
time="2021-11-16T19:10:07Z" level=fatal msg="Error deploying driver: failed to start CSI driver: failed to deploy deployment csi-snapshotter: failed to cleanup deployment csi-snapshotter: Foreground deletion of deployment csi-snapshotter timed out"

riazarbi avatar Nov 16 '21 19:11 riazarbi

Upgraded my cluster to k3s v1.21.5 and the deployment completed.

So I no longer have an issue, but leaving my comment above for posterity.

riazarbi avatar Nov 16 '21 21:11 riazarbi

Thank you @riazarbi. It looks like a Kubernetes version issue?

@taxilian Did you fix this issue in the end?

jenting avatar Nov 17 '21 05:11 jenting

Ultimately I had too many issues with Longhorn; there's a lot I like about it, but it wasn't reliable enough for my needs, so I've switched to Rook/Ceph.

taxilian avatar Nov 17 '21 15:11 taxilian

@taxilian Appreciate your feedback! It is sad to see you are leaving. Reliability is our top priority and we will continue working toward this goal.

PhanLe1010 avatar Nov 17 '21 23:11 PhanLe1010

I encountered a similar issue with v1.3.1 on k3s 1.24. It's possible that having FluxCD set to automatically upgrade the Helm chart broke something (I'd originally started with v1.2.3).

Eventually I deleted all of the CSI resources one by one using kubectl until the deployer managed to finish. I ended up having to delete the Deployments, ReplicaSets, and Pods of csi-attacher, csi-provisioner, csi-resizer, and csi-snapshotter individually, since the deletion wasn't cascading for some reason. Then I deleted the associated Services and finally the longhorn-csi-plugin DaemonSet and Pods (again, the cascading delete was somehow broken).
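The manual cleanup described above can be scripted roughly like this (a sketch, not an official procedure; it assumes the app=<name> labels seen earlier in this thread, and uses --cascade=orphan to sidestep the broken cascading delete before removing the leftovers by label):

```shell
# Delete each CSI component without relying on cascading deletion,
# then clean up what the cascade would normally have removed.
for name in csi-attacher csi-provisioner csi-resizer csi-snapshotter; do
  kubectl -n longhorn-system delete deployment "$name" --cascade=orphan --ignore-not-found
  kubectl -n longhorn-system delete replicaset,pod -l app="$name" --ignore-not-found
  kubectl -n longhorn-system delete service "$name" --ignore-not-found
done
kubectl -n longhorn-system delete daemonset longhorn-csi-plugin --ignore-not-found
kubectl -n longhorn-system delete pod -l app=longhorn-csi-plugin --ignore-not-found
```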

GJKrupa avatar Sep 07 '22 19:09 GJKrupa

cc @mantissahz Can we put this ticket in the community meeting tomorrow?

PhanLe1010 avatar Sep 07 '22 20:09 PhanLe1010