
[BUG]: kubelet was restarted this probably caused Longhorn issues and nodes started misbehaving probably due to broken longhorn mounts

Open sharanbabumg opened this issue 2 years ago • 13 comments

kubelet was restarted

this probably caused Longhorn issues

nodes started misbehaving probably due to broken longhorn mounts

df -h was hanging on those nodes

inspecting the output of mount we found lines starting with

172.28.130.22:/pvc-f67505ea-de07-4627-873c-6b5605911f00 ...

this IP was the Service IP of the share-manager-pvc Pod for that PVC (RWX)
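A minimal sketch of how such lines could be picked out of the `mount` output automatically. It assumes the Longhorn RWX mounts follow the usual `<ip>:/<volume> on <target> ...` layout; only the start of the line is quoted above, so the rest of the pattern and the function name `find_share_mounts` are assumptions for illustration.

```python
import re
from typing import List, NamedTuple

# Assumption: an RWX mount appears as "<service-ip>:/<pvc-name> on <target> ...",
# consistent with the line quoted above (whose tail is not shown in the report).
MOUNT_RE = re.compile(
    r"^(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):/(?P<volume>pvc-[0-9a-f-]+)\s+on\s+(?P<target>\S+)"
)


class ShareMount(NamedTuple):
    ip: str       # Service IP of the share-manager Pod
    volume: str   # Longhorn volume name (pvc-...)
    target: str   # mount point on the node


def find_share_mounts(mount_output: str) -> List[ShareMount]:
    """Return every mount line that points at a share-manager NFS export."""
    mounts = []
    for line in mount_output.splitlines():
        match = MOUNT_RE.match(line.strip())
        if match:
            mounts.append(ShareMount(match["ip"], match["volume"], match["target"]))
    return mounts
```

Feeding it the text of `mount` (or `/proc/mounts` converted to the same layout) gives the list of volumes and mount points to compare against the Service IPs of the share-manager Pods.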

the pod was failing with a message that the volume can't be mounted

we followed procedure to stop all Pods using that PVC/volume, also Terminated Pods needed to be stopped
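To find every Pod that has to be stopped, one option is to scan the Pod list for references to the claim. The sketch below assumes the input is the JSON printed by `kubectl get pods -n <namespace> -o json` for a single namespace; the function name and the claim name passed in are hypothetical. It deliberately ignores the Pod phase, so completed or failed Pods that still reference the claim are listed too.

```python
import json
from typing import List


def pods_using_pvc(pod_list_json: str, claim_name: str) -> List[str]:
    """Return "namespace/name" for every Pod whose spec references the claim.

    Assumption: pod_list_json is the output of
    `kubectl get pods -n <namespace> -o json` (a PodList for one namespace).
    """
    pods = json.loads(pod_list_json).get("items", [])
    users = []
    for pod in pods:
        for volume in pod.get("spec", {}).get("volumes", []):
            pvc = volume.get("persistentVolumeClaim")
            if pvc and pvc.get("claimName") == claim_name:
                meta = pod["metadata"]
                users.append(f'{meta["namespace"]}/{meta["name"]}')
                break  # one match per Pod is enough
    return users
```

Note that the claim name is the name of the PersistentVolumeClaim object, not the `pvc-...` volume name that appears in the mount output.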

we then tried to Attach the volume via the UI but it was not working

we then started the Pods again, and as soon as the first Pod mounted the PVC, all the remaining Longhorn Pods which had been failing were suddenly healthy again

in addition we faced a Pod event error that the CSI driver was not available on some nodes; a restart of the CSI plugin DaemonSet fixed that
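For reference, the DaemonSet restart can be scripted roughly as follows. This is a sketch under assumptions: the namespace `longhorn` is taken from the `kubectl logs` command further down, and the DaemonSet name `longhorn-csi-plugin` is inferred from the Pod name `longhorn-csi-plugin-jhxdp`; both may differ in other installations. The helper names are made up for illustration.

```python
import subprocess
from typing import List


def csi_plugin_restart_cmd(namespace: str = "longhorn",
                           daemonset: str = "longhorn-csi-plugin") -> List[str]:
    """Build the kubectl command that triggers a rolling restart.

    Assumption: namespace and DaemonSet name match this cluster's install.
    """
    return ["kubectl", "rollout", "restart", f"daemonset/{daemonset}", "-n", namespace]


def restart_csi_plugin(dry_run: bool = True) -> List[str]:
    """Return the command; execute it only when dry_run is False."""
    cmd = csi_plugin_restart_cmd()
    if not dry_run:
        # Raises CalledProcessError if kubectl exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd
```

The dry-run default keeps the helper safe to call while checking which command would be issued.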

r.longhorn.io" > "
time="2022-04-29T10:59:08Z" level=error msg="GRPC error: rpc error: code = Aborted desc = The volume pvc-f67505ea-de07-4627-873c-6b5605911f00 share should be available before the mount"
time="2022-04-29T11:01:10Z" level=info msg="GRPC call: /csi.v1.Node/NodePublishVolume request: {"target_path":"/var/lib/kubelet/pods/ba8f1931-f342-464d-bb4c-36fdead251a2/volumes/kubernetes.io~csi/pvc-f67505ea-de07-4627-873c-6b5605911f00/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":5}},"volume_context":{"baseImage":"","csi.storage.k8s.io/ephemeral":"false","csi.storage.k8s.io/pod.name":"pdfgen-preprocessor-datp-app-5ddcfbd845-98rwl","csi.storage.k8s.io/pod.namespace":"datp-prod","csi.storage.k8s.io/pod.uid":"ba8f1931-f342-464d-bb4c-36fdead251a2","csi.storage.k8s.io/serviceAccount.name":"datp-spring-k8s","fromBackup":"","numberOfReplicas":"3","share":"true","staleReplicaTimeout":"30","storage.kubernetes.io/csiProvisionerIdentity":"1635437984929-8081-driver.longhorn.io"},"volume_id":"pvc-f67505ea-de07-4627-873c-6b5605911f00"}"
time="2022-04-29T11:01:10Z" level=info msg="NodeServer NodePublishVolume req: volume_id:"pvc-f67505ea-de07-4627-873c-6b5605911f00" target_path:"/var/lib/kubelet/pods/ba8f1931-f342-464d-bb4c-36fdead251a2/volumes/kubernetes.io~csi/pvc-f67505ea-de07-4627-873c-6b5605911f00/mount" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"baseImage" value:"" > volume_context:<key:"csi.storage.k8s.io/ephemeral" value:"false" > volume_context:<key:"csi.storage.k8s.io/pod.name" value:"pdfgen-preprocessor-datp-app-5ddcfbd845-98rwl" > volume_context:<key:"csi.storage.k8s.io/pod.namespace" value:"datp-prod" > volume_context:<key:"csi.storage.k8s.io/pod.uid" value:"ba8f1931-f342-464d-bb4c-36fdead251a2" > volume_context:<key:"csi.storage.k8s.io/serviceAccount.name" value:"datp-spring-k8s" > volume_context:<key:"fromBackup" value:"" > volume_context:<key:"numberOfReplicas" value:"3" > volume_context:<key:"share" value:"true" > volume_context:<key:"staleReplicaTimeout" value:"30" > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1635437984929-8081-driver.longhorn.io" > "
time="2022-04-29T11:01:10Z" level=error msg="GRPC error: rpc error: code = Aborted desc = The volume pvc-f67505ea-de07-4627-873c-6b5605911f00 share should be available before the mount"
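When many Pods are affected, it may help to summarize which volumes the CSI plugin is waiting on. The sketch below only matches the literal error text seen in the log above ("The volume ... share should be available before the mount"); the function name is an assumption, and other Longhorn versions may word the message differently.

```python
import re
from collections import Counter

# Matches the exact error message quoted in the csi-plugin log above.
SHARE_ERR_RE = re.compile(
    r"The volume (pvc-[0-9a-f-]+) share should be available before the mount"
)


def volumes_waiting_for_share(log_text: str) -> Counter:
    """Count, per volume, how often the 'share should be available' error occurs."""
    return Counter(SHARE_ERR_RE.findall(log_text))
```

A volume with a steadily growing count suggests that its share-manager Pod never became ready, which matches the behaviour described in this report.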

F5153JV@LFDC5RNN3D3 MINGW64 /c/kubernetes
$ kubectl logs longhorn-csi-plugin-jhxdp -c longhorn-csi-plugin -n longhorn
2022/05/07 03:18:46 proto: duplicate proto type registered: VersionResponse
time="2022-05-07T03:18:46Z" level=info msg="CSI Driver: driver.longhorn.io version: v1.2.2, manager URL http://longhorn-backend:9500/v1"
time="2022-05-07T03:18:46Z" level=fatal msg="Error starting CSI manager: Failed to initialize Longhorn API client: Get "http://longhorn-backend:9500/v1": dial tcp 172.26.207.18:9500: connect: no route to host"

sharanbabumg avatar May 10 '22 06:05 sharanbabumg

Have you tried v1.2.4? It includes some improvements for kubelet restart.

ref: #3644

innobead avatar May 10 '22 07:05 innobead

@innobead If I upgrade Longhorn to 1.2.4, will the kubelet restart and PDB issues be fixed? I am asking because I have faced these issues in production as well.

sharanbabumg avatar May 16 '22 08:05 sharanbabumg

@sharanbabumg

In Longhorn v1.2.4, when kubelet restarts, Longhorn doesn't kill the instance-manager-xxx pods. This should in theory help your case. Please let us know if you have feedback after upgrading to Longhorn v1.2.4.

PhanLe1010 avatar May 25 '22 21:05 PhanLe1010

@PhanLe1010 @innobead I have upgraded Longhorn in the dev and UAT on-prem clusters, but I still see the rpc error code issue. To work around it, I had to restart longhorn-csi-plugin again.

Is there a permanent fix for this issue?

I don't want to take the risk of upgrading Longhorn in the production clusters until this is clear. Can I get an update ASAP?

sharanbabumg avatar May 26 '22 14:05 sharanbabumg

@sharanbabumg It is weird. Can you provide us:


  • Reproduce steps
  • Your env details
     - Longhorn version:
     - Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
     - Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
       - Number of management node in the cluster:
       - Number of worker node in the cluster:
     - Node config
       - OS type and version:
       - CPU per node:
       - Memory per node:
       - Disk type(e.g. SSD/NVMe):
       - Network bandwidth between the nodes:
     - Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
     - Number of Longhorn volumes in the cluster:
    

  • Reproduce the problem
  • While the problem is happening, take a support bundle
  • Send it to us at [email protected]
  • Exec into the longhorn-csi-plugin container of one of the longhorn-csi-plugin-xxx pods and run curl http://longhorn-backend:9500/v1
  • Exec into the longhorn-driver-deployer-xxx pod and run curl http://longhorn-backend:9500/v1
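If curl is not available in a container, a rough equivalent of the connectivity check can be written in Python. This is only a sketch: it assumes a Python interpreter is present wherever it is run, the function name is hypothetical, and the URL is the one from the request above. Proxy settings are bypassed on purpose, since an in-cluster Service address should be reached directly.

```python
import urllib.error
import urllib.request
from typing import Tuple


def probe_backend(url: str = "http://longhorn-backend:9500/v1",
                  timeout: float = 5.0) -> Tuple[bool, str]:
    """Return (reachable, detail) for a plain HTTP GET against the URL."""
    # An empty ProxyHandler ignores http_proxy/https_proxy from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a success status.
        return True, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, connection refused, no route to host, timeout, ...
        return False, str(exc)
```

A `False` result with a "no route to host" detail would correspond to the fatal error shown in the longhorn-csi-plugin log earlier in this issue.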

PhanLe1010 avatar May 27 '22 21:05 PhanLe1010

Note: we might want to take a look at this project to allow kubelet to restart the longhorn csi plugin

PhanLe1010 avatar May 31 '22 23:05 PhanLe1010

@PhanLe1010 please find the details requested.

  • Longhorn version: 1.2.4

  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): argocd deployment

  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: on-prem kubernetes

    • Number of management node in the cluster: 3 nodes
    • Number of worker node in the cluster: 17 nodes
  • Node config

    • OS type and version: some servers run 7.9 and some run 8.5

    NAME="Oracle Linux Server"
    VERSION="7.9"
    ID="ol"
    ID_LIKE="fedora"
    VARIANT="Server"
    VARIANT_ID="server"
    VERSION_ID="7.9"
    PRETTY_NAME="Oracle Linux Server 7.9"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:oracle:linux:7:9:server"
    HOME_URL="https://linux.oracle.com/"
    BUG_REPORT_URL="https://bugzilla.oracle.com/"
    ORACLE_BUGZILLA_PRODUCT="Oracle Linux 7"
    ORACLE_BUGZILLA_PRODUCT_VERSION=7.9
    ORACLE_SUPPORT_PRODUCT="Oracle Linux"
    ORACLE_SUPPORT_PRODUCT_VERSION=7.9

    NAME="Oracle Linux Server"
    VERSION="8.5"
    ID="ol"
    ID_LIKE="fedora"
    VARIANT="Server"
    VARIANT_ID="server"
    VERSION_ID="8.5"
    PLATFORM_ID="platform:el8"
    PRETTY_NAME="Oracle Linux Server 8.5"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:oracle:linux:8:5:server"
    HOME_URL="https://linux.oracle.com/"
    BUG_REPORT_URL="https://bugzilla.oracle.com/"
    ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8"
    ORACLE_BUGZILLA_PRODUCT_VERSION=8.5
    ORACLE_SUPPORT_PRODUCT="Oracle Linux"
    ORACLE_SUPPORT_PRODUCT_VERSION=8.5

    • CPU per node: 8/10/12 cores
    • Memory per node:

    MemTotal: 74063632 kB MemFree: 33471788 kB

    MemTotal: 98836068 kB MemFree: 63611904 kB

    MemTotal: 78193264 kB MemFree: 12078664 kB

  • Disk type(e.g. SSD/NVMe): SSD

  • Network bandwidth between the nodes:

  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Vmware

  • Number of Longhorn volumes in the cluster: 38 volumes

sharanbabumg avatar Jun 06 '22 06:06 sharanbabumg

cc @shuo-wu

innobead avatar Jun 06 '22 06:06 innobead

The fix for the kubelet restart #2650 is mainly for RWO volumes.

As for RWX volumes, there is another ticket tracking them, which may be related to this issue: https://github.com/longhorn/longhorn/issues/3612

shuo-wu avatar Jun 07 '22 10:06 shuo-wu

cc @derekbit

innobead avatar Jun 07 '22 12:06 innobead

@innobead @shuo-wu Can I get an update?

sharanbabumg avatar Jun 10 '22 07:06 sharanbabumg

@innobead @shuo-wu Can I get an update on this request? Otherwise, could you clarify two points: are only the iscsi and nfs-client packages required, or do the NFS and iSCSI services also need to be running? And do the iSCSI installation and NFS installation DaemonSet pods also need to keep running on the cluster?

sharanbabumg avatar Aug 03 '22 11:08 sharanbabumg

This one may be resolved after launching HA RWX.

Or we can check whether a kubelet restart brings the share manager pods down, and if so, try to avoid that.

cc @derekbit

shuo-wu avatar Aug 09 '22 03:08 shuo-wu