[BUG]: kubelet was restarted; this probably caused Longhorn issues, and nodes started misbehaving, probably due to broken Longhorn mounts
- kubelet was restarted
- this probably caused Longhorn issues
- nodes started misbehaving, probably due to broken Longhorn mounts
- `df -h` was hanging on those nodes
- inspecting the output of `mount`, we found lines starting with `172.28.130.22:/pvc-f67505ea-de07-4627-873c-6b5605911f00 ...`
- this IP was the Service IP of the share-manager-pvc Pod for that PVC (RWX)
- the Pod was failing with a message that the volume could not be mounted
- we followed the procedure to stop all Pods using that PVC/volume; Terminated Pods also had to be stopped
- we then tried to attach the volume via the UI, but that did not work
- we then started the Pods again, and as soon as the first Pod mounted the PVC, all the remaining failing Longhorn Pods suddenly became healthy again
- in addition, we saw a Pod event error that the CSI driver was not available on some nodes; a restart of the CSI plugin DaemonSet fixed that (the recovery steps are sketched after this list)
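Roughly, the recovery described above looks like this (a sketch, not an official procedure; `my-rwx-claim` and `<pod-name>` are hypothetical placeholders, and the Longhorn namespace may be `longhorn-system` or `longhorn` depending on the install):

```bash
# 1. Find every Pod (including Terminated ones) still referencing the RWX claim.
kubectl get pods -A -o json | jq -r '.items[]
  | select(.spec.volumes[]?.persistentVolumeClaim.claimName == "my-rwx-claim")
  | .metadata.namespace + "/" + .metadata.name'

# 2. Stop those Pods: scale the owning workloads down, or force-delete Pods
#    stuck in Terminating.
kubectl -n datp-prod delete pod <pod-name> --grace-period=0 --force

# 3. If Pods report events that the CSI driver is unavailable, restart the
#    CSI plugin DaemonSet.
kubectl -n longhorn-system rollout restart daemonset/longhorn-csi-plugin

# 4. Start the workload Pods again; as observed above, the first successful
#    mount brought the remaining Pods back to healthy.
```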
r.longhorn.io" > " time="2022-04-29T10:59:08Z" level=error msg="GRPC error: rpc error: code = Aborted desc = The volume pvc-f67505ea-de07-4627-873c-6b5605911f00 share should be available before the mount" time="2022-04-29T11:01:10Z" level=info msg="GRPC call: /csi.v1.Node/NodePublishVolume request: {"target_path":"/var/lib/kubelet/pods/ba8f1931-f342-464d-bb4c-36fdead251a2/volumes/kubernetes.io~csi/pvc-f67505ea-de07-4627-873c-6b5605911f00/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":5}},"volume_context":{"baseImage":"","csi.storage.k8s.io/ephemeral":"false","csi.storage.k8s.io/pod.name":"pdfgen-preprocessor-datp-app-5ddcfbd845-98rwl","csi.storage.k8s.io/pod.namespace":"datp-prod","csi.storage.k8s.io/pod.uid":"ba8f1931-f342-464d-bb4c-36fdead251a2","csi.storage.k8s.io/serviceAccount.name":"datp-spring-k8s","fromBackup":"","numberOfReplicas":"3","share":"true","staleReplicaTimeout":"30","storage.kubernetes.io/csiProvisionerIdentity":"1635437984929-8081-driver.longhorn.io"},"volume_id":"pvc-f67505ea-de07-4627-873c-6b5605911f00"}" time="2022-04-29T11:01:10Z" level=info msg="NodeServer NodePublishVolume req: volume_id:"pvc-f67505ea-de07-4627-873c-6b5605911f00" target_path:"/var/lib/kubelet/pods/ba8f1931-f342-464d-bb4c-36fdead251a2/volumes/kubernetes.io~csi/pvc-f67505ea-de07-4627-873c-6b5605911f00/mount" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"baseImage" value:"" > volume_context:<key:"csi.storage.k8s.io/ephemeral" value:"false" > volume_context:<key:"csi.storage.k8s.io/pod.name" value:"pdfgen-preprocessor-datp-app-5ddcfbd845-98rwl" > volume_context:<key:"csi.storage.k8s.io/pod.namespace" value:"datp-prod" > volume_context:<key:"csi.storage.k8s.io/pod.uid" value:"ba8f1931-f342-464d-bb4c-36fdead251a2" > volume_context:<key:"csi.storage.k8s.io/serviceAccount.name" value:"datp-spring-k8s" > volume_context:<key:"fromBackup" value:"" > volume_context:<key:"numberOfReplicas" value:"3" > volume_context:<key:"share" value:"true" > volume_context:<key:"staleReplicaTimeout" value:"30" > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1635437984929-8081-driver.longhorn.io" > " time="2022-04-29T11:01:10Z" level=error msg="GRPC error: rpc error: code = Aborted desc = The volume pvc-f67505ea-de07-4627-873c-6b5605911f00 share should be available before the mount"
```
F5153JV@LFDC5RNN3D3 MINGW64 /c/kubernetes
$ kubectl logs longhorn-csi-plugin-jhxdp -c longhorn-csi-plugin -n longhorn
2022/05/07 03:18:46 proto: duplicate proto type registered: VersionResponse
time="2022-05-07T03:18:46Z" level=info msg="CSI Driver: driver.longhorn.io version: v1.2.2, manager URL http://longhorn-backend:9500/v1"
time="2022-05-07T03:18:46Z" level=fatal msg="Error starting CSI manager: Failed to initialize Longhorn API client: Get "http://longhorn-backend:9500/v1": dial tcp 172.26.207.18:9500: connect: no route to host"
```
@innobead if I upgrade Longhorn to 1.2.4, will the kubelet restart and PDB issues be fixed? I have faced these issues in production as well.
@sharanbabumg
In Longhorn v1.2.4, when the kubelet restarts, Longhorn doesn't kill the instance-manager-xxx pods. This should in theory help your case. Please let us know if you have feedback after upgrading to Longhorn v1.2.4.
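One way to verify this behavior after upgrading is a sketch like the following (assuming the default `longhorn-system` namespace and SSH access to a worker node):

```bash
# Record the instance-manager pods and the nodes they run on before the test.
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# On one worker node, restart the kubelet (unit name can differ per distro).
sudo systemctl restart kubelet

# On v1.2.4 the instance-manager pods on that node should remain Running
# with the same age, instead of being killed and recreated.
kubectl -n longhorn-system get pods -o wide | grep instance-manager
```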
@PhanLe1010 @innobead I have upgraded Longhorn on the dev and UAT on-prem clusters, but I still see the rpc error code issue; to work around it I again had to restart longhorn-csi-plugin.
Is there a permanent solution for this issue?
I don't want to take the risk of upgrading Longhorn on the production clusters. Can I get an update ASAP?
@sharanbabumg It is weird. Can you provide us:

- Reproduce steps
- Your env details:
  - Longhorn version:
  - Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  - Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
  - Number of management nodes in the cluster:
  - Number of worker nodes in the cluster:
  - Node config
    - OS type and version:
    - CPU per node:
    - Memory per node:
    - Disk type (e.g. SSD/NVMe):
    - Network bandwidth between the nodes:
  - Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  - Number of Longhorn volumes in the cluster:
- Reproduce the problem. When it is happening, take a support bundle and send it to us at [email protected]
- exec into one of the `longhorn-csi-plugin-xxx` pods (container `longhorn-csi-plugin`) and do a `curl http://longhorn-backend:9500/v1`
- exec into the `longhorn-driver-deployer-xxx` pod and do a `curl http://longhorn-backend:9500/v1` (concrete commands are sketched after this list)
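For example, the exec checks could look like this (a sketch; the pod names are placeholders to be read from `kubectl get pods`, and the namespace may be `longhorn-system` or `longhorn` depending on the install):

```bash
# Locate the CSI plugin pod on the affected node (app=longhorn-csi-plugin is
# the label used by the stock Longhorn manifests).
kubectl -n longhorn-system get pods -l app=longhorn-csi-plugin -o wide

# Check connectivity from the CSI plugin container to the Longhorn backend.
kubectl -n longhorn-system exec -it longhorn-csi-plugin-xxxxx \
  -c longhorn-csi-plugin -- curl http://longhorn-backend:9500/v1

# Repeat the same check from the driver deployer pod.
kubectl -n longhorn-system exec -it longhorn-driver-deployer-xxxxx -- \
  curl http://longhorn-backend:9500/v1
```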
Note: we might want to take a look at this project to allow the kubelet to restart the Longhorn CSI plugin.
@PhanLe1010 please find the requested details:

- Longhorn version: 1.2.4
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): argocd deployment
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: on-prem kubernetes
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 17
- Node config
  - OS type and version: some servers are Oracle Linux 7.9 and some are 8.5

    ```
    NAME="Oracle Linux Server" VERSION="7.9" ID="ol" ID_LIKE="fedora" VARIANT="Server" VARIANT_ID="server" VERSION_ID="7.9" PRETTY_NAME="Oracle Linux Server 7.9" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:oracle:linux:7:9:server" HOME_URL="https://linux.oracle.com/" BUG_REPORT_URL="https://bugzilla.oracle.com/" ORACLE_BUGZILLA_PRODUCT="Oracle Linux 7" ORACLE_BUGZILLA_PRODUCT_VERSION=7.9 ORACLE_SUPPORT_PRODUCT="Oracle Linux" ORACLE_SUPPORT_PRODUCT_VERSION=7.9
    NAME="Oracle Linux Server" VERSION="8.5" ID="ol" ID_LIKE="fedora" VARIANT="Server" VARIANT_ID="server" VERSION_ID="8.5" PLATFORM_ID="platform:el8" PRETTY_NAME="Oracle Linux Server 8.5" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:oracle:linux:8:5:server" HOME_URL="https://linux.oracle.com/" BUG_REPORT_URL="https://bugzilla.oracle.com/" ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8" ORACLE_BUGZILLA_PRODUCT_VERSION=8.5 ORACLE_SUPPORT_PRODUCT="Oracle Linux" ORACLE_SUPPORT_PRODUCT_VERSION=8.5
    ```

  - CPU per node: 8/10/12 cores
  - Memory per node:

    ```
    MemTotal: 74063632 kB  MemFree: 33471788 kB
    MemTotal: 98836068 kB  MemFree: 63611904 kB
    MemTotal: 78193264 kB  MemFree: 12078664 kB
    ```

  - Disk type (e.g. SSD/NVMe): SSD
  - Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMware
- Number of Longhorn volumes in the cluster: 38
cc @shuo-wu
The fix for the kubelet restart (#2650) is mainly for RWO volumes.
As for RWX volumes, there is another ticket tracking that, which may be related to this issue: https://github.com/longhorn/longhorn/issues/3612
cc @derekbit
@innobead @shuo-wu Can I get an update?
@innobead @shuo-wu Can I get an update on this request? Also, is only the iscsi/nfs-client package required on the nodes, or do the nfs and iscsi services also need to be running? And one more point: do the iscsi-installation and nfs-installation DaemonSet pods also need to be running on the cluster?
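For reference, a sketch of how the iscsi/NFS prerequisites are usually verified on a node; the environment_check.sh path is taken from the longhorn repo and should be double-checked against your version:

```bash
# open-iscsi: the iscsid daemon must be installed and running on every node.
sudo systemctl status iscsid

# NFSv4 client: needed on every node that mounts RWX volumes
# (nfs-utils on the Oracle Linux / RHEL family, nfs-common on Debian family).
rpm -q nfs-utils

# Longhorn also ships an environment check script covering these checks:
curl -sSfL https://raw.githubusercontent.com/longhorn/longhorn/v1.2.4/scripts/environment_check.sh | bash
```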
This one may be resolved once HA RWX is launched.
Alternatively, we can check whether a kubelet restart brings the share-manager pods down, and then try to avoid that.
cc @derekbit
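A sketch of that check (assuming SSH access to a node hosting a share-manager pod and the default `longhorn-system` namespace):

```bash
# Terminal 1: watch the share-manager pod for the RWX volume
# (pods are named share-manager-<pv-name>).
kubectl -n longhorn-system get pods -w | grep share-manager

# Terminal 2: on the node hosting that pod, restart the kubelet and observe
# whether the share-manager pod is killed or restarted.
sudo systemctl restart kubelet
```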