k8s-rdma-shared-dev-plugin
Post deployment of k8s-rdma-shared-dev-plugin (v1.5.2), rdma pods are in CrashLoopBackOff state
I am trying to deploy the Kubernetes plugin for a RoCE NIC. For this, I deployed Kubernetes using the Kubespray playbook, and after deployment all pods are up and running.
Next, I tried to deploy the RoCE plugin, as part of which k8s-rdma-shared-dev-plugin is installed. However, after deployment the rdma pods are stuck in a CrashLoopBackOff state and the roce pods are in Pending state, as shown in the screenshot below:
I ran `kubectl logs <pod-name> -n <namespace>` to check the logs and found the error below:
I found a workaround for this and followed the steps below:
- Edit the rdma daemonset to add a volume mount for the pci.ids file.
Add the following under the volumeMounts section:
```yaml
- name: pci-ids
  mountPath: /usr/share/misc/pci.ids
  readOnly: true
```
Add the following under the volumes section:
```yaml
- name: pci-ids
  hostPath:
    path: /usr/share/misc/pci.ids
    type: File
```
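The two daemonset edits above can also be applied in one step with `kubectl patch`. A minimal sketch, assuming the daemonset is named `rdma-shared-dp-ds` in the `kube-system` namespace and the plugin container is at index 0 (all three are assumptions; verify against your own deployment):

```shell
# Build a JSON patch that appends the pci.ids hostPath volume and its
# container mount (container index 0 is an assumption; check your spec).
cat > pci-ids-patch.json <<'EOF'
[
  {"op": "add", "path": "/spec/template/spec/volumes/-",
   "value": {"name": "pci-ids",
             "hostPath": {"path": "/usr/share/misc/pci.ids", "type": "File"}}},
  {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-",
   "value": {"name": "pci-ids",
             "mountPath": "/usr/share/misc/pci.ids", "readOnly": true}}
]
EOF

# Sanity-check that the patch is valid JSON before applying it
python3 -m json.tool pci-ids-patch.json > /dev/null && echo "patch OK"

# Apply to the daemonset (name/namespace assumed; adjust as needed)
# kubectl -n kube-system patch daemonset rdma-shared-dp-ds \
#   --type=json --patch-file pci-ids-patch.json
```

The `kubectl patch` line is left commented here since it needs a live cluster; a JSON patch avoids hand-editing the daemonset and survives re-running without a merge conflict, though note the `add`/`-` operations append a duplicate entry if run twice.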
After this, the CrashLoopBackOff issue was resolved and all pods, including rdma and roce, came up as Running.
Is this a known issue, and is a fix available for it? If no fix is currently available, is the above workaround fine?
Environment Details:
- Kubespray: v2.27.0
- Kubernetes: v1.31.4
- rdma plugin: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin.git (v1.5.2)
Hey, it seems you are using the master version. This has been fixed in #152.
Thanks for the update! Since I'm using the tagged branch (v1.5.2) and the changes won't be available there, can I go with the above workaround without impacting anything else?
Are you sure you are on the tagged branch? The Dockerfile in that tag is OK: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/blob/v1.5.2/Dockerfile#L14
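One quick way to confirm the checkout really matches the tag is `git describe --tags --exact-match`, which exits non-zero unless HEAD sits exactly on a tag. A sketch using a throwaway repo as a stand-in for the plugin checkout (the repo name and tag are created here just for the demo):

```shell
# Stand-in demo: create a throwaway repo tagged v1.5.2 (in place of the
# real k8s-rdma-shared-dev-plugin checkout) and verify HEAD sits on the tag.
git init -q demo-repo
git -C demo-repo -c user.email=a@b -c user.name=a commit -q --allow-empty -m init
git -C demo-repo tag v1.5.2

# Prints the tag and exits 0 only when HEAD is exactly at a tag; run the
# same command in your plugin checkout to confirm you are on v1.5.2.
git -C demo-repo describe --tags --exact-match
```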