k8s-rdma-shared-dev-plugin
Post deployment of k8s-rdma-shared-dev-plugin (v1.5.2), rdma pods are in CrashLoopBackOff state
I am trying to deploy the Kubernetes plugin for a RoCE NIC. For this, I deployed Kubernetes using the Kubespray playbook, and after deployment all pods are up and running.
Next, I tried to deploy the RoCE plugin, as part of which k8s-rdma-shared-dev-plugin is installed. However, after deployment the rdma pods are stuck in a CrashLoopBackOff state and the roce pods are in Pending state, as shown in the screenshot below:
I ran `kubectl logs <pod-name> -n <namespace>` to check the logs and found the error below:
I found a workaround for this and followed the steps below:
- Edit the rdma daemonset to add a volume mount for the pci.ids file.
Add the following under the volumeMounts section:
```yaml
- name: pci-ids
  mountPath: /usr/share/misc/pci.ids
  readOnly: true
```
Add the following under the volumes section:
```yaml
- name: pci-ids
  hostPath:
    path: /usr/share/misc/pci.ids
    type: File
```
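The two daemonset edits above can also be applied in one step with `kubectl patch`. A minimal sketch, assuming the daemonset is named `rdma-shared-dp-ds` in the `kube-system` namespace and the plugin container is at index 0 (all three are assumptions; verify against your own deployment):

```shell
# Build a JSON patch that appends the pci.ids hostPath volume and its
# container mount (container index 0 is an assumption; check your spec).
cat > pci-ids-patch.json <<'EOF'
[
  {"op": "add", "path": "/spec/template/spec/volumes/-",
   "value": {"name": "pci-ids",
             "hostPath": {"path": "/usr/share/misc/pci.ids", "type": "File"}}},
  {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-",
   "value": {"name": "pci-ids",
             "mountPath": "/usr/share/misc/pci.ids", "readOnly": true}}
]
EOF

# Sanity-check that the patch is valid JSON before applying it
python3 -m json.tool pci-ids-patch.json > /dev/null && echo "patch OK"

# Apply to the daemonset (name/namespace assumed; adjust as needed)
# kubectl -n kube-system patch daemonset rdma-shared-dp-ds \
#   --type=json --patch-file pci-ids-patch.json
```

The `kubectl patch` line is left commented here since it needs a live cluster; a JSON patch avoids hand-editing the daemonset and survives re-running without a merge conflict, though note the `add`/`-` operations append a duplicate entry if run twice.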
After this, the CrashLoopBackOff issue was resolved and all pods, including rdma and roce, came up as Running.
Is this a known issue, and is a fix available for it? If no fix is currently available, is the above workaround fine?
Environment Details:
- Kubespray: v2.27.0
- Kubernetes: v1.31.4
- rdma plugin: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin.git (v1.5.2)
Hey, it seems you are using the master version. This has been fixed in #152.
Thanks for the update! Since I'm using the tagged branch (v1.5.2) and the changes won't be available there, can I go with the above workaround without impacting anything else?
Are you sure you are on the tagged branch? The Dockerfile in that tag is OK: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/blob/v1.5.2/Dockerfile#L14
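One quick way to confirm the checkout really matches the tag is `git describe --tags --exact-match`, which exits non-zero unless HEAD sits exactly on a tag. A sketch using a throwaway repo as a stand-in for the plugin checkout (the repo name and tag are created here just for the demo):

```shell
# Stand-in demo: create a throwaway repo tagged v1.5.2 (in place of the
# real k8s-rdma-shared-dev-plugin checkout) and verify HEAD sits on the tag.
git init -q demo-repo
git -C demo-repo -c user.email=a@b -c user.name=a commit -q --allow-empty -m init
git -C demo-repo tag v1.5.2

# Prints the tag and exits 0 only when HEAD is exactly at a tag; run the
# same command in your plugin checkout to confirm you are on v1.5.2.
git -C demo-repo describe --tags --exact-match
```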