k8s-rdma-shared-dev-plugin
This plugin does not work when an IB NIC is switched to LINK_TYPE_P1=ETH!
mlxconfig -d <DEVICE_INFO> query | grep LINK_TYPE_P1
mlxconfig -d <DEVICE_INFO> set LINK_TYPE_P1=1
reboot
After switching the IB NIC to ETH mode, I ran the rdma-shared-dev-plugin in the k8s cluster.
The node's Capacity and Allocatable values for every configured resourceName are 0.
NIC version: Mellanox ConnectX-6.
By contrast, a Mellanox ConnectX-6 Dx can share RDMA resources in the k8s cluster.
I'm not sure I understand the issue. Can you share the ConfigMap of the device plugin?
Are you changing the link type of both ports, or of a single port of the NIC? When the link type changes, the netdevice name changes too.
My Mellanox NIC config
root@gpu-11:~$ mst status -v
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX6(rev:0) /dev/mst/mt4123_pciconf3 ### mlx5_9 net-ens31np0 1
ConnectX6(rev:0) /dev/mst/mt4123_pciconf2 ### mlx5_8 net-ens30np0 1
ConnectX6(rev:0) /dev/mst/mt4123_pciconf1 ### mlx5_3 net-ens25np0 0
ConnectX6(rev:0) /dev/mst/mt4123_pciconf0 ### mlx5_2 net-ens24np0 0
root@gpu-11:~# mlxconfig -d /dev/mst/mt4123_pciconf3 query | grep LINK_TYPE_P1
LINK_TYPE_P1 ETH(2)
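To compare link types across several ports or nodes, the query output can be parsed rather than read by eye. This is only a minimal sketch: it assumes every relevant line has the shape `LINK_TYPE_P1 ETH(2)` seen above, and `parse_link_types` is a hypothetical helper name, not part of any Mellanox tool.

```python
import re

# Assumption: each relevant line of `mlxconfig ... query` output looks like
# "LINK_TYPE_P1 ETH(2)", optionally with extra whitespace around the fields.
LINK_TYPE_RE = re.compile(r"^\s*(LINK_TYPE_P\d+)\s+(\w+)\((\d+)\)\s*$")


def parse_link_types(query_output: str) -> dict:
    """Return {parameter: (name, numeric value)} for every LINK_TYPE_Pn line."""
    result = {}
    for line in query_output.splitlines():
        match = LINK_TYPE_RE.match(line)
        if match:
            param, name, value = match.groups()
            result[param] = (name, int(value))
    return result


# The line quoted in this comment, used as sample input.
print(parse_link_types("LINK_TYPE_P1 ETH(2)"))
```

Keeping the numeric value next to the name makes it easy to confirm that the value passed to `mlxconfig ... set` matches the mode the query reports after the reboot.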
Applied rdma-shared-dev-plugin to the k8s cluster
rdma-devices ConfigMap
Name: rdma-devices
Namespace: kube-system
Labels: <none>
Annotations: <none>
Data
====
config.json:
----
{
"periodicUpdateInterval": 300,
"configList": [{
"resourceName": "hca_shared_devices_a",
"rdmaHcaMax": 1000,
"devices": ["ens24np0"]
},
{
"resourceName": "hca_shared_devices_b",
"rdmaHcaMax": 1000,
"devices": ["ens25np0"]
},
{
"resourceName": "hca_shared_devices_c",
"rdmaHcaMax": 1000,
"devices": ["ens30np0"]
},
{
"resourceName": "hca_shared_devices_d",
"rdmaHcaMax": 1000,
"devices": ["ens31np0"]
}
]
}
BinaryData
====
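Before deploying, it can help to list which interface names each resource will select. The sketch below is not the plugin's own loader. It assumes, based on the "loaded config" log line later in this thread, that names given under `devices` are treated like `selectors.ifNames`; `ifnames_per_resource` is a hypothetical helper.

```python
import json


def ifnames_per_resource(config_text: str) -> dict:
    """Map each resourceName to the interface names it would select.

    Assumption: entries listed under "devices" behave like
    "selectors.ifNames", as the plugin's "loaded config" log line suggests.
    """
    config = json.loads(config_text)
    mapping = {}
    for entry in config.get("configList", []):
        selectors = entry.get("selectors", {})
        names = list(entry.get("devices", [])) + list(selectors.get("ifNames", []))
        mapping[entry["resourceName"]] = names
    return mapping


# Two entries in the two styles that appear in this thread.
sample = """
{
  "periodicUpdateInterval": 300,
  "configList": [
    {"resourceName": "hca_shared_devices_a", "rdmaHcaMax": 1000, "devices": ["ens24np0"]},
    {"resourceName": "hca_3", "rdmaHcaMax": 1000, "selectors": {"ifNames": ["ens24np0"]}}
  ]
}
"""
print(ifnames_per_resource(sample))
```

Each name in the result can then be checked against the NET column of `mst status -v` on the node, since the netdevice name can change when the link type changes.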
rdma-shared-dp-ds Daemonset
Name: rdma-shared-dp-ds
Selector: name=rdma-shared-dp-ds
Node-Selector: <none>
Labels: <none>
Annotations: deprecated.daemonset.template.generation: 4
Desired Number of Nodes Scheduled: 70
Current Number of Nodes Scheduled: 70
Number of Nodes Scheduled with Up-to-date Pods: 70
Number of Nodes Scheduled with Available Pods: 70
Number of Nodes Misscheduled: 0
Pods Status: 70 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: name=rdma-shared-dp-ds
Annotations: kubectl.kubernetes.io/restartedAt: 2024-01-24T12:41:15+08:00
Containers:
k8s-rdma-shared-dp-ds:
Image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:1.4.0
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/dev/ from devs (rw)
/k8s-rdma-shared-dev-plugin from config (rw)
/var/lib/kubelet/ from device-plugin (rw)
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/
HostPathType:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: rdma-devices
Optional: false
devs:
Type: HostPath (bare host directory volume)
Path: /dev/
HostPathType:
Priority Class Name: system-node-critical
Events: <none>
Result
[root@master1 ~]# kubectl describe node gpu-11
Name: gpu-11
Roles: <none>
Labels: gpu=A100
kubernetes.io/arch=amd64
.....
Capacity:
.....
nvidia.com/gpu: 8
pods: 110
rdma/hca_shared_devices_a: 0
rdma/hca_shared_devices_b: 0
rdma/hca_shared_devices_c: 0
rdma/hca_shared_devices_d: 0
Allocatable:
.....
nvidia.com/gpu: 8
pods: 110
rdma/hca_shared_devices_a: 0
rdma/hca_shared_devices_b: 0
rdma/hca_shared_devices_c: 0
rdma/hca_shared_devices_d: 0
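With 70 nodes running the DaemonSet, checking each `kubectl describe node` by hand is tedious. A minimal sketch for flagging resources reported as 0 follows. It assumes the text layout shown above (a `Capacity:` or `Allocatable:` header followed by `name: value` lines), and `zero_rdma_resources` is a hypothetical helper name.

```python
def zero_rdma_resources(describe_text: str, prefix: str = "rdma/") -> dict:
    """Return {section: [resource names whose reported value is 0]}."""
    section = None
    zeros = {}
    for raw in describe_text.splitlines():
        line = raw.strip()
        if line in ("Capacity:", "Allocatable:"):
            # Remember which block the following lines belong to.
            section = line.rstrip(":")
            zeros[section] = []
            continue
        if section and line.startswith(prefix):
            name, _, value = line.partition(":")
            if value.strip() == "0":
                zeros[section].append(name)
    return zeros


sample = """Capacity:
  nvidia.com/gpu: 8
  rdma/hca_shared_devices_a: 0
Allocatable:
  nvidia.com/gpu: 8
  rdma/hca_shared_devices_a: 0
"""
print(zero_rdma_resources(sample))
```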
Can you provide the device plugin logs and the contents of the /dev/infiniband folder on the node?
/dev/infiniband
root@gpu-11:/dev/infiniband# ls -alh
total 0
drwxr-xr-x 2 root root 660 Dec 28 11:15 .
drwxr-xr-x 24 root root 5.6K Jan 18 17:26 ..
crw------- 1 root root 231, 64 Dec 28 14:10 issm0
crw------- 1 root root 231, 65 Dec 28 14:10 issm1
crw------- 1 root root 231, 66 Dec 28 14:10 issm2
crw------- 1 root root 231, 67 Dec 28 14:10 issm3
crw------- 1 root root 231, 68 Dec 28 14:10 issm4
crw------- 1 root root 231, 69 Dec 28 14:10 issm5
crw------- 1 root root 231, 70 Dec 28 14:10 issm6
crw------- 1 root root 231, 71 Dec 28 14:10 issm7
crw------- 1 root root 231, 72 Dec 28 14:10 issm8
crw------- 1 root root 231, 73 Dec 28 14:10 issm9
crw-rw-rw- 1 root root 10, 56 Dec 28 14:10 rdma_cm
crw------- 1 root root 231, 0 Dec 28 14:10 umad0
crw------- 1 root root 231, 1 Dec 28 14:10 umad1
crw------- 1 root root 231, 2 Dec 28 14:10 umad2
crw------- 1 root root 231, 3 Dec 28 14:10 umad3
crw------- 1 root root 231, 4 Dec 28 14:10 umad4
crw------- 1 root root 231, 5 Dec 28 14:10 umad5
crw------- 1 root root 231, 6 Dec 28 14:10 umad6
crw------- 1 root root 231, 7 Dec 28 14:10 umad7
crw------- 1 root root 231, 8 Dec 28 14:10 umad8
crw------- 1 root root 231, 9 Dec 28 14:10 umad9
crw-rw-rw- 1 root root 231, 192 Dec 28 14:10 uverbs0
crw-rw-rw- 1 root root 231, 193 Dec 28 14:10 uverbs1
crw-rw-rw- 1 root root 231, 194 Dec 28 14:10 uverbs2
crw-rw-rw- 1 root root 231, 195 Dec 28 14:10 uverbs3
crw-rw-rw- 1 root root 231, 196 Dec 28 14:10 uverbs4
crw-rw-rw- 1 root root 231, 197 Dec 28 14:10 uverbs5
crw-rw-rw- 1 root root 231, 198 Dec 28 14:10 uverbs6
crw-rw-rw- 1 root root 231, 199 Dec 28 14:10 uverbs7
crw-rw-rw- 1 root root 231, 200 Dec 28 14:10 uverbs8
crw-rw-rw- 1 root root 231, 201 Dec 28 14:10 uverbs9
device plugin log
[root@master1 k8s]# kubectl -n kube-system logs rdma-shared-dp-ds-6jknp
2024/01/29 02:36:25 Starting K8s RDMA Shared Device Plugin version= master
2024/01/29 02:36:25 resource manager reading configs
Using Kubelet Plugin Registry Mode
2024/01/29 02:36:25 Reading /k8s-rdma-shared-dev-plugin/config.json
2024/01/29 02:36:25 loaded config: [{ResourceName:hca_shared_devices_a ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens24np0] LinkTypes:[]}} {ResourceName:hca_shared_devices_b ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens25np0] LinkTypes:[]}} {ResourceName:hca_shared_devices_c ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens30np0] LinkTypes:[]}} {ResourceName:hca_shared_devices_d ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens31np0] LinkTypes:[]}}]
2024/01/29 02:36:25 periodic update interval: +300
2024/01/29 02:36:25 Discovering host devices
2024/01/29 02:36:25 discovering host network devices
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.2 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.3 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 Initializing resource servers
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_a ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens24np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_b ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens25np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_c ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens30np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_d ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens31np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Starting all servers...
2024/01/29 02:36:25 starting rdma/hca_shared_devices_a device plugin endpoint at: hca_shared_devices_a.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_a device plugin endpoint started serving
2024/01/29 02:36:25 starting rdma/hca_shared_devices_b device plugin endpoint at: hca_shared_devices_b.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_b device plugin endpoint started serving
2024/01/29 02:36:25 starting rdma/hca_shared_devices_c device plugin endpoint at: hca_shared_devices_c.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_c device plugin endpoint started serving
2024/01/29 02:36:25 starting rdma/hca_shared_devices_d device plugin endpoint at: hca_shared_devices_d.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_d device plugin endpoint started serving
2024/01/29 02:36:25 All servers started.
2024/01/29 02:36:25 Listening for term signals
2024/01/29 02:36:25 Starting OS watcher.
2024/01/29 02:41:25 discovering host network devices
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.2 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.3 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_a"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_b"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_c"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_d"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:46:25 discovering host network devices
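The log is long, so a per-resource count of the "error creating new device" lines during initialization makes it easier to triage. This is a rough sketch that assumes the line formats shown above; `errors_per_resource` is a hypothetical helper, and the count says nothing about which PCI devices the errors refer to, because the addresses are elided in the log.

```python
import re
from collections import Counter

RESOURCE_RE = re.compile(r"Resource: &\{ResourceName:(\S+)")
ERROR_MARK = "error creating new device"


def errors_per_resource(log_text: str) -> dict:
    """Count init-time device-creation errors under each 'Resource:' line."""
    counts = Counter()
    current = None
    for line in log_text.splitlines():
        match = RESOURCE_RE.search(line)
        if match:
            current = match.group(1)
            counts[current] += 0  # make sure the resource appears even with no errors
        elif ERROR_MARK in line and current is not None:
            counts[current] += 1
        elif "Starting all servers" in line:
            current = None  # initialization is over; ignore later periodic scans
    return dict(counts)


sample = """2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_a ResourcePrefix:rdma}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec"
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_b ResourcePrefix:rdma}
2024/01/29 02:36:25 Starting all servers...
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec"
"""
print(errors_per_resource(sample))
```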
I want to fix the problem. Can you show me how to modify this plugin?
From the logs, the device plugin behaves as expected.
I see that the device plugin discovered resources properly. kubelet is not calling ListAndWatch [1]; otherwise we would have seen a log message (that call is what then reports the resources on the node object).
Can you provide a listing of: /var/lib/kubelet, /var/lib/kubelet/plugins_registry, /var/lib/kubelet/plugins and /var/lib/kubelet/device-plugins?
Can you add the YAML used to deploy the device plugin DaemonSet? Is it what we have in the master branch?
[1] https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/blob/0b407a41030d25ade59017cfc8494fff2456dda4/pkg/resources/server.go#L299
The kubelet --root-dir is /data/kubelet. Other plugins, such as the NVIDIA plugin and csi-nfs, are working.
root@gpu-11:/var/lib/kubelet# tree
.
├── config.yaml
├── device-plugins
│ ├── DEPRECATION
│ ├── device-plugins
│ ├── kubelet_internal_checkpoint
│ ├── kubelet.sock
│ ├── nvidia.sock
│ └── plugins_registry
├── kubeadm-flags.env
├── pki
│ ├── kubelet-client-2024-01-08-11-59-36.pem
│ ├── kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2024-01-08-11-59-36.pem
│ ├── kubelet.crt
│ └── kubelet.key
├── plugins_registry
│ ├── hca_shared_devices_a.sock
│ ├── hca_shared_devices_b.sock
│ ├── hca_shared_devices_c.sock
│ └── hca_shared_devices_d.sock
└── pod-resources
6 directories, 14 files
OK, so is it /data/kubelet or /var/lib/kubelet?
Your tree output is of the latter, but you say the kubelet root is the former.
Did you deploy rdma-shared-device-plugin with the modified mounts as suggested in #96? The layout looks OK; if both kubelet and the device plugin use the same paths, it should work.
Please provide some additional information on how to reproduce this (k8s version, OS, NIC hardware and its config).
The tree I showed is of the /var/lib/kubelet directory. The plugin DaemonSet configuration is earlier in this thread.
my environment:
os version: Ubuntu 20.04 kernel 5.4.0-100-generic
kubernetes version: v1.23.0
lspci | grep Mell: Mellanox Technologies MT28908 Family [ConnectX-6]
Mellanox version: MLNX_OFED_LINUX-5.8-3.0.7.0-ubuntu20.04-x86_64.tgz
If your kubelet root dir is configured as /data/kubelet, then IMO the device plugin needs to mount that same directory.
Can you try it?
That is: mount the host's /data/kubelet at /var/lib/kubelet in the device plugin DaemonSet.
Apart from that, everything looks OK to me.
https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/blob/dd32974f4821a7e87c8dfa5c20397438ae878aeb/pkg/resources/resources_manager.go#L58
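The mount suggestion can be reasoned about as a path translation. The sketch below assumes the plugin writes its sockets under /var/lib/kubelet inside the container (as the linked resources_manager.go line suggests) and shows where such a socket ends up on the host for a given set of hostPath mounts. `container_to_host` is a hypothetical helper for illustration, not plugin code.

```python
import posixpath


def container_to_host(path, mounts):
    """Translate a container path to a host path via hostPath mounts.

    mounts: list of (mount_path_in_container, host_path) pairs.
    Returns None when no mount covers the path. The longest matching
    mount path wins. Assumption: no mount is the filesystem root itself.
    """
    best = None
    for mount_path, host_path in mounts:
        mount_norm = posixpath.normpath(mount_path)
        if path == mount_norm or path.startswith(mount_norm + "/"):
            if best is None or len(mount_norm) > len(best[0]):
                best = (mount_norm, posixpath.normpath(host_path))
    if best is None:
        return None
    return best[1] + path[len(best[0]):]


socket_in_container = "/var/lib/kubelet/plugins_registry/hca_shared_devices_a.sock"

# Mounts from the DaemonSet described earlier in the thread.
original = [("/var/lib/kubelet/", "/var/lib/kubelet/")]
# Mounts as suggested in this comment.
suggested = [("/var/lib/kubelet", "/data/kubelet")]

print(container_to_host(socket_in_container, original))
print(container_to_host(socket_in_container, suggested))
```

With the original mounts the socket lands under the host's /var/lib/kubelet/plugins_registry, which matches the tree output above, while a kubelet started with --root-dir=/data/kubelet keeps its plugin registry under /data/kubelet.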
Why must I use the default kubelet --root-dir?
When I used the default --root-dir, it was okay at first. But later the plugin was not running, or there were no allocatable devices, so I decided to change the directory.
The new log shows:
2024/03/22 05:58:38 Starting OS watcher.
2024/03/22 05:58:49 hca_3.sock failed to be registered at Kubelet: RegisterPlugin error -- plugin registration failed with err: failed to dial device plugin with socketPath /var/lib/kubelet/plugins_registry/hca_3.sock: failed to dial device plugin: context deadline exceeded; restarting.
Daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rdma-shared-dp-ds
  namespace: cni-plugin
spec:
  selector:
    matchLabels:
      name: rdma-shared-dp-ds
  template:
    metadata:
      labels:
        name: rdma-shared-dp-ds
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: rdma
                operator: In
                values:
                - sugon
      hostNetwork: true
      priorityClassName: system-node-critical
      containers:
      - image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin
        name: k8s-rdma-shared-dp-ds
        imagePullPolicy: IfNotPresent
        #securityContext:
        #  privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: plugins-registry
          mountPath: /var/lib/kubelet/plugins_registry
        - name: config
          mountPath: /k8s-rdma-shared-dev-plugin
        - name: devs
          mountPath: /dev/
      volumes:
      - name: device-plugin
        hostPath:
          path: /data/kubelet/device-plugins
      - name: plugins-registry
        hostPath:
          path: /data/kubelet/plugins_registry
      - name: config
        configMap:
          name: rdma-devices
          items:
          - key: config.json
            path: config.json
      - name: devs
        hostPath:
          path: /dev/
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: cni-plugin
data:
  config.json: |
    {
      "periodicUpdateInterval": 300,
      "configList": [{
          "resourceName": "hca_3",
          "rdmaHcaMax": 1000,
          "selectors": {
            "ifNames": ["ens24np0"]
          }
        }
      ]
    }
kubelet start arguments
/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --hostname-override=gpu-186 --network-plugin=cni --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.6 --root-dir=/data/kubelet --node-ip=192.168.1.9 --max-pods=20 -v=4
kubelet version v1.23.0
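Since the directory kubelet watches for plugin sockets follows from `--root-dir`, it can be derived from the command line above. This is a minimal sketch: `kubelet_root_dir` and `plugins_registry_dir` are hypothetical helpers, and the only facts relied on are kubelet's documented default of /var/lib/kubelet and the `<root-dir>/plugins_registry` layout visible in the kubelet log in this thread.

```python
import shlex

DEFAULT_ROOT_DIR = "/var/lib/kubelet"  # kubelet's default when --root-dir is not set


def kubelet_root_dir(command_line: str) -> str:
    """Extract --root-dir from a kubelet command line, or return the default."""
    args = shlex.split(command_line)
    for index, arg in enumerate(args):
        if arg.startswith("--root-dir="):
            return arg.split("=", 1)[1]
        if arg == "--root-dir" and index + 1 < len(args):
            return args[index + 1]
    return DEFAULT_ROOT_DIR


def plugins_registry_dir(command_line: str) -> str:
    """Directory where kubelet watches for plugin registration sockets."""
    return kubelet_root_dir(command_line).rstrip("/") + "/plugins_registry"


# Shortened form of the command line quoted above.
cmd = "/usr/bin/kubelet --config=/var/lib/kubelet/config.yaml --root-dir=/data/kubelet --max-pods=20 -v=4"
print(plugins_registry_dir(cmd))
```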
kubelet log
Mar 22 13:58:38 GPU-186 kubelet[2736666]: I0322 13:58:38.523704 2736666 plugin_watcher.go:203] "Adding socket path or updating timestamp to desired state cache" path="/data/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:39 GPU-186 kubelet[2736666]: I0322 13:58:39.470223 2736666 reconciler.go:160] "OperationExecutor.RegisterPlugin started" plugin={SocketPath:/data/kubelet/plugins_registry/hca_3.sock Timestamp:2024-03-22 13:58:38.523730272 +0800 CST m=+75.689085324 Handler:<nil> Name:}
Mar 22 13:58:39 GPU-186 kubelet[2736666]: I0322 13:58:39.471951 2736666 manager.go:308] "Got Plugin at endpoint with versions" plugin="rdma/hca_3" endpoint="/var/lib/kubelet/plugins_registry/hca_3.sock" versions=[v1alpha1 v1beta1]
Mar 22 13:58:39 GPU-186 kubelet[2736666]: I0322 13:58:39.472004 2736666 manager.go:325] "Registering plugin at endpoint" plugin="rdma/hca_3" endpoint="/var/lib/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:39 GPU-186 kubelet[2736666]: W0322 13:58:39.472247 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:40 GPU-186 kubelet[2736666]: W0322 13:58:40.473408 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:42 GPU-186 kubelet[2736666]: W0322 13:58:42.099541 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:44 GPU-186 kubelet[2736666]: W0322 13:58:44.243570 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:47 GPU-186 kubelet[2736666]: W0322 13:58:47.653129 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:49 GPU-186 kubelet[2736666]: E0322 13:58:49.472945 2736666 endpoint.go:63] "Can't create new endpoint with socket path" err="failed to dial device plugin: context deadline exceeded" path="/var/lib/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:49 GPU-186 kubelet[2736666]: I0322 13:58:49.473931 2736666 plugin_watcher.go:215] "Removing socket path from desired state cache" path="/data/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:49 GPU-186 kubelet[2736666]: E0322 13:58:49.474152 2736666 goroutinemap.go:150] Operation for "/data/kubelet/plugins_registry/hca_3.sock" failed. No retries permitted until 2024-03-22 13:58:49.974116125 +0800 CST m=+87.139471161 (durationBeforeRetry 500ms). Error: RegisterPlugin error -- plugin registration failed with err: failed to dial device plugin with socketPath /var/lib/kubelet/plugins_registry/hca_3.sock: failed to dial device plugin: context deadline exceeded: rpc error: code = Unavailable desc = error reading from server: EOF
Mar 22 13:58:49 GPU-186 kubelet[2736666]: W0322 13:58:49.474224 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/data/kubelet/plugins_registry/hca_3.sock /data/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /data/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:50 GPU-186 kubelet[2736666]: I0322 13:58:50.476259 2736666 reconciler.go:143] "OperationExecutor.UnregisterPlugin started" plugin={SocketPath:/data/kubelet/plugins_registry/hca_3.sock Timestamp:2024-03-22 13:58:38.523730272 +0800 CST m=+75.689085324 Handler:0xc000630000 Name:rdma/hca_3}
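The kubelet log suggests that the socket was noticed under /data/kubelet/plugins_registry, but the endpoint kubelet then tried to dial was under /var/lib/kubelet/plugins_registry, which does not exist on this host. A small sketch that pulls both paths out of such a log and flags the difference is below. It assumes the two message formats shown above, and `find_socket_mismatches` is a hypothetical helper.

```python
import re

WATCHED_RE = re.compile(
    r'"Adding socket path or updating timestamp to desired state cache" path="([^"]+)"'
)
ENDPOINT_RE = re.compile(
    r'"Registering plugin at endpoint" plugin="([^"]+)" endpoint="([^"]+)"'
)


def find_socket_mismatches(log_text: str):
    """Return (watched socket paths, [(plugin, endpoint)] not among them)."""
    watched = sorted(set(WATCHED_RE.findall(log_text)))
    mismatches = [
        (plugin, endpoint)
        for plugin, endpoint in ENDPOINT_RE.findall(log_text)
        if endpoint not in watched
    ]
    return watched, mismatches


# Two lines copied from the kubelet log above, with the journald prefix trimmed.
sample_log = (
    'plugin_watcher.go:203] "Adding socket path or updating timestamp to desired state cache" '
    'path="/data/kubelet/plugins_registry/hca_3.sock"\n'
    'manager.go:325] "Registering plugin at endpoint" plugin="rdma/hca_3" '
    'endpoint="/var/lib/kubelet/plugins_registry/hca_3.sock"\n'
)
watched, mismatches = find_socket_mismatches(sample_log)
print(watched)
print(mismatches)
```

A non-empty mismatch list means kubelet is being asked to dial a path it never saw a socket at, which is consistent with the "no such file or directory" errors in the log.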
Contents of plugins_registry under the kubelet --root-dir (/data/kubelet):
root@GPU-186:/data/kubelet# tree /data/kubelet/plugins_registry/
/data/kubelet/plugins_registry/
└── nfs.csi.k8s.io-reg.sock
The rdma-shared-dev-plugin does not create the socket file in the /data/kubelet/plugins_registry directory.