
This plugin does not work when an IB NIC is used with LINK_TYPE_P1=ETH!

Open sober-wang opened this issue 1 year ago • 12 comments

This plugin does not work when an IB NIC is used with LINK_TYPE_P1=ETH!

mlxconfig -d <DEVICE_INFO> query | grep LINK_TYPE_P1
mlxconfig -d <DEVICE_INFO> set LINK_TYPE_P1=2
reboot
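After the reboot, one way to confirm the ports really switched to Ethernet mode is to read the link layer each RDMA port reports in sysfs. This is a small sketch; `check_link_layer` is a helper name introduced here, and its optional root argument (the real path is `/sys/class/infiniband`) exists only so the function can be exercised against a fake tree:

```shell
# Print the link layer (Ethernet vs. InfiniBand) reported by each RDMA port.
# The sysfs layout is the standard one for RDMA devices; the optional root
# argument defaults to the real /sys/class/infiniband.
check_link_layer() {
  root=${1:-/sys/class/infiniband}
  for f in "$root"/*/ports/*/link_layer; do
    [ -e "$f" ] || continue   # glob did not match: no RDMA devices here
    echo "$f: $(cat "$f")"
  done
}

check_link_layer
```

On a node where the mode change took effect, every mlx5 port should print `Ethernet`.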

The IB NIC was changed to Ethernet mode and the rdma-shared-dev-plugin was run in the k8s cluster. The node's Capacity and Allocatable values for each resourceName are 0.

NIC model: Mellanox ConnectX-6.

However, a Mellanox ConnectX-6 Dx can share RDMA resources in the k8s cluster.

sober-wang avatar Jan 25 '24 12:01 sober-wang

I'm not sure I understand the issue; can you share the config map of the device plugin?

Are you changing the link type? Of both ports, or of a single port of the NIC? Note that when changing the link type, the netdevice name changes.

adrianchiris avatar Jan 25 '24 13:01 adrianchiris

My Mellanox NIC config

root@gpu-11:~$ mst status -v 

MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA  
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf3      ###       mlx5_9          net-ens31np0              1     

ConnectX6(rev:0)        /dev/mst/mt4123_pciconf2      ###       mlx5_8          net-ens30np0              1     

ConnectX6(rev:0)        /dev/mst/mt4123_pciconf1      ###       mlx5_3          net-ens25np0              0     

ConnectX6(rev:0)        /dev/mst/mt4123_pciconf0      ###       mlx5_2          net-ens24np0              0 



root@gpu-11:~# mlxconfig -d /dev/mst/mt4123_pciconf3 query | grep LINK_TYPE_P1
         LINK_TYPE_P1                                ETH(2)

Applying the rdma-shared config in the k8s cluster.

rdma-devices ConfigMap

Name:         rdma-devices
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

Data
====
config.json:
----
{
    "periodicUpdateInterval": 300,
    "configList": [{
         "resourceName": "hca_shared_devices_a",
         "rdmaHcaMax": 1000,
         "devices": ["ens24np0"]
       },
       {
         "resourceName": "hca_shared_devices_b",
         "rdmaHcaMax": 1000,
         "devices": ["ens25np0"] 
       },
       {
         "resourceName": "hca_shared_devices_c",
         "rdmaHcaMax": 1000,
         "devices": ["ens30np0"] 
       },
       {
         "resourceName": "hca_shared_devices_d",
         "rdmaHcaMax": 1000,
         "devices": ["ens31np0"] 
       }
    ]
}


BinaryData
====
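A quick way to sanity-check such a config on the node is to verify that each netdev named under `devices` is actually backed by an RDMA device, since the plugin can only expose interfaces that have one. This is a sketch; `check_rdma_link` is a hypothetical helper, and the `NET_SYSFS` override exists only for testing (the real path is `/sys/class/net`):

```shell
# For each interface name from the ConfigMap, report the RDMA device the
# kernel associates with it (or flag the interface as having none).
check_rdma_link() {
  root=${NET_SYSFS:-/sys/class/net}
  for ifname in "$@"; do
    dir="$root/$ifname/device/infiniband"
    if [ -d "$dir" ]; then
      echo "$ifname -> $(ls "$dir")"
    else
      echo "$ifname: no RDMA device"
    fi
  done
}

check_rdma_link ens24np0 ens25np0 ens30np0 ens31np0
```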

rdma-shared-dp-ds Daemonset

Name:           rdma-shared-dp-ds
Selector:       name=rdma-shared-dp-ds
Node-Selector:  <none>
Labels:         <none>
Annotations:    deprecated.daemonset.template.generation: 4
Desired Number of Nodes Scheduled: 70
Current Number of Nodes Scheduled: 70
Number of Nodes Scheduled with Up-to-date Pods: 70
Number of Nodes Scheduled with Available Pods: 70
Number of Nodes Misscheduled: 0
Pods Status:  70 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       name=rdma-shared-dp-ds
  Annotations:  kubectl.kubernetes.io/restartedAt: 2024-01-24T12:41:15+08:00
  Containers:
   k8s-rdma-shared-dp-ds:
    Image:        ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:1.4.0
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /dev/ from devs (rw)
      /k8s-rdma-shared-dev-plugin from config (rw)
      /var/lib/kubelet/ from device-plugin (rw)
  Volumes:
   device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/
    HostPathType:  
   config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rdma-devices
    Optional:  false
   devs:
    Type:               HostPath (bare host directory volume)
    Path:               /dev/
    HostPathType:       
  Priority Class Name:  system-node-critical
Events:                 <none>

Result

[root@master1 ~]# kubectl describe node gpu-11
Name:               gpu-11
Roles:              <none>
Labels:             gpu=A100
                    kubernetes.io/arch=amd64
.....

Capacity:
.....
  nvidia.com/gpu:             8
  pods:                       110
  rdma/hca_shared_devices_a:  0
  rdma/hca_shared_devices_b:  0
  rdma/hca_shared_devices_c:  0
  rdma/hca_shared_devices_d:  0
Allocatable:
.....
  nvidia.com/gpu:             8
  pods:                       110
  rdma/hca_shared_devices_a:  0
  rdma/hca_shared_devices_b:  0
  rdma/hca_shared_devices_c:  0
  rdma/hca_shared_devices_d:  0

sober-wang avatar Jan 26 '24 06:01 sober-wang

can you provide the device plugin logs and the content of the /dev/infiniband folder on the node?

adrianchiris avatar Jan 28 '24 10:01 adrianchiris

/dev/infiniband

root@gpu-11:/dev/infiniband# ls -alh
total 0
drwxr-xr-x  2 root root      660 Dec 28 11:15 .
drwxr-xr-x 24 root root     5.6K Jan 18 17:26 ..
crw-------  1 root root 231,  64 Dec 28 14:10 issm0
crw-------  1 root root 231,  65 Dec 28 14:10 issm1
crw-------  1 root root 231,  66 Dec 28 14:10 issm2
crw-------  1 root root 231,  67 Dec 28 14:10 issm3
crw-------  1 root root 231,  68 Dec 28 14:10 issm4
crw-------  1 root root 231,  69 Dec 28 14:10 issm5
crw-------  1 root root 231,  70 Dec 28 14:10 issm6
crw-------  1 root root 231,  71 Dec 28 14:10 issm7
crw-------  1 root root 231,  72 Dec 28 14:10 issm8
crw-------  1 root root 231,  73 Dec 28 14:10 issm9
crw-rw-rw-  1 root root  10,  56 Dec 28 14:10 rdma_cm
crw-------  1 root root 231,   0 Dec 28 14:10 umad0
crw-------  1 root root 231,   1 Dec 28 14:10 umad1
crw-------  1 root root 231,   2 Dec 28 14:10 umad2
crw-------  1 root root 231,   3 Dec 28 14:10 umad3
crw-------  1 root root 231,   4 Dec 28 14:10 umad4
crw-------  1 root root 231,   5 Dec 28 14:10 umad5
crw-------  1 root root 231,   6 Dec 28 14:10 umad6
crw-------  1 root root 231,   7 Dec 28 14:10 umad7
crw-------  1 root root 231,   8 Dec 28 14:10 umad8
crw-------  1 root root 231,   9 Dec 28 14:10 umad9
crw-rw-rw-  1 root root 231, 192 Dec 28 14:10 uverbs0
crw-rw-rw-  1 root root 231, 193 Dec 28 14:10 uverbs1
crw-rw-rw-  1 root root 231, 194 Dec 28 14:10 uverbs2
crw-rw-rw-  1 root root 231, 195 Dec 28 14:10 uverbs3
crw-rw-rw-  1 root root 231, 196 Dec 28 14:10 uverbs4
crw-rw-rw-  1 root root 231, 197 Dec 28 14:10 uverbs5
crw-rw-rw-  1 root root 231, 198 Dec 28 14:10 uverbs6
crw-rw-rw-  1 root root 231, 199 Dec 28 14:10 uverbs7
crw-rw-rw-  1 root root 231, 200 Dec 28 14:10 uverbs8
crw-rw-rw-  1 root root 231, 201 Dec 28 14:10 uverbs9

device plugin log

[root@master1 k8s]# kubectl -n kube-system logs rdma-shared-dp-ds-6jknp
2024/01/29 02:36:25 Starting K8s RDMA Shared Device Plugin version= master
2024/01/29 02:36:25 resource manager reading configs
Using Kubelet Plugin Registry Mode
2024/01/29 02:36:25 Reading /k8s-rdma-shared-dev-plugin/config.json
2024/01/29 02:36:25 loaded config: [{ResourceName:hca_shared_devices_a ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens24np0] LinkTypes:[]}} {ResourceName:hca_shared_devices_b ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens25np0] LinkTypes:[]}} {ResourceName:hca_shared_devices_c ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens30np0] LinkTypes:[]}} {ResourceName:hca_shared_devices_d ResourcePrefix: RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens31np0] LinkTypes:[]}}]
2024/01/29 02:36:25 periodic update interval: +300
2024/01/29 02:36:25 Discovering host devices
2024/01/29 02:36:25 discovering host network devices
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.2 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.3 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:36:25 Initializing resource servers
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_a ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens24np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_b ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens25np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_c ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens30np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Resource: &{ResourceName:hca_shared_devices_d ResourcePrefix:rdma RdmaHcaMax:1000 Devices:[] Selectors:{Vendors:[] DeviceIDs:[] Drivers:[] IfNames:[ens31np0] LinkTypes:[]}}
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:36:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:36:25 Starting all servers...
2024/01/29 02:36:25 starting rdma/hca_shared_devices_a device plugin endpoint at: hca_shared_devices_a.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_a device plugin endpoint started serving
2024/01/29 02:36:25 starting rdma/hca_shared_devices_b device plugin endpoint at: hca_shared_devices_b.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_b device plugin endpoint started serving
2024/01/29 02:36:25 starting rdma/hca_shared_devices_c device plugin endpoint at: hca_shared_devices_c.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_c device plugin endpoint started serving
2024/01/29 02:36:25 starting rdma/hca_shared_devices_d device plugin endpoint at: hca_shared_devices_d.sock
2024/01/29 02:36:25 rdma/hca_shared_devices_d device plugin endpoint started serving
2024/01/29 02:36:25 All servers started.
2024/01/29 02:36:25 Listening for term signals
2024/01/29 02:36:25 Starting OS watcher.
2024/01/29 02:41:25 discovering host network devices 2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.2 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.3 02 Intel Corporation I350 Gigabit Network Connection
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 DiscoverHostDevices(): device found: 0000:...:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_a"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_b"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_c"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.0, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.1, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.2, RDMA device "issm" not found"
2024/01/29 02:41:25 error creating new device: "missing RDMA device spec for device 0000:....:00.3, RDMA device "issm" not found"
2024/01/29 02:41:25 no changes to devices for "rdma/hca_shared_devices_d"
2024/01/29 02:41:25 exposing "1000" devices
2024/01/29 02:46:25 discovering host network devices

sober-wang avatar Jan 29 '24 02:01 sober-wang

can you provide the device plugin logs and the content of the /dev/infiniband folder on the node?

I want to fix the problem. Can you show me how to modify this plugin?

sober-wang avatar Jan 30 '24 02:01 sober-wang

from the logs, the device plugin behaves as expected.

i see that the device plugin discovered resources properly. kubelet is not calling ListAndWatch [1], else we would have seen a log msg (which would then report the resources on the node obj)

can you provide directory listings of: /var/lib/kubelet, /var/lib/kubelet/plugins_registry, /var/lib/kubelet/plugins, /var/lib/kubelet/device_plugins?

can you add the yaml used for deployment of the device plugin daemonset? is it what we have in the master branch?

[1] https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/blob/0b407a41030d25ade59017cfc8494fff2456dda4/pkg/resources/server.go#L299

adrianchiris avatar Jan 30 '24 09:01 adrianchiris

from the logs, the device plugin behaves as expected.

i see that the device plugin discovered resources properly. kubelet is not calling ListAndWatch [1], else we would have seen a log msg (which would then report the resources on the node obj)

can you provide directory listings of: /var/lib/kubelet, /var/lib/kubelet/plugins_registry, /var/lib/kubelet/plugins, /var/lib/kubelet/device_plugins?

can you add the yaml used for deployment of the device plugin daemonset? is it what we have in the master branch?

[1]

https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/blob/0b407a41030d25ade59017cfc8494fff2456dda4/pkg/resources/server.go#L299

The kubelet --root-dir is /data/kubelet. Other plugins, such as the NVIDIA device plugin and csi-nfs, are working.

root@gpu-11:/var/lib/kubelet# tree 
.
├── config.yaml
├── device-plugins
│   ├── DEPRECATION
│   ├── device-plugins
│   ├── kubelet_internal_checkpoint
│   ├── kubelet.sock
│   ├── nvidia.sock
│   └── plugins_registry
├── kubeadm-flags.env
├── pki
│   ├── kubelet-client-2024-01-08-11-59-36.pem
│   ├── kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2024-01-08-11-59-36.pem
│   ├── kubelet.crt
│   └── kubelet.key
├── plugins_registry
│   ├── hca_shared_devices_a.sock
│   ├── hca_shared_devices_b.sock
│   ├── hca_shared_devices_c.sock
│   └── hca_shared_devices_d.sock
└── pod-resources

6 directories, 14 files

sober-wang avatar Feb 02 '24 01:02 sober-wang

ok, so is it /data/kubelet or /var/lib/kubelet ? your tree output is of the latter but you say kubelet root is the former.

did you deploy rdma-shared-device-plugin with the modified mounts as suggested in #96 ? the layout looks OK, if both kubelet and device plugin have the same paths it should work.

please provide some additional information on how to reproduce this (k8s version, OS, NIC hardware and its config).

adrianchiris avatar Feb 04 '24 08:02 adrianchiris

ok, so is it /data/kubelet or /var/lib/kubelet ? your tree output is of the latter but you say kubelet root is the former.

did you deploy rdma-shared-device-plugin with the modified mounts as suggested in #96 ? the layout looks OK, if both kubelet and device plugin have the same paths it should work.

please provide some additional information on how to reproduce this (k8s version, OS, NIC hardware and its config).

I'm showing the /var/lib/kubelet directory tree. The plugin daemonset configuration is in the chat history above; you can find it there.

my environment:
os version: Ubuntu 20.04, kernel 5.4.0-100-generic
kubernetes version: v1.23.0
lspci | grep Mell: Mellanox Technologies MT28908 Family [ConnectX-6]
Mellanox OFED: MLNX_OFED_LINUX-5.8-3.0.7.0-ubuntu20.04-x86_64.tgz

sober-wang avatar Feb 05 '24 06:02 sober-wang

if your kubelet root dir is configured as /data/kubelet, then the device plugin needs to mount the same directory, IMO.

can you try it?

that is: mount /data/kubelet to /var/lib/kubelet in the device plugin daemonset.

apart from that, everything looks OK to me
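The mount suggested above could look roughly like this in the DaemonSet (a sketch, not a tested manifest; `kubelet-root` is a volume name invented here, and /data/kubelet is the actual kubelet root reported in this thread):

```yaml
# Container spec fragment (assumed shape): expose the host's kubelet root
# dir at the default /var/lib/kubelet path the plugin expects.
volumeMounts:
  - name: kubelet-root
    mountPath: /var/lib/kubelet
# ...
volumes:
  - name: kubelet-root
    hostPath:
      path: /data/kubelet
```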

adrianchiris avatar Feb 05 '24 08:02 adrianchiris

https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/blob/dd32974f4821a7e87c8dfa5c20397438ae878aeb/pkg/resources/resources_manager.go#L58

Why must the default kubelet --root-dir be used?

When I used the default --root-dir it was okay, but then it was not running, or there were no allocatable devices. As a result, I decided to change the directory.

sober-wang avatar Mar 20 '24 06:03 sober-wang

New log output:

2024/03/22 05:58:38 Starting OS watcher.
2024/03/22 05:58:49 hca_3.sock failed to be registered at Kubelet: RegisterPlugin error -- plugin registration failed with err: failed to dial device plugin with socketPath /var/lib/kubelet/plugins_registry/hca_3.sock: failed to dial device plugin: context deadline exceeded; restarting.

Daemonset

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rdma-shared-dp-ds
  namespace: cni-plugin
spec:
  selector:
    matchLabels:
      name: rdma-shared-dp-ds
  template:
    metadata:
      labels:
        name: rdma-shared-dp-ds
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: rdma
                    operator: In
                    values:
                      - sugon
      hostNetwork: true
      priorityClassName: system-node-critical
      containers:
      - image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin
        name: k8s-rdma-shared-dp-ds
        imagePullPolicy: IfNotPresent
        #securityContext:
        #  privileged: true
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: plugins-registry
            mountPath: /var/lib/kubelet/plugins_registry
          - name: config
            mountPath: /k8s-rdma-shared-dev-plugin
          - name: devs
            mountPath: /dev/
      volumes:
        - name: device-plugin
          hostPath:
            path: /data/kubelet/device-plugins
        - name: plugins-registry
          hostPath:
            path: /data/kubelet/plugins_registry
        - name: config
          configMap:
            name: rdma-devices
            items:
            - key: config.json
              path: config.json
        - name: devs
          hostPath:
            path: /dev/
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: cni-plugin
data:
  config.json: |
    {
        "periodicUpdateInterval": 300,
        "configList": [{
             "resourceName": "hca_3",
             "rdmaHcaMax": 1000,
             "selectors": {
                "ifNames": ["ens24np0"]
             }
           }
        ]
    }

kubelet start arguments

/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --hostname-override=gpu-186 --network-plugin=cni --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.6 --root-dir=/data/kubelet --node-ip=192.168.1.9 --max-pods=20 -v=4

kubelet version v1.23.0

kubelet log

Mar 22 13:58:38 GPU-186 kubelet[2736666]: I0322 13:58:38.523704 2736666 plugin_watcher.go:203] "Adding socket path or updating timestamp to desired state cache" path="/data/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:39 GPU-186 kubelet[2736666]: I0322 13:58:39.470223 2736666 reconciler.go:160] "OperationExecutor.RegisterPlugin started" plugin={SocketPath:/data/kubelet/plugins_registry/hca_3.sock Timestamp:2024-03-22 13:58:38.523730272 +0800 CST m=+75.689085324 Handler:<nil> Name:}
Mar 22 13:58:39 GPU-186 kubelet[2736666]: I0322 13:58:39.471951 2736666 manager.go:308] "Got Plugin at endpoint with versions" plugin="rdma/hca_3" endpoint="/var/lib/kubelet/plugins_registry/hca_3.sock" versions=[v1alpha1 v1beta1]
Mar 22 13:58:39 GPU-186 kubelet[2736666]: I0322 13:58:39.472004 2736666 manager.go:325] "Registering plugin at endpoint" plugin="rdma/hca_3" endpoint="/var/lib/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:39 GPU-186 kubelet[2736666]: W0322 13:58:39.472247 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:40 GPU-186 kubelet[2736666]: W0322 13:58:40.473408 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:42 GPU-186 kubelet[2736666]: W0322 13:58:42.099541 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:44 GPU-186 kubelet[2736666]: W0322 13:58:44.243570 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:47 GPU-186 kubelet[2736666]: W0322 13:58:47.653129 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/hca_3.sock /var/lib/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:49 GPU-186 kubelet[2736666]: E0322 13:58:49.472945 2736666 endpoint.go:63] "Can't create new endpoint with socket path" err="failed to dial device plugin: context deadline exceeded" path="/var/lib/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:49 GPU-186 kubelet[2736666]: I0322 13:58:49.473931 2736666 plugin_watcher.go:215] "Removing socket path from desired state cache" path="/data/kubelet/plugins_registry/hca_3.sock"
Mar 22 13:58:49 GPU-186 kubelet[2736666]: E0322 13:58:49.474152 2736666 goroutinemap.go:150] Operation for "/data/kubelet/plugins_registry/hca_3.sock" failed. No retries permitted until 2024-03-22 13:58:49.974116125 +0800 CST m=+87.139471161 (durationBeforeRetry 500ms). Error: RegisterPlugin error -- plugin registration failed with err: failed to dial device plugin with socketPath /var/lib/kubelet/plugins_registry/hca_3.sock: failed to dial device plugin: context deadline exceeded: rpc error: code = Unavailable desc = error reading from server: EOF
Mar 22 13:58:49 GPU-186 kubelet[2736666]: W0322 13:58:49.474224 2736666 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/data/kubelet/plugins_registry/hca_3.sock /data/kubelet/plugins_registry/hca_3.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /data/kubelet/plugins_registry/hca_3.sock: connect: no such file or directory". Reconnecting...
Mar 22 13:58:50 GPU-186 kubelet[2736666]: I0322 13:58:50.476259 2736666 reconciler.go:143] "OperationExecutor.UnregisterPlugin started" plugin={SocketPath:/data/kubelet/plugins_registry/hca_3.sock Timestamp:2024-03-22 13:58:38.523730272 +0800 CST m=+75.689085324 Handler:0xc000630000 Name:rdma/hca_3}

Showing plugins_registry under the kubelet --root-dir=/data/kubelet:

root@GPU-186:/data/kubelet# tree /data/kubelet/plugins_registry/
/data/kubelet/plugins_registry/
└── nfs.csi.k8s.io-reg.sock

The rdma-shared-dev-plugin did not create its socket file in the /data/kubelet/plugins_registry directory.
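To pin down where the socket actually lands, one diagnostic is to list the registry directory from both sides of the mount. This is a sketch; `compare_registry` is a helper invented here, and the default path, pod name, and namespace are just the ones appearing earlier in this thread:

```shell
# List the plugin-registry directory as the host sees it, and (when kubectl
# is available) as the device-plugin container sees it. If the plugin wrote
# its socket, it should show up in both listings.
compare_registry() {
  host_dir=${1:-/data/kubelet/plugins_registry}
  echo "host: $host_dir"
  ls "$host_dir" 2>/dev/null || echo "(missing on host)"
  if command -v kubectl >/dev/null 2>&1; then
    echo "container: /var/lib/kubelet/plugins_registry"
    kubectl -n cni-plugin exec rdma-shared-dp-ds-6jknp -- \
      ls /var/lib/kubelet/plugins_registry 2>/dev/null || echo "(exec failed)"
  fi
}

compare_registry
```

If the socket appears on the container side but not under the host's /data/kubelet/plugins_registry, the hostPath mapping is not the one the plugin is writing through.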

sober-wang avatar Mar 22 '24 06:03 sober-wang