k8s-rdma-shared-dev-plugin

Is the ConnectX 6 network interface card supported? Why are the resource Capacity and Allocatable values 0 in my k8s cluster?

Open sober-wang opened this issue 1 year ago • 3 comments

Is the ConnectX 6 network interface card supported?

sober-wang, Jan 19 '24 03:01

Hi @sober-wang. ConnectX 6 is a supported NIC.

e0ne, Jan 19 '24 08:01

But the Capacity and Allocatable values in my k8s resource description are 0.

image


OS: Ubuntu 20.04
Kubernetes version: 1.23
kubelet --root-dir: /data/kubelet

The plugin configuration (image) and the workload (image).

root@gpu-11:~# ibdev2netdev 
mlx5_0 port 1 ==> ens12f0np0 (Down)
mlx5_1 port 1 ==> ens12f1np1 (Down)
mlx5_2 port 1 ==> ens24np0 (Up)
mlx5_3 port 1 ==> ens25np0 (Up)
mlx5_4 port 1 ==> bondYW (Up)
mlx5_5 port 1 ==> ens17f1np1 (Down)
mlx5_6 port 1 ==> bondYW (Up)
mlx5_7 port 1 ==> ens18f1np1 (Down)
mlx5_8 port 1 ==> ens30np0 (Up)
mlx5_9 port 1 ==> ens31np0 (Up)
root@gpu-11:~# mst status -v 
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf3      df:00.0   mlx5_9          net-ens31np0              1
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf2      a0:00.0   mlx5_8          net-ens30np0              1
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf1      72:00.0   mlx5_3          net-ens25np0              0
ConnectX6(rev:0)        /dev/mst/mt4123_pciconf0      58:00.0   mlx5_2          net-ens24np0              0
ConnectX4LX(rev:0)      /dev/mst/mt4117_pciconf2.1    83:00.1   mlx5_7          net-ens18f1np1            1
ConnectX4LX(rev:0)      /dev/mst/mt4117_pciconf2      83:00.0   mlx5_6          net-bondYW                1
ConnectX4LX(rev:0)      /dev/mst/mt4117_pciconf1.1    82:00.1   mlx5_5          net-ens17f1np1            1
ConnectX4LX(rev:0)      /dev/mst/mt4117_pciconf1      82:00.0   mlx5_4          net-bondYW                1
ConnectX4LX(rev:0)      /dev/mst/mt4117_pciconf0.1    18:00.1   mlx5_1          net-ens12f1np1            0
ConnectX4LX(rev:0)      /dev/mst/mt4117_pciconf0      18:00.0   mlx5_0          net-ens12f0np0            0

sober-wang, Jan 22 '24 05:01

@sober-wang, I think this might be related to your use of a custom root-dir for kubelet. If you're using the NVIDIA/Mellanox network-operator, it hardcodes the volume mounts for the pod that runs this service to the standard kubelet root path, so the plugin would not see kubelet's directories under /data/kubelet.

The Kubernetes manifest in this repository hardcodes the same path.
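To make the path mismatch concrete: kubelet keeps its device-plugin registration socket (kubelet.sock) under <root-dir>/device-plugins, so with --root-dir=/data/kubelet the plugin pod would need to mount /data/kubelet/device-plugins rather than /var/lib/kubelet/device-plugins. Here is a rough shell sketch for checking this on the node. The function names device_plugin_dir and check_kubelet_socket are hypothetical helpers written for illustration, not part of the plugin or the operator, and I'm assuming the mismatch above is really the cause.

```shell
# Hypothetical helper: print the device-plugin directory for a kubelet root dir.
# Falls back to kubelet's standard root dir when no argument is given.
device_plugin_dir() {
    root_dir="${1:-/var/lib/kubelet}"
    # Strip one trailing slash so the result has no double slash.
    root_dir="${root_dir%/}"
    printf '%s/device-plugins\n' "$root_dir"
}

# Hypothetical helper: report whether kubelet's registration socket exists
# in the device-plugin directory derived from the given root dir.
check_kubelet_socket() {
    dir="$(device_plugin_dir "$1")"
    if [ -S "$dir/kubelet.sock" ]; then
        echo "found: $dir/kubelet.sock"
    else
        echo "missing: $dir/kubelet.sock"
        return 1
    fi
}
```

Running check_kubelet_socket /data/kubelet on the node should find the socket if kubelet really uses that root dir. If it does, the hostPath volume in the plugin's DaemonSet would presumably need to point at the directory printed by device_plugin_dir /data/kubelet instead of the default one.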

Not really related, but same idea: https://github.com/kubernetes/kubernetes/issues/120626

tdg5, May 15 '24 17:05