sriov-network-device-plugin
sriov-network-device-plugin can't expose resource in node
What happened?
After deploying this plugin, I can't see any SR-IOV resources on my nodes. The plugin also doesn't seem to connect to the kubelet: the line `Plugin: mellanox.com/mlnx_sriov_rdma gets registered successfully at Kubelet` does not appear in the logs below.
What did you expect to happen?
The SR-IOV resource should be exposed on the node.
What are the minimal steps needed to reproduce the bug?
- Configure SR-IOV on the nodes
- Deploy SR-IOV CNI
- Deploy sriov-network-device-plugin
- Deploy Multus CNI
Anything else we need to know?
Component Versions
Please fill in the below table with the version numbers of components used.
| Component | Version |
|---|---|
| SR-IOV Network Device Plugin | v3.7.0 |
| SR-IOV CNI Plugin | v2.8.0 |
| Multus | v4.1.0 |
| Kubernetes | 1.23.6 |
| OS | CentOS 8.2.2004 |
Config Files
Config file locations may be config dependent.
Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')
CNI config (Try '/etc/cni/net.d/')
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
Kubeconfig file
SR-IOV Network Custom Resource Definition
Logs
SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)
I0815 07:35:27.139922 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0815 07:35:27.140181 1 main.go:46] resource manager reading configs
I0815 07:35:27.140209 1 manager.go:86] raw ResourceList: { "resourceList": [ { "resourceName": "mlnx_sriov_rdma", "resourcePrefix": "mellanox.com", "selectors": { "vendors": ["15b3"], "devices": ["101c"], "driver": "mlx5_core", "isRdma": true } } ] }
I0815 07:35:27.140303 1 factory.go:211] *types.NetDeviceSelectors for resource mlnx_sriov_rdma is [0xc00023f0e0]
I0815 07:35:27.140315 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox.com ResourceName:mlnx_sriov_rdma DeviceType:netDevice ExcludeTopology:false Selectors:0xc000190b28 AdditionalInfo:map[] SelectorObjs:[0xc00023f0e0]}]
I0815 07:35:27.140347 1 manager.go:217] validating resource name "mellanox.com/mlnx_sriov_rdma"
I0815 07:35:27.140354 1 main.go:62] Discovering host devices
I0815 07:35:28.124721 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.127282 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127429 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127538 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127629 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127721 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.129909 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130018 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130121 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130213 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130335 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:68:00.0 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.130477 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:68:00.1 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.130597 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.132753 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.132855 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.132941 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.133047 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.133140 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135266 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135361 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135464 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135548 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135638 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135656 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135661 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135666 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135673 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135679 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135684 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135690 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135694 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135699 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135703 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:68:00.0 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.135708 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:68:00.1 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.135713 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135720 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135726 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135730 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135735 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135742 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135747 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135752 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135757 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135761 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135765 1 main.go:68] Initializing resource servers
I0815 07:35:28.135772 1 manager.go:117] number of config: 1
I0815 07:35:28.135785 1 manager.go:121] Creating new ResourcePool: mlnx_sriov_rdma
I0815 07:35:28.135789 1 manager.go:122] DeviceType: netDevice
W0815 07:35:28.149419 1 pciNetDevice.go:74] RDMA resources for 0000:68:00.0 not found. Are RDMA modules loaded?
W0815 07:35:28.149783 1 pciNetDevice.go:74] RDMA resources for 0000:68:00.1 not found. Are RDMA modules loaded?
I0815 07:35:28.156081 1 manager.go:138] initServers(): selector index 0 will register 16 devices
I0815 07:35:28.156097 1 factory.go:124] device added: [identifier: 0000:05:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156105 1 factory.go:124] device added: [identifier: 0000:05:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156110 1 factory.go:124] device added: [identifier: 0000:05:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156115 1 factory.go:124] device added: [identifier: 0000:05:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156118 1 factory.go:124] device added: [identifier: 0000:47:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156122 1 factory.go:124] device added: [identifier: 0000:47:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156126 1 factory.go:124] device added: [identifier: 0000:47:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156130 1 factory.go:124] device added: [identifier: 0000:47:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156135 1 factory.go:124] device added: [identifier: 0000:8e:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156139 1 factory.go:124] device added: [identifier: 0000:8e:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156143 1 factory.go:124] device added: [identifier: 0000:8e:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156146 1 factory.go:124] device added: [identifier: 0000:8e:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156150 1 factory.go:124] device added: [identifier: 0000:d2:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156154 1 factory.go:124] device added: [identifier: 0000:d2:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156158 1 factory.go:124] device added: [identifier: 0000:d2:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156162 1 factory.go:124] device added: [identifier: 0000:d2:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156191 1 manager.go:156] New resource server is created for mlnx_sriov_rdma ResourcePool
I0815 07:35:28.156199 1 main.go:74] Starting all servers...
I0815 07:35:28.156803 1 server.go:254] starting mlnx_sriov_rdma device plugin endpoint at: mellanox.com_mlnx_sriov_rdma.sock
I0815 07:35:28.156947 1 main.go:79] All servers started.
I0815 07:35:28.156954 1 main.go:80] Listening for term signals
Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)
Hey, can you check the kubelet logs?
Also, is kubelet service defined with a --root-dir param?
@rollandf Below is the kubelet.service
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
[Service]
WorkingDirectory=/var/lib/kubelet
ExecStartPre=/etc/kubernetes/kubelet-precheck.sh
ExecStart=/usr/bin/kubelet-1.23.6 \
--kubeconfig=/etc/kubernetes/admin.kubeconfig \
--config=/etc/kubernetes/kubelet-config.yaml \
--hostname-override=10.32.13.1 \
\
--container-runtime=remote \
--runtime-request-timeout=15m \
--container-runtime-endpoint=unix:///run/containerd/containerd.sock \
\
--network-plugin=cni \
--root-dir=/data/kubelet \
--v=2 \
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
> Hey, can you check the kubelet logs?

How do I filter the logs to find the useful messages?
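One possible way to narrow the kubelet journal down is to keep only the lines that mention the resource name or the kubelet plugin directories. The following is a minimal sketch, not an official tool: the function name `filter_plugin_lines` and the choice of keywords are assumptions made for illustration.

```python
import re


def filter_plugin_lines(lines, resource="mlnx_sriov_rdma"):
    """Keep log lines that mention the resource or the kubelet plugin directories.

    `resource` defaults to the resource name from this issue; the two directory
    names are the ones discussed in this thread.
    """
    pattern = re.compile(re.escape(resource) + r"|plugins_registry|device-plugins")
    return [line for line in lines if pattern.search(line)]
```

To use it, save the output of `journalctl -u kubelet` to a file and pass the opened file (or any iterable of lines) to `filter_plugin_lines`.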
The issue seems to be with the use of root-dir
See similar discussion here:
https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/issues/96
BTW, how did you install the cluster? Did you configure the root-dir yourself, or is it a default?
@rollandf Another question: is my configmap.yaml fine for this plugin?
At first glance, it seems OK.
Do you know where the root-dir definition comes from?
@rollandf root-dir is defined when kubelet is installed with parms. I guess the kubelet only watches the plugins_registry directory under its root-dir, so I don't get any log line like `"Registering plugin at endpoint" plugin="mellanox.com/mlnx_sriov_rdma" endpoint="/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock"` in the kubelet logs.
By the way, my root-dir is at /data/kubelet
What is parms? Any links?
For now, try mounting the directories under the new root `/data/kubelet` in the deployment YAML:
- `/data/kubelet/device-plugins` here:
  https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/master/deployments/sriovdp-daemonset.yaml#L62
- `/data/kubelet/plugins_registry` here:
  https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/master/deployments/sriovdp-daemonset.yaml#L65
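As a rough illustration of which host paths change, the sketch below derives the two hostPath entries from a kubelet root-dir. The helper name `kubelet_plugin_volumes` and the volume names `devicesock` and `plugins-registry` are assumptions for illustration; check them against the actual DaemonSet manifest before editing it.

```python
from pathlib import PurePosixPath


def kubelet_plugin_volumes(root_dir="/var/lib/kubelet"):
    """Return hostPath volume entries for the two kubelet plugin directories.

    The volume names are illustrative; only the directory names
    (device-plugins, plugins_registry) come from this thread.
    """
    root = PurePosixPath(root_dir)
    return [
        {"name": "devicesock", "hostPath": {"path": str(root / "device-plugins")}},
        {"name": "plugins-registry", "hostPath": {"path": str(root / "plugins_registry")}},
    ]
```

Calling `kubelet_plugin_volumes("/data/kubelet")` gives the host paths for the root-dir used in this issue; the default argument gives the stock `/var/lib/kubelet` layout.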
@rollandf Sure. But I still need to run `ln -s /data/kubelet/plugins_registry /var/lib/kubelet/plugins_registry`, and I can't find the reason. My guess is that the device plugin tells the kubelet that plugins_registry is in `/var/lib/kubelet`, which is hardcoded in the container, while the real device plugin socket is in the root-dir (`/data/kubelet`).
Aug 16 14:45:27 10.32.13.1 kubelet-1.23.6[587969]: I0816 14:45:27.456142 587969 manager.go:325] "Registering plugin at endpoint" plugin="mellanox.com/mlnx_sriov_rdma" endpoint="/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock"
Aug 16 14:45:27 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:27.456307 587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:28 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:28.456559 587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:30 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:30.045068 587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:32 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:32.244564 587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:36 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:36.863409 587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:37 10.32.13.1 kubelet-1.23.6[587969]: E0816 14:45:37.456842 587969 endpoint.go:63] "Can't create new endpoint with socket path" err="failed to dial device plugin: context deadline exceeded" path="/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock"
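The errors above are consistent with a plain path mismatch: the kubelet dials the socket under `/var/lib/kubelet/plugins_registry`, while the file lives under `/data/kubelet/plugins_registry`. The following minimal sketch reproduces that lookup in a scratch directory, using a regular file as a stand-in for the socket (both are assumptions for illustration, and `demo_symlink_workaround` is a hypothetical name), to show why the symlink makes the path resolve.

```python
import os
import tempfile

SOCK = "mellanox.com_mlnx_sriov_rdma.sock"


def demo_symlink_workaround():
    """Return (exists_before, exists_after) for the advertised socket path."""
    with tempfile.TemporaryDirectory() as host:
        # Directory where the socket really lives (kubelet root-dir).
        real_dir = os.path.join(host, "data/kubelet/plugins_registry")
        # Directory the advertised endpoint points at (default kubelet path).
        default_dir = os.path.join(host, "var/lib/kubelet/plugins_registry")
        os.makedirs(real_dir)
        os.makedirs(os.path.dirname(default_dir))
        # A plain file is enough to model the path lookup.
        open(os.path.join(real_dir, SOCK), "w").close()

        advertised = os.path.join(default_dir, SOCK)
        before = os.path.exists(advertised)
        # Equivalent of: ln -s <real_dir> <default_dir>
        os.symlink(real_dir, default_dir)
        after = os.path.exists(advertised)
        return before, after
```

The first value corresponds to the "no such file or directory" state in the log, the second to the state after the `ln -s` workaround.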
Hi @jeffreyyjp, can't you update the volume mounts for the device plugin container?
I have set the "enhancement" label on this issue, since the device plugin has never supported an alternative kubelet root dir.
After seeing @jeffreyyjp's latest comment, I believe it's not enough to update the mounts.
I believe this is because of how we do plugin registration. We set the endpoint to the path within the container in the PluginInfo message, which is part of the GetInfo call. See [1][2].
[1] https://github.com/kubernetes/kubernetes/blob/cb7b4ea648a97bdbf8f4f1b8655a7a110c9f78d0/staging/src/k8s.io/kubelet/pkg/apis/pluginregistration/v1/api.proto#L31
[2] https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/7dedc64cd1b89275059f33d7a2ecae9e03388e79/pkg/resources/server.go#L107
I think that if we leave the Endpoint field unset, Kubernetes will use the same socket path it used for registration.
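A simplified model of that idea, assuming the kubelet falls back to the socket path it discovered when the Endpoint field is empty (this is the behaviour suggested in the comment, not something verified here). The `PluginInfo` class below is a reduced stand-in for the real message, and `resolve_endpoint` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class PluginInfo:
    """Reduced stand-in for the PluginInfo registration message."""
    type: str
    name: str
    endpoint: str = ""  # empty means "not set by the plugin"


def resolve_endpoint(info, discovered_socket):
    """Path the kubelet would dial under the assumed fallback rule."""
    return info.endpoint or discovered_socket


discovered = "/data/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock"
hardcoded = "/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock"

# Endpoint set to the container-side path: the kubelet dials a path that
# does not exist on a host with a custom root-dir.
with_endpoint = PluginInfo("DevicePlugin", "mellanox.com/mlnx_sriov_rdma", hardcoded)
# Endpoint left unset: the kubelet reuses the path where it found the socket.
without_endpoint = PluginInfo("DevicePlugin", "mellanox.com/mlnx_sriov_rdma")
```

Under this model, `resolve_endpoint(with_endpoint, discovered)` yields the `/var/lib/kubelet` path seen in the failing kubelet log, while `resolve_endpoint(without_endpoint, discovered)` yields the `/data/kubelet` path.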
@SchSeba I already updated the host paths in my volume mounts, but I still need to run `ln -s /data/kubelet/plugins_registry /var/lib/kubelet/plugins_registry` on my host (not in the container). After that, everything is fine.