sriov-network-device-plugin
sriovdp won't reconnect to kubelet if connection is gone
After sriovdp starts, it registers with kubelet and waits for a connection. Once kubelet connects back, netstat looks like:
root [ /home/capv ]# netstat -anlp|grep plugins_registry
unix 2 [ ACC ] STREAM LISTENING 223162497 29514/sriovdp /var/lib/kubelet/plugins_registry/intel.com_rs_pool_1.sock
unix 2 [ ACC ] STREAM LISTENING 223162499 29514/sriovdp /var/lib/kubelet/plugins_registry/intel.com_rs_pool_2.sock
unix 3 [ ] STREAM CONNECTED 223161510 29514/sriovdp /var/lib/kubelet/plugins_registry/intel.com_rs_pool_1.sock
unix 3 [ ] STREAM CONNECTED 223162501 29514/sriovdp /var/lib/kubelet/plugins_registry/intel.com_rs_pool_2.sock
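For context, each LISTENING entry above is a unix socket the plugin creates under /var/lib/kubelet/plugins_registry/ and serves gRPC on, and each CONNECTED entry is a kubelet connection to it. A minimal Go sketch of that pattern (illustrative only, not the actual sriovdp code):

```go
// Minimal sketch, not the actual sriovdp implementation: create a unix socket
// under plugins_registry and serve gRPC on it. The socket shows up as
// LISTENING in netstat; each kubelet connection shows up as CONNECTED.
package sketch

import (
	"net"
	"os"
	"path/filepath"

	"google.golang.org/grpc"
)

func serveOnPluginsRegistry(resource string) (*grpc.Server, error) {
	sockPath := filepath.Join("/var/lib/kubelet/plugins_registry", "intel.com_"+resource+".sock")
	// Remove any stale socket left over from a previous run.
	_ = os.Remove(sockPath)

	lis, err := net.Listen("unix", sockPath)
	if err != nil {
		return nil, err
	}
	srv := grpc.NewServer()
	// The registration and device plugin gRPC services would be registered on srv here.
	go srv.Serve(lis)
	return srv, nil
}
```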
Now, if the connection between sriovdp and kubelet is somehow lost, sriovdp will not re-register and kubelet will not connect back.
I can trigger this by deleting the sock file, so that the CONNECTED session is gone. After that, unless sriovdp or kubelet is restarted, kubelet will not connect back to sriovdp, and the resource count on the worker node stays at "0":
root [ /home/capv ]# rm /var/lib/kubelet/plugins_registry/intel.com_rs_pool_2.sock
root [ /home/capv ]# netstat -anlp|grep plugins_registry
unix 2 [ ACC ] STREAM LISTENING 223162497 29514/sriovdp /var/lib/kubelet/plugins_registry/intel.com_rs_pool_1.sock
unix 2 [ ACC ] STREAM LISTENING 223162499 29514/sriovdp /var/lib/kubelet/plugins_registry/intel.com_rs_pool_2.sock
unix 3 [ ] STREAM CONNECTED 223161510 29514/sriovdp /var/lib/kubelet/plugins_registry/intel.com_rs_pool_1.sock
Expected behavior: sriovdp should re-register with kubelet if the connection is disrupted.
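One way the plugin could notice the disruption (a hypothetical sketch, not existing sriovdp code) is that the ListAndWatch stream context is cancelled when kubelet drops the connection, which could then trigger re-registration:

```go
// Hypothetical sketch: detect that kubelet dropped the connection by waiting
// on the ListAndWatch stream context, then signal a re-registration loop.
// The reRegister channel is illustrative, not an existing sriovdp field.
package sketch

import (
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type devicePlugin struct {
	devices    []*pluginapi.Device
	reRegister chan struct{} // assumed to be drained by a re-registration loop
}

func (dp *devicePlugin) ListAndWatch(_ *pluginapi.Empty, stream pluginapi.DevicePlugin_ListAndWatchServer) error {
	// Send the current device list once; a real plugin also resends on changes.
	if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: dp.devices}); err != nil {
		return err
	}
	// The stream context is cancelled when kubelet disconnects or the server stops.
	<-stream.Context().Done()
	dp.reRegister <- struct{}{}
	return stream.Context().Err()
}
```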
I will take a look
/cc @bn222
Any updates? We have seen this issue on some RT kernel environments and the root cause has not been identified yet, but the steps above can be used to manually reproduce it.
@rollandf Did you take a look or do you want me to follow up on this?
Hey @bn222, I have not started looking at it yet.
@lynic We can add a watcher to check if the socket file exists and, if not, restart the connection. Something similar is done in the rdma-shared device plugin here.
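A minimal sketch of that watcher idea, assuming the fsnotify package and a caller-supplied restart function (both hypothetical here, not existing sriovdp code):

```go
// Illustrative sketch of "watch the socket file and restart the connection".
// restartServer is a hypothetical helper supplied by the caller.
package sketch

import (
	"log"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

func watchSocket(sockPath string, restartServer func() error) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	// Watch the directory; a watch on the file itself stops working once it is removed.
	if err := watcher.Add(filepath.Dir(sockPath)); err != nil {
		return err
	}
	for {
		select {
		case event := <-watcher.Events:
			if event.Name == sockPath && event.Op&fsnotify.Remove != 0 {
				log.Printf("socket %s removed, restarting server", sockPath)
				if err := restartServer(); err != nil {
					return err
				}
			}
		case err := <-watcher.Errors:
			log.Printf("watcher error: %v", err)
		}
	}
}
```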
But are we sure this is what happens on your setup? Is the socket file missing when you see the issue?
The sock file still exists; only the "CONNECTED" session is lost, meaning kubelet somehow did not connect back to sriovdp, so sriovdp probably needs to re-register with kubelet (the same effect as restarting the sriovdp process).
Do you have any logs in kubelet? If I restart kubelet, I see that the socket files are deleted and the connections are re-initiated later. It seems to follow this.
I checked some other device plugins and I did not see an implementation of watching the connection state and re-registering.
Also, do you have logs from the device plugin pod?
@lynic the entity in charge of initiating the connection is kubelet [1].
The SR-IOV device plugin exposes two gRPC services on that socket:
- plugin registration [2]
- device plugin [3]
[1] https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet/pluginmanager/pluginwatcher
[2] https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubelet/pkg/apis/pluginregistration/v1/api.proto
[3] https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/api.proto
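Roughly, both services are served by the same gRPC server behind that socket; a simplified sketch using the kubelet API packages from [2] and [3] (the surrounding wiring is assumed, not actual sriovdp code):

```go
// Simplified sketch: serve the plugin registration service [2] and the device
// plugin service [3] on the same gRPC server. Not the actual sriovdp code.
package sketch

import (
	"context"

	"google.golang.org/grpc"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
	registerapi "k8s.io/kubelet/pkg/apis/pluginregistration/v1"
)

// registration implements the plugin registration service that kubelet's
// pluginwatcher [1] calls after it connects to the socket.
type registration struct {
	endpoint string // socket path under plugins_registry
	resource string // e.g. "intel.com/rs_pool_1"
}

func (r *registration) GetInfo(ctx context.Context, _ *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	return &registerapi.PluginInfo{
		Type:              registerapi.DevicePlugin,
		Name:              r.resource,
		Endpoint:          r.endpoint,
		SupportedVersions: []string{pluginapi.Version},
	}, nil
}

func (r *registration) NotifyRegistrationStatus(ctx context.Context, st *registerapi.RegistrationStatus) (*registerapi.RegistrationStatusResponse, error) {
	// kubelet reports here whether the registration succeeded.
	return &registerapi.RegistrationStatusResponse{}, nil
}

// registerServices wires both services onto one gRPC server; dp is a full
// DevicePluginServer implementation assumed to exist elsewhere.
func registerServices(srv *grpc.Server, r *registration, dp pluginapi.DevicePluginServer) {
	registerapi.RegisterRegistrationServer(srv, r)
	pluginapi.RegisterDevicePluginServer(srv, dp)
}
```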
Yes, restarting either sriovdp or kubelet recovers the connection; restarting sriovdp is easier. Sorry, the kubelet logs were already flushed. The sriovdp logs look normal, but there are no new entries after a certain timestamp. Pods fail to start with "Warning FailedScheduling 177m default-scheduler 0/5 nodes are available: 1 Insufficient intel.com/intel_sriov,"
I0204 06:24:56.429751 12731 server.go:131] ListAndWatch(intel_acc100_fec) invoked
I0204 06:24:56.429774 12731 server.go:139] ListAndWatch(intel_acc100_fec): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:13:00.0,Health:Healthy,Topology:nil,},},}
I0204 06:24:56.429870 12731 server.go:131] ListAndWatch(pci_sriov_net_ca) invoked
I0204 06:24:56.429876 12731 server.go:106] Plugin: intel.com_pci_sriov_net_fh0.sock gets registered successfully at Kubelet
I0204 06:24:56.429875 12731 server.go:139] ListAndWatch(pci_sriov_net_ca): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:14:00.0,Health:Healthy,Topology:nil,},},}
I0204 06:24:56.429925 12731 server.go:106] Plugin: intel.com_pci_sriov_net_ca.sock gets registered successfully at Kubelet
I0204 06:24:56.429924 12731 server.go:131] ListAndWatch(pci_sriov_net_f1c) invoked
I0204 06:24:56.429953 12731 server.go:106] Plugin: intel.com_pci_sriov_net_f1c.sock gets registered successfully at Kubelet
I0204 06:24:56.429931 12731 server.go:139] ListAndWatch(pci_sriov_net_f1c): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:04:00.0,Health:Healthy,Topology:nil,},},}
I0204 06:24:56.430010 12731 server.go:131] ListAndWatch(pci_sriov_net_fh0m) invoked
I0204 06:24:56.430014 12731 server.go:139] ListAndWatch(pci_sriov_net_fh0m): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:1b:00.0,Health:Healthy,Topology:nil,},},}
I0204 06:24:56.430051 12731 server.go:131] ListAndWatch(pci_sriov_net_fh0) invoked
I0204 06:24:56.430066 12731 server.go:131] ListAndWatch(pci_sriov_net_f1u) invoked
I0204 06:24:56.430070 12731 server.go:139] ListAndWatch(pci_sriov_net_f1u): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:1c:00.0,Health:Healthy,Topology:nil,},},}
I0204 06:24:56.430055 12731 server.go:139] ListAndWatch(pci_sriov_net_fh0): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:0c:00.0,Health:Healthy,Topology:nil,},},}
So, the sriovdp pods are not running? Let's start by understanding why. Are the worker nodes healthy? Do you have node selectors on the sriovdp pods? Do you have workers that fit the node selector?
Also, in the original issue, did the sriovdp pods stop running?
Hi @lynic, is this issue still active?
Closing this issue. Feel free to reopen if needed.