
sriovdp won't reconnect to kubelet if connection is gone

lynic opened this issue 3 years ago • 13 comments

After sriovdp starts, it registers with the kubelet and waits for a connection. Once the kubelet connects back, netstat looks like this:

root [ /home/capv ]# netstat -anlp|grep plugins_registry
unix  2      [ ACC ]     STREAM     LISTENING     223162497 29514/sriovdp       /var/lib/kubelet/plugins_registry/intel.com_rs_pool_1.sock
unix  2      [ ACC ]     STREAM     LISTENING     223162499 29514/sriovdp       /var/lib/kubelet/plugins_registry/intel.com_rs_pool_2.sock
unix  3      [ ]         STREAM     CONNECTED     223161510 29514/sriovdp       /var/lib/kubelet/plugins_registry/intel.com_rs_pool_1.sock
unix  3      [ ]         STREAM     CONNECTED     223162501 29514/sriovdp       /var/lib/kubelet/plugins_registry/intel.com_rs_pool_2.sock

Now, if the connection between sriovdp and the kubelet is somehow dropped, sriovdp never registers again and the kubelet never connects back. I can trigger this by deleting the sock file, after which the CONNECTED session is gone. From that point on, unless sriovdp or the kubelet is restarted, the kubelet won't connect back to sriovdp, and the resource count on the worker node stays at "0":

root [ /home/capv ]# rm /var/lib/kubelet/plugins_registry/intel.com_rs_pool_2.sock
root [ /home/capv ]# netstat -anlp|grep plugins_registry
unix  2      [ ACC ]     STREAM     LISTENING     223162497 29514/sriovdp       /var/lib/kubelet/plugins_registry/intel.com_rs_pool_1.sock
unix  2      [ ACC ]     STREAM     LISTENING     223162499 29514/sriovdp       /var/lib/kubelet/plugins_registry/intel.com_rs_pool_2.sock
unix  3      [ ]         STREAM     CONNECTED     223161510 29514/sriovdp       /var/lib/kubelet/plugins_registry/intel.com_rs_pool_1.sock

Expected behavior: sriovdp should re-register with the kubelet if the connection is disrupted.

lynic avatar Feb 26 '22 05:02 lynic

I will take a look

rollandf avatar Mar 07 '22 16:03 rollandf

/cc @bn222

zshi-redhat avatar Mar 08 '22 01:03 zshi-redhat

Any updates? We've seen this issue on some RT kernel environments and the root cause hasn't been identified yet, but the steps provided above can be used to reproduce it manually.

lynic avatar Mar 29 '22 06:03 lynic

@rollandf Did you take a look or do you want me to follow up on this?

bn222 avatar Mar 29 '22 08:03 bn222

Hey @bn222, I haven't started looking at it yet.

rollandf avatar Mar 29 '22 09:03 rollandf

@lynic We could add a watcher to check whether the socket file still exists and, if not, restart the connection. Something similar is done in the rdma-shared device plugin here.

But are we sure this is what happens on your setup? Is the socket file missing when you see the issue?

rollandf avatar Apr 19 '22 08:04 rollandf
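For reference, a minimal sketch in Go of the watcher idea suggested above, using fsnotify (the mechanism the rdma-shared device plugin relies on for this). watchSocket and the restart callback are illustrative assumptions, not sriovdp's actual code:

package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// watchSocket blocks until the given socket file is removed or renamed,
// then calls restart so the plugin can re-create the socket and
// re-register with the kubelet. restart is a hypothetical callback.
func watchSocket(socketPath string, restart func()) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	if err := watcher.Add(socketPath); err != nil {
		return err
	}

	for {
		select {
		case event := <-watcher.Events:
			// The socket disappearing is the trigger to re-register.
			if event.Op&(fsnotify.Remove|fsnotify.Rename) != 0 {
				log.Printf("socket %s removed, restarting plugin", socketPath)
				restart()
				return nil
			}
		case err := <-watcher.Errors:
			log.Printf("fsnotify error: %v", err)
		}
	}
}

Note this only catches the case where the file itself is deleted; as discussed in the next comment, it would not detect a lost connection while the file still exists.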

The sock file still exists; only the "CONNECTED" session is lost. That means the kubelet somehow didn't connect back to sriovdp, so sriovdp probably needs to re-register with the kubelet (the same effect as restarting the sriovdp process).

lynic avatar Apr 19 '22 08:04 lynic
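Since the file can survive while the session is lost, a file watcher alone would not catch this case. One signal the plugin does have is the ListAndWatch stream itself: the kubelet holds it open, and gRPC cancels the stream's context when the connection drops. A hedged sketch against the v1beta1 device-plugin API; the updates and restartCh channels are assumed plumbing, not part of sriovdp:

package main

import (
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type server struct {
	devices   []*pluginapi.Device
	updates   chan []*pluginapi.Device // assumed source of device state changes
	restartCh chan struct{}            // assumed: tells the main loop to re-register
}

// ListAndWatch streams device state to the kubelet. If the kubelet's
// end of the stream goes away, the stream context is canceled and we
// signal the main loop to restart the server and re-register.
func (s *server) ListAndWatch(_ *pluginapi.Empty, stream pluginapi.DevicePlugin_ListAndWatchServer) error {
	if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: s.devices}); err != nil {
		return err
	}
	for {
		select {
		case devs := <-s.updates:
			if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: devs}); err != nil {
				return err
			}
		case <-stream.Context().Done():
			// Connection to the kubelet is gone: trigger re-registration.
			s.restartCh <- struct{}{}
			return stream.Context().Err()
		}
	}
}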

Do you have any logs in kubelet? If I restart the kubelet, I see that the socket files are deleted and the connections are re-initiated later. Seems to follow this.

I checked some other device plugins and I did not see an implementation of watching the connection state and re-registering.

rollandf avatar Apr 19 '22 09:04 rollandf

Also, do you have logs from the device plugin pod?

rollandf avatar Apr 20 '22 08:04 rollandf

@lynic the entity in charge of initiating the connection is the kubelet [1].

The SR-IOV device plugin exposes two gRPC services on that socket (see the sketch after the references below):

  1. plugin registration [2]
  2. device plugin [3]

[1] https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet/pluginmanager/pluginwatcher
[2] https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubelet/pkg/apis/pluginregistration/v1/api.proto
[3] https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/api.proto
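Roughly, serving both of those services on the plugin socket looks like the sketch below (illustrative only; concrete implementations of the two proto services are assumed to exist):

package main

import (
	"net"
	"os"

	"google.golang.org/grpc"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
	registerapi "k8s.io/kubelet/pkg/apis/pluginregistration/v1"
)

// serve exposes both gRPC services on one unix socket under
// /var/lib/kubelet/plugins_registry/. The kubelet's pluginwatcher
// discovers the socket and dials in; the plugin never dials the kubelet,
// which is why a dropped connection must be recovered from the kubelet side.
func serve(socketPath string, reg registerapi.RegistrationServer, dp pluginapi.DevicePluginServer) error {
	os.Remove(socketPath) // clear a stale socket from a previous run
	lis, err := net.Listen("unix", socketPath)
	if err != nil {
		return err
	}
	s := grpc.NewServer()
	registerapi.RegisterRegistrationServer(s, reg) // plugin registration [2]
	pluginapi.RegisterDevicePluginServer(s, dp)    // device plugin [3]
	return s.Serve(lis)
}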

adrianchiris avatar Apr 20 '22 09:04 adrianchiris

Yes, restarting either sriovdp or the kubelet recovers the connection; restarting sriovdp is easier. Sorry, the kubelet logs were flushed. The sriovdp logs look normal, just with no new entries after a certain timestamp. Pods then fail to schedule with "Warning FailedScheduling 177m default-scheduler 0/5 nodes are available: 1 Insufficient intel.com/intel_sriov,"

I0204 06:24:56.429751   12731 server.go:131] ListAndWatch(intel_acc100_fec) invoked
I0204 06:24:56.429774   12731 server.go:139] ListAndWatch(intel_acc100_fec): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:13:00.0,Health:Healthy,Topology:nil,},},}
I0204 06:24:56.429870   12731 server.go:131] ListAndWatch(pci_sriov_net_ca) invoked
I0204 06:24:56.429876   12731 server.go:106] Plugin: intel.com_pci_sriov_net_fh0.sock gets registered successfully at Kubelet
I0204 06:24:56.429875   12731 server.go:139] ListAndWatch(pci_sriov_net_ca): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:14:00.0,Health:Healthy,Topology:nil,},},}
I0204 06:24:56.429925   12731 server.go:106] Plugin: intel.com_pci_sriov_net_ca.sock gets registered successfully at Kubelet
I0204 06:24:56.429924   12731 server.go:131] ListAndWatch(pci_sriov_net_f1c) invoked
I0204 06:24:56.429953   12731 server.go:106] Plugin: intel.com_pci_sriov_net_f1c.sock gets registered successfully at Kubelet
I0204 06:24:56.429931   12731 server.go:139] ListAndWatch(pci_sriov_net_f1c): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:04:00.0,Health:Healthy,Topology:nil,},},}
I0204 06:24:56.430010   12731 server.go:131] ListAndWatch(pci_sriov_net_fh0m) invoked
I0204 06:24:56.430014   12731 server.go:139] ListAndWatch(pci_sriov_net_fh0m): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:1b:00.0,Health:Healthy,Topology:nil,},},}
I0204 06:24:56.430051   12731 server.go:131] ListAndWatch(pci_sriov_net_fh0) invoked
I0204 06:24:56.430066   12731 server.go:131] ListAndWatch(pci_sriov_net_f1u) invoked
I0204 06:24:56.430070   12731 server.go:139] ListAndWatch(pci_sriov_net_f1u): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:1c:00.0,Health:Healthy,Topology:nil,},},}
I0204 06:24:56.430055   12731 server.go:139] ListAndWatch(pci_sriov_net_fh0): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:0c:00.0,Health:Healthy,Topology:nil,},},}

lynic avatar Apr 20 '22 13:04 lynic

So, the sriovdp pods are not running? Let's start by understanding why. Are the worker nodes healthy? Do you have node selectors on the sriovdp pods? Do you have workers that match the node selector?

Also, in the original issue, had the sriovdp pods stopped running?

rollandf avatar Apr 21 '22 05:04 rollandf

Hi @lynic is this issue still active?

SchSeba avatar Aug 11 '22 09:08 SchSeba

Closing this issue; feel free to reopen if needed.

SchSeba avatar Dec 21 '23 08:12 SchSeba