k8s-rdma-shared-dev-plugin
RDMA allocatable resources changed to 0 after kubelet restart
Version: v1.3.2
RDMA device plugin log:
As the log shows, a kubelet restart cancels the ListAndWatch stream context ("context canceled"), so ListAndWatch returns. Restart then blocks on the send to rs.stop, because the stop channel has size 0 (unbuffered) and ListAndWatch is no longer there to receive from it. The context listener was added in this PR: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/pull/51.
When the kubelet restarts, ListAndWatch already receives the event from the stop channel, so there is no need to watch the context as well. I fixed the bug by removing the context listener. If necessary, I can submit a PR.
func (rs *resourceServer) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	resp := new(pluginapi.ListAndWatchResponse)
	// Send initial list of devices
	if err := rs.sendDevices(resp, s); err != nil {
		return err
	}
	for {
		select {
		case <-s.Context().Done():
			log.Printf("ListAndWatch stream close: %v", s.Context().Err())
			return nil
		case <-rs.stop:
			return nil
		case d := <-rs.health:
			// FIXME: there is no way to recover from the Unhealthy state.
			d.Health = pluginapi.Unhealthy
			_ = s.Send(&pluginapi.ListAndWatchResponse{Devices: rs.devs})
		case <-rs.updateResource:
			if err := rs.sendDevices(resp, s); err != nil {
				// The old stream may not be closed properly, return to close it
				// and pass the update event to the new stream for processing
				rs.updateResource <- true
				return err
			}
		}
	}
}

func (rs *resourceServer) Restart() error {
	log.Printf("restarting %s device plugin server...", rs.resourceName)
	if rs.rsConnector == nil || rs.rsConnector.GetServer() == nil {
		return fmt.Errorf("grpc server instance not found for %s", rs.resourceName)
	}
	rs.rsConnector.Stop()
	rs.rsConnector.DeleteServer()
	// Send terminate signal to ListAndWatch()
	rs.stop <- true
	return rs.Start()
}
cc @adrianchiris
Which K8s version are you using?
I see this in the logs:
Using Deprecated Device Plugin Registry path
Does the following path exist on your system: /var/lib/kubelet/plugins_registry ?
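One way to answer that question from inside the plugin container is a small directory check. This is only a sketch: dirExists is a hypothetical helper, not part of the plugin, and it assumes the kubelet directory is mounted at the same path in the environment where it runs.

```go
package main

import (
	"fmt"
	"os"
)

// dirExists reports whether path exists and is a directory.
// Hypothetical helper, for illustration only.
func dirExists(path string) bool {
	info, err := os.Stat(path)
	if err != nil {
		// Covers "does not exist" as well as permission errors.
		return false
	}
	return info.IsDir()
}

func main() {
	// Path taken from the question above.
	const registry = "/var/lib/kubelet/plugins_registry"
	fmt.Printf("%s exists: %v\n", registry, dirExists(registry))
}
```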
Please check #82; it should solve the issue.
v1.4.0 is out, please check :)
@adrianchiris the v1.4.0 release seems to be broken; I cannot find the image:
docker pull nvcr.io/nvidia/cloud-native/k8s-rdma-shared-dev-plugin:v1.4.0
Error response from daemon: manifest for nvcr.io/nvidia/cloud-native/k8s-rdma-shared-dev-plugin:v1.4.0 not found: manifest unknown: manifest unknown
Nor does it appear here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/cloud-native/containers/k8s-rdma-shared-dev-plugin/tags
The image is here: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/pkgs/container/k8s-rdma-shared-dev-plugin