
RDMA allocatable resources changed to 0 after kubelet restart

Open · WulixuanS opened this issue · 6 comments

Version: v1.3.2

RDMA device plugin log: (screenshot attached)

As can be seen from the log, when the kubelet restarts it triggers a context cancellation, and `Restart()` blocks because the stop channel's size is 0. The context listener was added in https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/pull/51.

When the kubelet restarts, `ListAndWatch` will receive the event from the stop channel, so there is no need to watch the context. I fixed the bug by removing the context listener; if necessary, I can submit a PR.

```go
func (rs *resourceServer) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	resp := new(pluginapi.ListAndWatchResponse)

	// Send initial list of devices
	if err := rs.sendDevices(resp, s); err != nil {
		return err
	}

	for {
		select {
		case <-s.Context().Done():
			log.Printf("ListAndWatch stream close: %v", s.Context().Err())
			return nil
		case <-rs.stop:
			return nil
		case d := <-rs.health:
			// FIXME: there is no way to recover from the Unhealthy state.
			d.Health = pluginapi.Unhealthy
			_ = s.Send(&pluginapi.ListAndWatchResponse{Devices: rs.devs})
		case <-rs.updateResource:
			if err := rs.sendDevices(resp, s); err != nil {
				// The old stream may not be closed properly, return to close it
				// and pass the update event to the new stream for processing
				rs.updateResource <- true
				return err
			}
		}
	}
}
```

```go
func (rs *resourceServer) Restart() error {
	log.Printf("restarting %s device plugin server...", rs.resourceName)
	if rs.rsConnector == nil || rs.rsConnector.GetServer() == nil {
		return fmt.Errorf("grpc server instance not found for %s", rs.resourceName)
	}

	rs.rsConnector.Stop()
	rs.rsConnector.DeleteServer()

	// Send terminate signal to ListAndWatch()
	rs.stop <- true

	return rs.Start()
}
```

WulixuanS · Jul 11 '23 03:07

cc @adrianchiris

WulixuanS · Jul 11 '23 03:07

What is the K8s version you are using?

I see in the logs:

`Using Deprecated Device Plugin Registry path`

Does the following path exist on your system: /var/lib/kubelet/plugins_registry ?

adrianchiris · Aug 13 '23 16:08

Please check #82; it should solve the issue.

adrianchiris · Aug 14 '23 14:08

v1.4.0 is out, please check :)

adrianchiris · Dec 31 '23 12:12

@adrianchiris the v1.4.0 release seems to be broken; I cannot find the release:

```
docker pull nvcr.io/nvidia/cloud-native/k8s-rdma-shared-dev-plugin:v1.4.0
Error response from daemon: manifest for nvcr.io/nvidia/cloud-native/k8s-rdma-shared-dev-plugin:v1.4.0 not found: manifest unknown: manifest unknown
```

Nor can it be seen here: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/cloud-native/containers/k8s-rdma-shared-dev-plugin/tags

hvp4 · Jan 08 '24 23:01

The image is here: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/pkgs/container/k8s-rdma-shared-dev-plugin

adrianchiris · Jan 09 '24 11:01