k8s-rdma-shared-dev-plugin

RDMA allocatable resources changed to 0 and couldn't be updated

Open · 913871734 opened this issue 1 year ago · 1 comment

We found that kubelet reported the number of allocatable RDMA devices as 0, so I started to investigate why.

  1. I checked the plugin's log and found that it was consistently exposing 1k devices (non-zero), which shows the plugin itself was working normally.

  2. I checked the plugin's source code and found that it counts the devices each period. The key logic is that it compares the current count with the value from the previous cycle, and only reports an update to kubelet when the value changes. A communication error with kubelet can leave the client holding a value of 0; but because an update is only pushed when the actual count changes, the client's value is never corrected. In the end, the real number of devices never changed, yet the client's stale 0 could not be updated to the correct value.


```go
// ListAndWatch lists devices and update that list according to the health status
func (rs *resourceServer) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
    log.Printf("ListAndWatch called by kubelet for: %s", rs.resourceName)
    resp := new(pluginapi.ListAndWatchResponse)

    // Send initial list of devices
    if err := rs.sendDevices(resp, s); err != nil {
        return err
    }

    rs.mutex.RLock()
    err := rs.updateCDISpec()
    rs.mutex.RUnlock()
    if err != nil {
        log.Printf("cannot update CDI specs: %v", err)
        return err
    }

    for {
        select {
        case <-s.Context().Done():
            log.Printf("ListAndWatch stream close: %v", s.Context().Err())
            return nil
        case d := <-rs.health:
            // FIXME: there is no way to recover from the Unhealthy state.
            d.Health = pluginapi.Unhealthy
            _ = s.Send(&pluginapi.ListAndWatchResponse{Devices: rs.devs})
        case <-rs.updateResource:
            if err := rs.sendDevices(resp, s); err != nil {
                // The old stream may not be closed properly, return to close it
                // and pass the update event to the new stream for processing
                rs.updateResource <- true
                return err
            }
            err := rs.updateCDISpec()
            if err != nil {
                log.Printf("cannot update CDI specs: %v", err)
                return err
            }
        }
    }
}
```

Therefore, I suggest changing the update mechanism to add a forced-push option: for example, when the value has stayed the same for a specified number of cycles, perform a forced push anyway.

913871734 avatar Jun 14 '24 06:06 913871734

Seen the same issue here. In my case, kubelet was down for a few hours due to a network issue, and after it resumed it no longer re-registered the previously registered devices; the RDMA plugin failed to detect this since it assumes no change. As a workaround I added a liveness check that verifies the RDMA resource is still registered in /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint, which could be another option if the periodic forced update suggested above is not desirable.

yifeng-cerebras avatar Mar 31 '25 19:03 yifeng-cerebras