sriov-network-device-plugin

Support device health-check

Open adrianchiris opened this issue 4 years ago • 3 comments

What would you like to be added?

Support periodically checking device health and notifying kubelet of changes to devices via the ListAndWatch RPC call.

What is the use case for this feature / enhancement?

Devices may become unhealthy, e.g. a resource may become corrupted while it is being consumed by a workload. We should report this to kubelet so that the device is not requested for future workloads.

https://github.com/kubernetes/kubernetes/blob/234d7311822aecb8c5f4115107007b8420d9316b/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/api.proto#L58

adrianchiris avatar Jul 13 '21 15:07 adrianchiris
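
For illustration, the request above (probe devices periodically and push a fresh device list whenever health changes) could look roughly like the sketch below. It is not code from the plugin. `Device`, `refreshHealth`, `watchHealth` and the `probe` callback are hypothetical names, and the `updates` channel stands in for whatever would wake the ListAndWatch stream; only the "Healthy"/"Unhealthy" strings mirror the kubelet device plugin API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Health strings mirror the values used by the kubelet device plugin API.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a hypothetical, simplified stand-in for the API's Device message.
type Device struct {
	ID     string
	Health string
}

// refreshHealth probes every device, updates its Health field in place,
// and reports whether any device changed state.
func refreshHealth(devs []Device, probe func(id string) bool) bool {
	changed := false
	for i := range devs {
		state := Unhealthy
		if probe(devs[i].ID) {
			state = Healthy
		}
		if devs[i].Health != state {
			devs[i].Health = state
			changed = true
		}
	}
	return changed
}

// watchHealth re-probes on every tick. When something changed it sends a
// copy of the full device list on updates; a ListAndWatch handler could
// forward that list to kubelet (assumption, not shown here).
func watchHealth(ctx context.Context, interval time.Duration, devs []Device,
	probe func(id string) bool, updates chan<- []Device) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if refreshHealth(devs, probe) {
				snapshot := append([]Device(nil), devs...)
				select {
				case updates <- snapshot:
				case <-ctx.Done():
					return
				}
			}
		}
	}
}

func main() {
	devs := []Device{
		{ID: "0000:03:00.1", Health: Healthy},
		{ID: "0000:03:00.2", Health: Healthy},
	}
	// Fake probe for the demo: pretend the second device has gone bad.
	probe := func(id string) bool { return id != "0000:03:00.2" }

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	updates := make(chan []Device)
	go watchHealth(ctx, 10*time.Millisecond, devs, probe, updates)

	select {
	case list := <-updates:
		for _, d := range list {
			fmt.Println(d.ID, d.Health)
		}
	case <-ctx.Done():
		fmt.Println("no health change observed")
	}
}
```

Sending the whole list rather than a delta matches how ListAndWatch responses carry the complete set of devices, so the consumer never has to merge partial updates.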

Isn't this a bug, since it is listed as a supported feature? https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin#features

TothFerenc avatar Jul 13 '21 15:07 TothFerenc

It seems that there is a handler for updateSignal, but I assume that has to be issued by kubelet (?). Do you want the DP to 'proactively' scan the devices' health status and then pass that info to kubelet on its own?

My other question is about the plans to make the DP track devices (as in issue 276): should the DP then also track the health status of the 'consumed' devices? I wonder whether that is even achievable, since the devices are moved into the container's namespace.

ipatrykx avatar Jul 21 '21 11:07 ipatrykx

Isn't this a bug, since it is listed as a supported feature?

Maybe a documentation bug :) I don't remember having this logic in the DP.

@ipatrykx I think we should first define what a healthy device is.

A good start IMO is: a device is considered healthy if all relevant resources for that device are present in the system. I am unsure how to check this for allocated devices.

adrianchiris avatar Jul 27 '21 12:07 adrianchiris
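
As a rough illustration of the definition proposed above, a presence check could be as small as the sketch below. Which resources count as "relevant" is left open in the thread, so this assumes only the device's PCI sysfs directory; `pciSysfsPaths` and `resourcesPresent` are hypothetical helpers, not plugin code, and the PCI address is a made-up example. It does not address the open question about allocated devices.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pciSysfsPaths lists the paths assumed to be required for a PCI device.
// Assumption: only the device's sysfs directory is considered "relevant".
func pciSysfsPaths(sysfsRoot, pciAddr string) []string {
	return []string{filepath.Join(sysfsRoot, "bus", "pci", "devices", pciAddr)}
}

// resourcesPresent reports whether every path can be stat'ed.
// Any error (missing path, permission problem) counts as "not present".
func resourcesPresent(paths []string) bool {
	for _, p := range paths {
		if _, err := os.Stat(p); err != nil {
			return false
		}
	}
	return true
}

func main() {
	addr := "0000:03:00.1" // example address, not taken from the issue
	paths := pciSysfsPaths("/sys", addr)
	fmt.Printf("%s healthy=%v\n", addr, resourcesPresent(paths))
}
```

A probe like this could be used as the health callback of a periodic checker; extending the path list (for example with the device's netdev directory) would tighten the definition.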