k8s-device-plugin
Make GPU Feature Discovery more robust to device failures
When enumerating devices, a single device that returns an error causes GPU Feature Discovery (GFD) to fail, so any remaining devices are skipped. This manifests as errors similar to:
E1105 17:45:39.017442 1 main.go:110] error creating labeler: error getting devices: error getting device handle for index '0': Unknown Error
This change pulls in an update to go-nvlib (vendored locally; see https://github.com/NVIDIA/go-nvlib/pull/80) that allows errors encountered while enumerating devices to be ignored, and ensures that the device lib is constructed with the required option. A simple unit test demonstrates that labels are still generated when these errors occur.
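The behaviour described above can be sketched roughly as follows. This is an illustration only, not the go-nvlib implementation: `deviceGetter`, `visitDevices`, `fakeGetter`, and the `ignoreErrors` flag are hypothetical names made up for this example, and device handles are modelled as plain strings.

```go
package main

import (
	"errors"
	"fmt"
)

// deviceGetter is a hypothetical stand-in for the call that returns a
// device handle for a given index. It is not part of go-nvlib.
type deviceGetter func(index int) (string, error)

// errUnknown mimics the "Unknown Error" seen in the GFD log line.
var errUnknown = errors.New("Unknown Error")

// visitDevices queries every index in [0, count).
// With ignoreErrors set, a failing device is recorded and skipped so the
// remaining devices are still returned. Without it, the first failure
// aborts the enumeration, which is the behaviour this change moves away from.
func visitDevices(count int, get deviceGetter, ignoreErrors bool) ([]string, []error, error) {
	var devices []string
	var skipped []error
	for i := 0; i < count; i++ {
		d, err := get(i)
		if err != nil {
			wrapped := fmt.Errorf("error getting device handle for index '%d': %w", i, err)
			if !ignoreErrors {
				return nil, nil, wrapped
			}
			skipped = append(skipped, wrapped)
			continue
		}
		devices = append(devices, d)
	}
	return devices, skipped, nil
}

// fakeGetter simulates a node where the device at index bad is unhealthy.
func fakeGetter(bad int) deviceGetter {
	return func(index int) (string, error) {
		if index == bad {
			return "", errUnknown
		}
		return fmt.Sprintf("gpu-%d", index), nil
	}
}
```

Calling `visitDevices(3, fakeGetter(0), true)` returns the two healthy devices plus one recorded error, so a labeler built on top of it would still have devices to label. Passing `false` instead returns the wrapped error for index 0 and no devices.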