Make GPU Feature Discovery more robust to device failures

Open elezar opened this issue 1 month ago • 0 comments

When enumerating devices, a single device that returns an error causes GFD to fail outright, so any remaining devices are skipped. This manifests as errors similar to:

E1105 17:45:39.017442       1 main.go:110] error creating labeler: error getting devices: error getting device handle for index '0': Unknown Error

This change pulls in changes from go-nvlib (vendored locally; see https://github.com/NVIDIA/go-nvlib/pull/80) that allow errors encountered while enumerating devices to be ignored, and ensures that the device lib is constructed with the required option. A simple unit test demonstrates that labels are still generated when these errors occur.

elezar, Dec 10 '25 10:12