cannot generate nvidia.com/gpu.xxx labels on node

Open · double12gzh opened this issue 1 year ago · 5 comments

Description

With node-feature-discovery, gpu-feature-discovery and nvidia-device-plugin deployed, labels such as nvidia.com/gpu.product, nvidia.com/gpu.replica, etc. are expected to appear on the node. However, no such labels are present.
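
For reference, a quick way to list whichever of these labels are actually present on a node (the node name below is a placeholder):

```sh
# List the nvidia.com/gpu.* labels currently set on the node (replace <node-name>).
kubectl get node <node-name> --show-labels | tr ',' '\n' | grep 'nvidia.com/gpu'
```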

How to find the root cause

  1. Exec'd into the gpu-feature-discovery pod and found that /etc/kubernetes/node-feature-discovery/features.d/gfd had no content.
  2. Inside the gpu-feature-discovery pod, neither the NVML library nor the nvidia-smi command could be found (see the sketch below).
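
A minimal sketch of how these two checks can be reproduced, assuming the pod name gpu-feature-discovery and the namespace xxx used later in this thread:

```sh
# Pod name and namespace follow this thread; adjust for your deployment.

# 1. Does GFD write any feature file for NFD to pick up?
kubectl exec -n xxx gpu-feature-discovery -- cat /etc/kubernetes/node-feature-discovery/features.d/gfd

# 2. Are nvidia-smi and the NVML library visible inside the container?
#    (the library path depends on the base image)
kubectl exec -n xxx gpu-feature-discovery -- nvidia-smi -L
kubectl exec -n xxx gpu-feature-discovery -- sh -c 'ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* 2>/dev/null || ldconfig -p | grep libnvidia-ml'
```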

Others

If I delete the gpu-feature-discovery pod and it is recreated, the labels are added to the node correctly.

double12gzh avatar Mar 01 '24 10:03 double12gzh

@double12gzh how is k8s configured to make use of GPUs? If you're using containerd or cri-o, you will have to configure the NVIDIA Container Runtime for each of these. In addition, if this is not the default runtime in either case, you will have to create a RuntimeClass in k8s and deploy the device plugin and GFD using this runtime class.
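
For containerd, a minimal sketch of that setup, assuming the NVIDIA Container Toolkit is already installed on each GPU node (the RuntimeClass name nvidia is a convention, not a requirement):

```sh
# On each GPU node: register the "nvidia" runtime with containerd
# (nvidia-ctk ships with the NVIDIA Container Toolkit).
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# In the cluster: expose that runtime as a RuntimeClass ...
cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# ... and reference it from the device plugin / GFD pod specs:
#   spec:
#     runtimeClassName: nvidia
```

If "nvidia" is instead made the default runtime in containerd, the RuntimeClass indirection is not strictly required.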

> If I delete the gpu-feature-discovery pod and it is recreated, the labels are added to the node correctly.

Do you mean that the labels are generated if GFD is restarted? Note that GFD should trigger a regeneration of labels after a certain amount of time. This should retrigger the NVML initialization, but if the driver was not available when the container was started, the container would still not detect the devices.
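
One way to observe whether those periodic passes keep failing to initialize NVML is to tail the GFD logs (pod name and namespace as used elsewhere in this thread):

```sh
# Check whether each periodic labeling pass reports an NVML initialization failure.
kubectl logs -n xxx gpu-feature-discovery --tail=50
```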

elezar avatar Mar 01 '24 12:03 elezar

Many thanks for your kind reply.

Yes, we are using containerd and have configured "nvidia" as the default runtime.
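
A hedged way to verify that from the node itself, assuming the default containerd config path:

```sh
# Confirm that "nvidia" is registered and set as the default CRI runtime.
grep -nE -A4 'default_runtime_name|runtimes\.nvidia' /etc/containerd/config.toml
```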

Actually, I ran kubectl delete pod -n xxx gpu-feature-discovery and waited until the pod was successfully recreated. After that, NVML was initialized correctly and the labels were successfully applied to the node.

If I don't delete the GFD pod, NVML is never successfully initialized, even though the detection runs periodically.
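
For completeness, the workaround described above as commands (namespace and pod name as in this thread; <node-name> is a placeholder):

```sh
# Recreate the GFD pod, wait until the new pod is Running, then re-check the labels.
kubectl delete pod -n xxx gpu-feature-discovery
kubectl get pod -n xxx gpu-feature-discovery -w    # wait for Running/Ready, then Ctrl-C
kubectl get node <node-name> --show-labels | tr ',' '\n' | grep 'nvidia.com/gpu'
```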

double12gzh avatar Mar 01 '24 13:03 double12gzh

@double12gzh how is the driver installed on your system? Is it preinstalled and available at the point in time where the pods are started for the first time?

elezar avatar Mar 01 '24 13:03 elezar

Yes, it is preinstalled and available before the pods are started.

double12gzh avatar Mar 01 '24 13:03 double12gzh

The nvidia-smi command works on the host, but when the pod was first created, nvidia-smi did not work inside the pod. After I deleted the GFD pod and waited for it to be recreated, I used kubectl exec to get into the pod and ran nvidia-smi, and it worked fine.

double12gzh avatar Mar 01 '24 13:03 double12gzh

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Aug 26 '24 04:08 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Sep 25 '24 04:09 github-actions[bot]