cannot generate nvidia.com/gpu.xxx labels on node

Open · double12gzh opened this issue 1 year ago · 5 comments

Description

With node-feature-discovery, gpu-feature-discovery and nvidia-device-plugin deployed, labels such as nvidia.com/gpu.product, nvidia.com/gpu.replica, etc. are expected to appear on the node. However, no such labels are present.
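
For reference, a quick way to list whichever of these labels are actually present on a node (the node name below is a placeholder):

```sh
# List the nvidia.com/gpu.* labels currently set on the node (replace <node-name>).
kubectl get node <node-name> --show-labels | tr ',' '\n' | grep 'nvidia.com/gpu'
```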

How to find the root cause

  1. Exec'd into the gpu-feature-discovery pod and found that /etc/kubernetes/node-feature-discovery/features.d/gfd had no content.
  2. Inside the gpu-feature-discovery pod, neither the NVML library nor the nvidia-smi command could be found (see the sketch below).
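
A minimal sketch of how these two checks can be reproduced, assuming the pod name gpu-feature-discovery and the namespace xxx used later in this thread:

```sh
# Pod name and namespace follow this thread; adjust for your deployment.

# 1. Does GFD write any feature file for NFD to pick up?
kubectl exec -n xxx gpu-feature-discovery -- cat /etc/kubernetes/node-feature-discovery/features.d/gfd

# 2. Are nvidia-smi and the NVML library visible inside the container?
#    (the library path depends on the base image)
kubectl exec -n xxx gpu-feature-discovery -- nvidia-smi -L
kubectl exec -n xxx gpu-feature-discovery -- sh -c 'ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* 2>/dev/null || ldconfig -p | grep libnvidia-ml'
```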

Others

If I delete the gpu-feature-discovery pod and it is recreated, the labels are added to the node correctly.

double12gzh avatar Mar 01 '24 10:03 double12gzh

@double12gzh how is k8s configured to make use of GPUs? If you're using containerd or cri-o, you will have to configure the NVIDIA Container Runtime for each of these. In addition, if this is not the default runtime in either case, you will have to create a RuntimeClass in k8s and deploy the device plugin and GFD using this runtime class.
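
For containerd, a minimal sketch of that setup, assuming the NVIDIA Container Toolkit is already installed on each GPU node (the RuntimeClass name nvidia is a convention, not a requirement):

```sh
# On each GPU node: register the "nvidia" runtime with containerd
# (nvidia-ctk ships with the NVIDIA Container Toolkit).
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# In the cluster: expose that runtime as a RuntimeClass ...
cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# ... and reference it from the device plugin / GFD pod specs:
#   spec:
#     runtimeClassName: nvidia
```

If "nvidia" is instead made the default runtime in containerd, the RuntimeClass indirection is not strictly required.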

> If I delete the gpu-feature-discovery pod and it is recreated, the labels are added to the node correctly.

Do you mean that the labels are generated if GFD is restarted? Note that GFD should trigger a regeneration of labels after a certain amount of time. This should retrigger the NVML initialization, but if the driver was not available when the container was started, the container would still not detect the devices.
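
One way to observe whether those periodic passes keep failing to initialize NVML is to tail the GFD logs (pod name and namespace as used elsewhere in this thread):

```sh
# Check whether each periodic labeling pass reports an NVML initialization failure.
kubectl logs -n xxx gpu-feature-discovery --tail=50
```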

elezar avatar Mar 01 '24 12:03 elezar

Many thanks for your kind reply.

Yes, we are using containerd and have configured "nvidia" as the default runtime.
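
A hedged way to verify that from the node itself, assuming the default containerd config path:

```sh
# Confirm that "nvidia" is registered and set as the default CRI runtime.
grep -nE -A4 'default_runtime_name|runtimes\.nvidia' /etc/containerd/config.toml
```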

Actually, I ran kubectl delete pod -n xxx gpu-feature-discovery and waited until the pod was successfully recreated. After that, NVML was initialized correctly and the labels were successfully applied to the node.

If I don't delete the GFD pod, NVML is never successfully initialized, even though the detection runs periodically.
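
For completeness, the workaround described above as commands (namespace and pod name as in this thread; <node-name> is a placeholder):

```sh
# Recreate the GFD pod, wait until the new pod is Running, then re-check the labels.
kubectl delete pod -n xxx gpu-feature-discovery
kubectl get pod -n xxx gpu-feature-discovery -w    # wait for Running/Ready, then Ctrl-C
kubectl get node <node-name> --show-labels | tr ',' '\n' | grep 'nvidia.com/gpu'
```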

double12gzh avatar Mar 01 '24 13:03 double12gzh

@double12gzh how is the driver installed on your system? Is it preinstalled and available at the point in time where the pods are started for the first time?

elezar avatar Mar 01 '24 13:03 elezar

Yes, it is preinstalled and available before the pods are started.

double12gzh avatar Mar 01 '24 13:03 double12gzh

The nvidia-smi command works on the host, but when the pod was first created, nvidia-smi did not work inside the pod. After I deleted the GFD pod and waited for it to be recreated, I used kubectl exec to get into the pod and ran nvidia-smi, and it worked fine.

double12gzh avatar Mar 01 '24 13:03 double12gzh

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Aug 26 '24 04:08 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Sep 25 '24 04:09 github-actions[bot]