
Include /usr/bin/nvidia-smi for nvidia-kmod extension


When attempting to run the NVIDIA gpu-operator, it fails to fully initialize. From what I can tell, this is because the nvidia-validator tries to run the nvidia-smi binary from /usr/bin on the host.

NAMESPACE     NAME                                                          READY   STATUS     RESTARTS      AGE
kube-system   coredns-85b955d87b-9cx56                                      1/1     Running    0             70m
kube-system   coredns-85b955d87b-nfdgb                                      1/1     Running    0             70m
kube-system   gpu-feature-discovery-jn6ps                                   0/1     Init:0/1   0             49m
kube-system   gpu-operator-7bbf8bb6b7-g4pd2                                 1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-gc-79d6d968bb-jkn2s       1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-master-6d9f8d497c-xvttn   1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-worker-6cgnv              1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-worker-tdc8j              1/1     Running    0             50m
kube-system   kube-apiserver-up                                             1/1     Running    0             69m
kube-system   kube-controller-manager-up                                    1/1     Running    1 (70m ago)   68m
kube-system   kube-flannel-ffftw                                            1/1     Running    0             69m
kube-system   kube-flannel-q972c                                            1/1     Running    0             69m
kube-system   kube-proxy-mrc75                                              1/1     Running    0             69m
kube-system   kube-proxy-n5qdc                                              1/1     Running    0             69m
kube-system   kube-scheduler-up                                             1/1     Running    2 (70m ago)   68m
kube-system   nvidia-dcgm-exporter-jlqbb                                    0/1     Init:0/1   0             49m
kube-system   nvidia-device-plugin-daemonset-q89xh                          0/1     Init:0/1   0             49m
kube-system   nvidia-operator-validator-jfs6m                               0/1     Init:0/4   0             49m

I installed the operator via Helm with the following values.yaml:

driver:
  enabled: false

toolkit:
  enabled: false
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/cri/conf.d/nvidia-container-runtime.part
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"

These values should skip installing the drivers and modifying the containerd config (both are already provided by the extensions), but the operator apparently doesn't skip validating them.
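
A quick way to sanity-check that the extension-provided pieces are actually in place on the node (a rough sketch, assuming talosctl is configured for the node; the IP is a placeholder):

# List the system extensions Talos loaded on the node
talosctl -n 10.0.0.10 get extensions

# Confirm the extension-provided containerd drop-in referenced in values.yaml exists
talosctl -n 10.0.0.10 read /etc/cri/conf.d/nvidia-container-runtime.part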

The chart was installed with:

helm install gpu-operator \
    -n kube-system nvidia/gpu-operator --values values.yaml
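
(For anyone reproducing this: the chart comes from NVIDIA's Helm repository, which needs to be added first, per the install docs linked below.)

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update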

I tried manually touching the files that the validator creates, but it still attempts to execute the nvidia-smi command:

running command chroot with args [/run/nvidia/driver nvidia-smi]
chroot: failed to run command 'nvidia-smi': No such file or directory
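
For reference, the mismatch is also visible from outside the cluster: the Talos extension installs nvidia-smi under /usr/local/bin (more on that below), while the validator chroots into /run/nvidia/driver and expects to find it there. A rough check (assuming talosctl access; the node IP is a placeholder):

# Where the Talos extension actually installs the NVIDIA userspace tools
talosctl -n 10.0.0.10 list /usr/local/bin

# Where the validator expects to chroot and run nvidia-smi
talosctl -n 10.0.0.10 list /run/nvidia/driver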

More information is available in the repo https://github.com/NVIDIA/gpu-operator/tree/master and in the installation docs https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#operator-install-guide

rothgar, May 14 '24 00:05

The Talos Nvidia driver extension installs nvidia-smi under /usr/local/bin, which is a somewhat non-standard location for an Nvidia driver component (other components are under /usr/local/lib, which is also non-standard; this will come up later if you read on). The current release version of nvidia-validator will not find nvidia-smi at that path. However, the main branch of the operator and operator-validator has significantly different code (to handle driver container images). If you override the images for the operator and operator-validator to use one of the daily CI builds on GitHub, you should get past that issue.
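
A rough sketch of that override via Helm (the <component>.repository/image/version value keys follow the chart's usual layout, and the registry and tag here are placeholders for whatever the CI builds publish, so treat them as assumptions to verify against the chart):

helm upgrade gpu-operator nvidia/gpu-operator -n kube-system --reuse-values \
    --set operator.repository=<ci-registry> \
    --set operator.image=gpu-operator \
    --set operator.version=<ci-build-tag> \
    --set validator.repository=<ci-registry> \
    --set validator.image=gpu-operator-validator \
    --set validator.version=<ci-build-tag>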

However, you will then find that the device plugin fails to find a core CUDA library as part of its driver detection process. This is because of the aforementioned custom install path for the other driver components. Furthermore, Talos patches the container toolkit to change the ldcache path (which the toolkit uses to find libraries), because Talos maintains separate glibc and musl LD caches and therefore stores them in custom locations. To get past that issue, you will need to patch the device plugin, build and publish a custom image, and use that image. Something like this:

diff --git a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
index 2f6de2fe..35f62f45 100644
--- a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
+++ b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/ldcache/ldcache.go
@@ -33,7 +33,7 @@ import (
 	"github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/symlinks"
 )
 
-const ldcachePath = "/etc/ld.so.cache"
+const ldcachePath = "/usr/local/glibc/etc/ld.so.cache"
 
 const (
 	magicString1 = "ld.so-1.7.0"
diff --git a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
index 7f5cf7c8..85fd1db9 100644
--- a/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
+++ b/vendor/github.com/NVIDIA/nvidia-container-toolkit/internal/lookup/library.go
@@ -36,6 +36,7 @@ func NewLibraryLocator(opts ...Option) Locator {
 
 	// If search paths are already specified, we return a locator for the specified search paths.
 	if len(b.searchPaths) > 0 {
+		b.logger.Infof("Returning symlink locator with paths: %v", b.searchPaths)
 		return NewSymlinkLocator(
 			WithLogger(b.logger),
 			WithSearchPaths(b.searchPaths...),
@@ -56,6 +57,7 @@ func NewLibraryLocator(opts ...Option) Locator {
 			"/lib/aarch64-linux-gnu",
 			"/lib/x86_64-linux-gnu/nvidia/current",
 			"/lib/aarch64-linux-gnu/nvidia/current",
+			"/usr/local/lib",
 		}...),
 	)
 	// We construct a symlink locator for expected library locations.
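
A sketch of the rest of that workflow, i.e. publishing the patched device plugin and pointing the operator at it (the registry, tag, and Dockerfile path are placeholders, and the devicePlugin.* value keys are an assumption based on how the chart overrides other components):

# From a patched checkout of https://github.com/NVIDIA/k8s-device-plugin
docker build -f <path-to-dockerfile> -t registry.example.com/k8s-device-plugin:talos-patched .
docker push registry.example.com/k8s-device-plugin:talos-patched

helm upgrade gpu-operator nvidia/gpu-operator -n kube-system --reuse-values \
    --set devicePlugin.repository=registry.example.com \
    --set devicePlugin.image=k8s-device-plugin \
    --set devicePlugin.version=talos-patched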

With the previously mentioned upcoming support for driver container images in the GPU operator, Talos may want to consider reworking its Nvidia extensions to deliver all the components as a container image. That should hopefully provide a more supported and long-term stable solution.

jfroy, Jul 01 '24 16:07

Hi @jfroy, one issue here is that Talos requires signed drivers, and the signing key is ephemeral to each build process, hence why each release of Talos has a specific corresponding release of each system extension. We would love to work through some potential ideas with you if you'd be interested in joining our Slack community, or even be available for a call to run through them.

TimJones, Jul 01 '24 19:07

Hi @jfroy, one issue here is that Talos requires signed drivers, and the signing key is ephemeral to each build process, hence why each release of Talos has a specific corresponding release of each system extension.

We would love to work through some potential ideas with you if you'd be interested in joining our Slack community, or even be available for a call to run through them.

Yeah, I like that Talos provides a chain of trust. You would need a per-release driver container, just like you have a per-release extension.

I work at Nvidia, but I only speak for myself here. It would be inappropriate to engage beyond the occasional comment and bug fix PR on GitHub. I will however reach out to the folks working on our container technologies.

jfroy, Jul 01 '24 20:07

I will however reach out to the folks working on our container technologies.

That would be greatly appreciated, and thank you for reaching out in the first instance.

TimJones, Jul 01 '24 20:07

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot], Jun 28 '25 02:06

Based on the work in #476, I don't think this approach would be enough to fix the driver validator, and we'll need more configuration options from the operator to get this working with Talos.

I'll close this while we wait for a more complete solution.

rothgar, Jun 30 '25 17:06