
feat: nvidia driver extension

jfroy opened this pull request 1 year ago • 3 comments

This patch deprecates the NVIDIA toolkit extension and introduces a new nvidia-driver extension (in production/LTS versions and open-source/proprietary flavors). The NVIDIA container toolkit must be installed independently, via a future Talos extension, the NVIDIA GPU Operator, or by the cluster administrator.

The extension depends on the new glibc extension (#473) and participates in its filesystem subroot by installing all the NVIDIA components in it.

Finally, the extension runs a service that will bind mount this glibc subroot at /run/nvidia/driver and run the nvidia-persistenced daemon.

This careful setup allows the NVIDIA GPU Operator to utilize this extension as if it were a traditional NVIDIA driver container.
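For illustration only, a Talos extension service spec for such a service might look roughly like the sketch below. This is not taken from the patch; the entrypoint arguments, mount source, and dependencies are assumptions based on the existing NVIDIA extensions and the Talos extension service format.

```yaml
# Hypothetical sketch -- not the spec shipped by this PR.
name: nvidia-driver
container:
  entrypoint: /usr/local/bin/nvidia-persistenced
  args:
    - --verbose
  mounts:
    # Assumption: expose the glibc subroot (where the driver userspace is
    # installed) at the path the GPU Operator treats as a driver-container root.
    - source: /usr/local/glibc        # illustrative subroot location
      destination: /run/nvidia/driver
      type: bind
      options:
        - bind
        - ro
depends:
  - service: cri
restart: always
```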

--

I've tested this extension on my homelab cluster with the current release of the NVIDIA GPU Operator, letting the operator install and configure the NVIDIA Container Toolkit (with my Go wrapper patch, https://github.com/NVIDIA/nvidia-container-toolkit/pull/700).

This is the more Talos-native way of managing NVIDIA drivers, as opposed to letting the GPU Operator load and unload drivers based on its ClusterPolicy or NVIDIADriver custom resources, as discussed in https://github.com/siderolabs/talos/pull/9339 and #473.

This configuration only works in CDI mode, as the "legacy" runtime hook requires additional userspace libraries that this PR removes.
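For reference, CDI mode is enabled through the GPU Operator's Helm values. The keys below reflect the upstream gpu-operator chart, but verify them against the chart version you deploy.

```yaml
# NVIDIA GPU Operator Helm values (verify against your chart version).
cdi:
  enabled: true   # generate and use CDI specs for GPU injection
  default: true   # make CDI the default mode for the container toolkit
```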

--

One other requirement on the cluster is to configure the containerd runtime classes. The GPU Operator and the container toolkit installer (the component the operator uses to deploy the toolkit) have logic to install the runtime classes and patch the containerd config, but this does not work on Talos because the containerd config is synthesized from files that reside on the read-only system partition.

The operator can be installed with its containerd configuration management bypassed/disabled. The cluster administrator is then on the hook to register the runtime and create the runtime classes themselves, as sketched below.
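A rough sketch of that manual setup, with illustrative values that are not taken from this PR: a Talos machine config patch adds a containerd config fragment registering the nvidia runtime, and a matching RuntimeClass points workloads at it.

```yaml
# Illustrative Talos machine config patch adding a containerd config fragment.
# The /etc/cri/conf.d/*.part pattern follows the Talos documentation; the
# plugin key shown is for containerd 1.x (containerd 2.x uses
# "io.containerd.cri.v1.runtime" instead of "io.containerd.grpc.v1.cri").
machine:
  files:
    - path: /etc/cri/conf.d/20-customization.part
      op: create
      content: |
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            # Adjust to wherever the toolkit actually installs the runtime binary.
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
```

```yaml
# Matching RuntimeClass; the handler name must equal the runtime name
# registered with containerd above.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```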

--

There could be a Talos extension for the NVIDIA Container Toolkit. It probably would look a lot like the existing one and maybe even include all the userspace libraries needed for the legacy runtime (basically for nvidia-container-cli). For CDI mode support, a service could invoke nvidia-ctk to generate the CDI spec for the devices present on each node (this is a Go binary that only requires glibc and the driver libraries). However, there is some amount of logic in the GPU Operator to configure the toolkit to work with all the other components that the operator may install and manage on the cluster, so a Talos extension for the toolkit would provide a less integrated, possibly less functional experience.
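A minimal sketch of what that CDI-generation service could look like, assuming the Talos extension service spec format; the service name and dependencies are invented for illustration.

```yaml
# Hypothetical extension service spec: generate a CDI spec for the GPUs present
# on the node using nvidia-ctk, writing to the default CDI spec directory.
name: nvidia-cdi-generate
container:
  entrypoint: /usr/bin/nvidia-ctk
  args:
    - cdi
    - generate
    - --output=/var/run/cdi/nvidia.yaml
depends:
  - service: cri        # assumption: run once the node's container runtime is up
restart: untilSuccess   # one-shot style: retry until the spec has been written
```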

jfroy avatar Sep 23 '24 04:09 jfroy

Quick fun update: with https://github.com/NVIDIA/gpu-operator/pull/1007, I can launch pods with an NVIDIA GPU without using the custom NVIDIA runtimes, just the default runc.

There are caveats with bypassing/not using the NVIDIA runtime wrapper, but for customers that don't depend on the behaviors listed below, it's a nice setup/maintenance simplification:

  • The runtime.nvidia.com CDI vendor will not work. This is a vendor that triggers a CDI spec generation on the fly and is implemented by the NVIDIA runtime wrapper.
  • Container image environment variables (see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html) will not work, as runc does not act on them in any way. This will break deployments that rely only on a container image with those environment variables and a runtime class invoking the NVIDIA runtime wrapper (either explicitly or because it is the cluster default). Arguably it is more in the spirit of Kubernetes to require those deployments to request a GPU resource explicitly (see the pod sketch after this list).
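A minimal example of the resource-request pattern (image tag and names are illustrative):

```yaml
# Pod that requests the GPU through the resource API instead of relying on
# NVIDIA_VISIBLE_DEVICES plus an NVIDIA runtime class; with CDI, the device
# plugin injects the device and the default runc handler is sufficient.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```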

jfroy avatar Sep 24 '24 03:09 jfroy

As another note, the current NVIDIA GPU Operator supports more than an "LTS" and "production" version of the driver stack. Per https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html, there are 4 driver versions supported. With the open-vs-proprietary kernel module choice, that would mean 8 distinct NVIDIA driver system extensions per Talos release, if those numbers don't change. Maybe the extension can be templated to reduce duplicated code, but it does show the appeal of potentially taking a different approach to accelerator drivers, at least the complex ones like GPUs and FPGAs (and maybe also smart NICs / DPUs).

jfroy avatar Sep 24 '24 03:09 jfroy

@jfroy the glibc changes are good, the tests passed, i think we can continue iterating

frezbo avatar Oct 22 '24 08:10 frezbo

Is there still interest in finishing this extension? It seems like the changes would be useful. I'm not sure if it still requires changes to the gpu operator.

rothgar avatar May 23 '25 22:05 rothgar

I continue to work on it and keep it updated. I just pushed the latest rebase. I'll loop back with the operator team. They've been very busy with DRA and it's impacting how CDI is integrated.

jfroy avatar May 27 '25 15:05 jfroy

This PR is stale because it has been open 45 days with no activity.

github-actions[bot] avatar Jul 12 '25 02:07 github-actions[bot]

I'll refresh this PR soon.

jfroy avatar Jul 14 '25 15:07 jfroy

Looking forward to starting to use ctk.

frezbo avatar Jul 14 '25 18:07 frezbo

Is there an update on this PR?

mitchross avatar Aug 14 '25 13:08 mitchross

Coming soon! My paternity leave and summer break end this week. Then I’ll have the time.

jfroy avatar Aug 14 '25 13:08 jfroy

This PR is stale because it has been open 45 days with no activity.

github-actions[bot] avatar Sep 29 '25 02:09 github-actions[bot]