feat: glibc extension

Open jfroy opened this issue 1 year ago • 5 comments

This PR adds a glibc extension. The intention is to replace the nvidia extensions entirely and only provide the glibc components required by the nvidia gpu-operator and its components (e.g. the nvidia container toolkit).

The extension is pretty much copied from the package that is in the nvidia extensions, with one major modification: a symbolic link to ldconfig is installed at /sbin/ldconfig. This change allows the nvidia gpu-operator to work without modification*. This change does require a patch to the extension validation logic, which is provided in separate PRs.

* A patch to the nvidia container toolkit is required to replace the shell script wrappers with a Go wrapper. See https://github.com/NVIDIA/nvidia-container-toolkit/pull/700.
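As a quick sanity check of the symlink described above, here is a minimal Python sketch. The helper name `resolve_ldconfig` and its `root` parameter are illustrative and not part of the extension; the PR text only states that a link exists at /sbin/ldconfig, so the location of the real binary is not assumed here.

```python
import os


def resolve_ldconfig(root="/", link="sbin/ldconfig"):
    """Return the real path behind <root>/<link> if it is a symlink that
    resolves to an existing regular file, otherwise None.

    `root` lets the check run against an unpacked extension tree instead of
    the live filesystem. Note: an absolute symlink target is resolved against
    the real filesystem root, not against `root`.
    """
    path = os.path.join(root, link)
    if not os.path.islink(path):
        return None
    target = os.path.realpath(path)
    return target if os.path.isfile(target) else None


if __name__ == "__main__":
    # On a node with the extension installed this is expected to print a path;
    # elsewhere it may print None.
    print(resolve_ldconfig())
```

Passing a directory as `root` makes it possible to inspect an extracted extension image before it is installed.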

jfroy, Sep 18 '24 16:09

This is really cool, and I'm looking forward to not having to maintain patches or wrappers. What I still don't understand is the kernel modules: would SideroLabs still ship them as extensions? (I believe that's the case, since only machined in Talos can load modules.)

frezbo, Sep 18 '24 16:09

I am sending more PRs, but this is one of the major changes from an architecture POV: the gpu-operator would be allowed to load and unload kernel modules, which means enabling module unloading in the kernel (see https://github.com/siderolabs/pkgs/pull/1031) and not removing SYS_MODULE from containers (see https://github.com/siderolabs/talos/pull/9339).
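To make the two prerequisites concrete, here is a small Python sketch for checking them from inside a container or on a node. It assumes the kernel option in question is `CONFIG_MODULE_UNLOAD` (the standard Kconfig switch for module unloading; the linked PR is not quoted here) and relies on `CAP_SYS_MODULE` being capability bit 16 in `linux/capability.h`. The function names are illustrative.

```python
CAP_SYS_MODULE = 16  # bit number from linux/capability.h


def has_cap_sys_module(status_text):
    """Return True if the CapEff mask in /proc/<pid>/status text
    includes CAP_SYS_MODULE."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split(":", 1)[1].strip(), 16)
            return bool(mask & (1 << CAP_SYS_MODULE))
    return False


def module_unload_enabled(kernel_config_text):
    """Return True if a kernel config dump contains CONFIG_MODULE_UNLOAD=y."""
    return any(
        line.strip() == "CONFIG_MODULE_UNLOAD=y"
        for line in kernel_config_text.splitlines()
    )


if __name__ == "__main__":
    # Reports whether the current process could call init_module/delete_module.
    with open("/proc/self/status") as f:
        print("CAP_SYS_MODULE effective:", has_cap_sys_module(f.read()))
```

Running the first check inside a gpu-operator pod would show whether the capability survived the runtime's capability filtering; the second needs a kernel config dump, which may not be exposed on every system.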

Another important note is that only CDI mode (see https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/cdi.html) works on Talos with this patch. The "legacy" runtime hook requires more libraries to be present on the system, whereas the CDI hook is a pure Go program that only requires the glibc dynamic loader and /sbin/ldconfig.
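For context on what "the CDI hook" means on the host side: a CDI spec can declare hook binaries under `containerEdits.hooks`, and the runtime executes those on the host, which is why the host needs a working dynamic loader for them. The sketch below lists the hooks a spec declares. The field names follow the CDI spec format; every value in `EXAMPLE_SPEC` (paths, arguments, device name) is a made-up placeholder, not what the NVIDIA toolkit actually generates.

```python
def list_cdi_hooks(spec):
    """Collect (hookName, path) pairs from the spec-level and per-device
    containerEdits sections of a parsed CDI spec (a plain dict)."""
    edits = [spec.get("containerEdits") or {}]
    edits += [(d.get("containerEdits") or {}) for d in spec.get("devices") or []]
    return [
        (hook.get("hookName"), hook.get("path"))
        for edit in edits
        for hook in edit.get("hooks") or []
    ]


# Hypothetical spec for illustration only; all values are placeholders.
EXAMPLE_SPEC = {
    "cdiVersion": "0.5.0",
    "kind": "example.com/gpu",
    "devices": [
        {
            "name": "0",
            "containerEdits": {"deviceNodes": [{"path": "/dev/example0"}]},
        }
    ],
    "containerEdits": {
        "hooks": [
            {
                "hookName": "createContainer",
                "path": "/example/bin/cdi-hook",
                "args": ["cdi-hook", "example-subcommand"],
            }
        ]
    },
}

if __name__ == "__main__":
    for name, path in list_cdi_hooks(EXAMPLE_SPEC):
        print(name, path)
```

Each listed `path` is a binary the host must be able to execute, so it is a natural place to check for dynamic loader or ldconfig dependencies.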

jfroy, Sep 18 '24 17:09

We're open to moving to using CDI :+1:

frezbo, Sep 18 '24 17:09

See https://github.com/siderolabs/talos/pull/9339 for main discussion on loading kernel modules.

jfroy, Sep 18 '24 17:09

Patch has been updated to rework the glibc subtree to look like a merged /usr root.

jfroy, Sep 21 '24 06:09

I'm going to cut v1.9.0-alpha.0 of Talos, bump the extension validator, and then get back to this PR, thank you for your patience!

smira, Sep 26 '24 18:09

No rush, thank you for considering these PRs!

jfroy, Sep 26 '24 18:09

@jfroy is this good to go? I see the referenced PR for nvidia-container-toolkit is still not merged :thinking:

frezbo, Oct 18 '24 16:10

Yeah, I'm working on that internally. The team is just very busy with other work. We are focusing in particular on CDI and DRA. Sidero could package the toolkit in an extension: the wrappers are created by an installer component that is invoked by the operator and are not an intrinsic part of the toolkit.

In any case, this patch is needed for the CDI hook to run, so it's good to pick up no matter what.

jfroy, Oct 18 '24 16:10

/m

frezbo, Oct 18 '24 16:10

I'll wait over the weekend and see if our daily runs for the nvidia tests work with this change.

frezbo, Oct 18 '24 16:10