elemental-toolkit
Support loading drivers/modules at runtime
Is your feature request related to a problem? Please describe.
In Harvester, there is a need to load custom modules/drivers to support more hardware (e.g., Nvidia GPU drivers and tools).
The toolkit provides the ability to run arbitrary commands in the `after-install-chroot` stage, but users might install hardware on day 2 or upgrade a driver, which might lead to the need to install drivers at runtime.
Describe the solution you'd like
The ability to load drivers at runtime.
Describe alternatives you've considered
Additional context
An idea might be shipping the driver within a container image and loading it onto the host.
Related issues: https://github.com/harvester/harvester/issues/2764
@pgonin @agracey This is the feature request I've talked with you about.
With the current build-time load approach, we're facing two issues:
- Many vendor-supported kernel modules are behind a paywall (Nvidia GPU driver, Dell PowerFlex CSI driver), so we won't be able to ship their kernel modules in our ISO.
- In Harvester, we ship the same ISO to all customers. We can't foresee which drivers a user might need, so it's hard to include them at build time either.
At a glance, a few strategies come to mind:
- Adding modules as part of, or using the same strategy as, an OS upgrade. That means using `after-install-hook`, `after-upgrade-hook` and `after-reset-hook`.
- Adding additional paths with a persistent RW overlay; that way, under some constraints, extra files could be included.
- Directly remounting root in RW mode, installing whatever is needed and going back to RO.
- Installing on an ephemeral area during every boot using cloud-init. That implies having further RW ephemeral paths.
All of these come with tradeoffs and constraints. So my first question would be: what does runtime mean in this context? Are reboots acceptable?
Let me quickly discuss the tradeoffs and limitations of the four strategies I mentioned:
- I think this is the one I like the most. So why not treat these additions as OS upgrades? We just upgrade to the exact same OS but include additional `after-upgrade-hook`, `after-reset-hook` and `after-install-hook` steps. So basically adapt the hooks, call `elemental upgrade` and reboot. Note this is also a way to get value from the fallback partition in case of errors; we could even consider making the flow more flexible in that regard in case a few iterations are required (e.g. I don't want to keep a previous wrong iteration in fallback). I like this approach because it comes with only one simple constraint: it requires a reboot.
- We can make some paths persistent and writable by tweaking the layout configuration (see the sketch after this list). So to speak, we could make `/lib` RW and persistent. I dislike the idea: `/lib` won't be enough, soon there will be some module that requires extra files in some other place, or someone will expect to simply add modules by installing an RPM (writing to the rpmdb, docs, licenses, etc.). Moreover, this defeats the immutability concept; if at runtime we are able to mess with sensitive contents such as `/lib` or `/usr/lib`, what is the point of having RO paths?
- This option has caveats too. First, probably not relevant to us, it would not be possible if the installation is on top of a squash filesystem. Second, the OS runs on top of a loop device of limited size; in current head, by default, the loop device is created with an overhead of 256MB of free space on top of the OS tree, so adding heavy software here requires an installation that already accounts for it and sets the loop device to a suitable fixed size. Third, the loop device at boot (on main elemental-cli) is managed by systemd, and systemd sets the loop device in RO mode; I am not confident this can be changed on the fly, so this approach probably requires a reboot too.
- This option is similar to 2, but sets the RW paths on a tmpfs. The problem is that the installation would have to be instructed to happen at every boot, which is probably not acceptable if it takes a while and significantly slows down boot time. It also requires extra memory to hold all the binaries in tmpfs. It is likely to suffer from glitches here and there (e.g. missing some RW paths) and is also likely to allow config data loss (e.g. it was working fine, then after a reboot it is not working anymore because the admin forgot to put some runtime tweaks into cloud-init and no one remembers what they were).
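For illustration only, here is a minimal sketch of what the second strategy could look like as an `/oem` cloud-init file. The stage name, the `/run/cos/cos-layout.env` path and the `PERSISTENT_STATE_PATHS` variable are assumptions based on how the current immutable-rootfs module is typically configured, not a tested recipe:

```yaml
# /oem/91-persistent-lib.yaml (hypothetical file name)
stages:
  rootfs.after:
    - name: "Make /lib persistent and writable"
      # Append to the layout environment consumed by the immutable-rootfs
      # dracut module during the initramfs phase (variable name assumed).
      environment_file: /run/cos/cos-layout.env
      environment:
        PERSISTENT_STATE_PATHS: "/lib"
```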
So my personal take is that these additions should be driven as OS upgrades; other approaches feel hacky to me. How is Harvester delivering OS upgrades?
> So my first question would be: what does runtime mean in this context? Are reboots acceptable?
I guess it's acceptable. One scenario is adding new hardware on day 2; the user should put the node into maintenance mode first.
I think it's best if we can avoid the reboot. It would mean draining and rebooting the whole cluster instead of one node, in case the new module is not for a piece of hardware but e.g. a CSI driver. Also, keeping a different state for one node (in case only one node has a GPU) is tricky.
I wonder if including the driver in the base image but keeping it deactivated until the user agrees to the T&Cs would be possible (legally)? Then it would only be a modprobe to enable it.
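As a rough sketch of that idea (the module name `nvidia` and the blacklist file name are placeholders, and this assumes `/etc/modprobe.d` lives on a persistent path):

```bash
# Shipped in the image but kept out of the boot-time auto-load:
cat /etc/modprobe.d/90-nvidia-disabled.conf
# blacklist nvidia

# After the user accepts the T&Cs: remove the blacklist so the module also
# auto-loads on future boots, then load it immediately.
rm /etc/modprobe.d/90-nvidia-disabled.conf
modprobe nvidia
```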
My concern with that approach is whether we are allowed to bundle e.g. Nvidia GPU drivers or Dell PowerFlex client modules in our ISO. Would that be in violation of their T&Cs? Those drivers are normally behind a paywall.
I agree trying to include everything within the ISO won't be an appropriate solution; even if all drivers/modules/whatever were accessible and without licensing issues, we just can't pretend to match all use cases. There will always be some specific use case we did not think about.
The strategies I mentioned above do not imply new features, only sophisticated or complex use of current options. I'd say we could eventually try to properly list the steps for options 1 & 4 and see how they fit within Harvester use cases. I'll briefly write a step by step for both here within the issue. These should be the possible off-the-shelf solutions, ready to be used.
On a longer view I am foreseeing additional immutable-rootfs options. Let's imagine we set a persistent overlay on top of a current RO path, apply changes, then turn it into RO mode again and apply this RO layer to the default setup for any follow-up reboot. This, IMO, has a few engineering challenges, mostly around how to describe and implement it from a UX point of view. Also, for more sophisticated approaches like this I'd say we have to go beyond the current bash implementation and environment-variable setup; some more structured configuration syntax would be required. The current immutable-rootfs configuration, IMHO, is already non-obvious and rough. To expand its features we should code it better, with proper testing and so on. This dracut module is a very sensitive piece of code and it is currently hard to debug, test and follow its logic.
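Conceptually, the idea maps to standard overlayfs operations. This is just a hand-written illustration, not an existing toolkit feature, and the backing path under `/usr/local` is an assumption:

```bash
# Stack a writable overlay on top of a read-only path, backed by the
# persistent partition.
mkdir -p /usr/local/.overlay/lib/{upper,work}
mount -t overlay overlay \
  -o lowerdir=/lib,upperdir=/usr/local/.overlay/lib/upper,workdir=/usr/local/.overlay/lib/work \
  /lib

# ...install extra kernel modules into the merged /lib...

# Freeze the merged view again; re-assembling this overlay on every boot is
# exactly the part that would need proper immutable-rootfs support.
mount -o remount,ro /lib
```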
In short, I'll try to elaborate on the current options here in this issue; meanwhile, on the other side, I'd be happy to try to outline/design which features are missing within the immutable-rootfs module/code. However, I don't see it as something to achieve in the short term.
> We can make some paths persistent and writable by tweaking the layout configuration. So to speak, we could make `/lib` RW and persistent. I dislike the idea: `/lib` won't be enough, soon there will be some module that requires extra files in some other place, or someone will expect to simply add modules by installing an RPM (writing to the rpmdb, docs, licenses, etc.). Moreover, this defeats the immutability concept; if at runtime we are able to mess with sensitive contents such as `/lib` or `/usr/lib`, what is the point of having RO paths?
We are seeing another challenge with this approach. If the user installs RPMs that have dependencies, how can they access the repo in IBS? I think this is a quite common case for kmod or dkms RPMs. Can the user just use the SLES 15.4 repo? I feel it's different from the SLE Micro Rancher 5.3 repo.
> So my personal take is that these additions should be driven as OS upgrades; other approaches feel hacky to me. How is Harvester delivering OS upgrades?
This also means for each new release, we need to make sure those shipped drivers match the new kernel version too.
> If the user installs RPMs
This is not supported. It totally defeats the purpose of an immutable, image-based operating system.
> This also means for each new release, we need to make sure those shipped drivers match the new kernel version too.
Right. That's how it is with Linux kernel drivers.
However, this is only required when Elemental moves to a new SLE Micro release, like from SLE Micro 5.3 to SLE Micro 5.4.
Normal (maintenance) kernel updates would keep existing drivers.
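A quick, hedged way to check that on a node (the module path below is only an example of where an out-of-tree driver might be installed):

```bash
# Compare the kernel the module was built against with the running kernel.
modinfo -F vermagic /lib/modules/$(uname -r)/updates/nvidia.ko
uname -r
```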
@ibrokethecloud mentioned Nvidia has a way to ship the driver in a container image: https://gitlab.com/nvidia/container-images/driver/-/tree/main/sle15
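For reference, NVIDIA's driver container is typically started roughly along these lines; the exact image tag and mounts should be taken from NVIDIA's documentation, so treat this as a sketch only:

```bash
docker run -d --privileged --pid=host \
  -v /run/nvidia:/run/nvidia:shared \
  -v /var/log:/var/log \
  nvcr.io/nvidia/driver:<driver_version>-<os_tag>
```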
> If the user installs RPMs that have dependencies, how can they access the repo in IBS? I think this is a quite common case for kmod or dkms RPMs. Can the user just use the SLES 15.4 repo? I feel it's different from the SLE Micro Rancher 5.3 repo.
I'd say this is out of the scope of the elemental-toolkit. How the user actually accesses extra software is not under our control. The options evaluated here are based on the assumption that the user has access to the additional required software.
How are OS upgrades managed and delivered in Harvester across the cluster nodes? Are you coupling OS upgrades with new Harvester releases/versions?
Elemental does not perform any versioning logic when applying `elemental upgrade`, so it is possible to upgrade to the exact same system, to which you could eventually add upgrade hooks that include extra software. The new image after rebooting then includes all the needed bits. In that process only the RO OS image is changed; there are no changes to persistent data.
Installing extra software within a node OS upgrade
- Get the software in an accessible location (locally under a persistent path or having remote access)
- Create upgrade hooks in `/oem`, something similar to:

  ```yaml
  stages:
    after-upgrade-chroot:
      - name: "Installing drivers"
        commands:
          - |
            # Installation script goes here
            # for instance:
            # zypper addrepo ... && zypper in ...
            # wget -O - <script_url> | bash
  ```

- Run an `elemental upgrade`. Assuming the OS is in a public registry:

  ```bash
  elemental --debug upgrade --system.uri docker:<image_reg>:<image_tag>
  ```

  Assuming we want to use our current active image:

  ```bash
  elemental --debug upgrade --system.uri file:///run/initramfs/cos-state/cOS/active.img
  ```

- Reboot
This procedure could then be distributed in Rancher as a plan to clusters and their nodes in an orchestrated fashion by using fleet, rancher-system-agent and system-upgrade-controller.
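A system-upgrade-controller Plan for that could look roughly like the following; the namespace, node label and image references are placeholders, so treat this as a sketch rather than a validated manifest:

```yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: install-extra-drivers
  namespace: system-upgrade           # placeholder: wherever SUC is deployed
spec:
  concurrency: 1                      # roll through nodes one at a time
  version: "<image_tag>"              # used as the tag for spec.upgrade.image
  nodeSelector:
    matchLabels:
      needs-extra-driver: "true"      # placeholder node label
  serviceAccountName: system-upgrade
  cordon: true
  upgrade:
    image: <image_reg>                # OS image shipping the elemental CLI
    command: ["elemental"]
    args: ["--debug", "upgrade", "--system.uri", "docker:<image_reg>:<image_tag>"]
```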
> https://gitlab.com/nvidia/container-images/driver/-/tree/main/sle15
IMO, this needs to be the preferred way, with adding to the host allowed as a fallback.
Is the method used by the Nvidia container something that can be generalized to all drivers?
> How are OS upgrades managed and delivered in Harvester across the cluster nodes? Are you coupling OS upgrades with new Harvester releases/versions?
Yes, currently an OS version needs to go with a Harvester version, and we support upgrades to exactly the same version too. We can't use SUC to upgrade the system; the underlying RKE2 is managed by Rancher and we need to use provisioningv2 with hooks (see phase 4) to upgrade each node (and also to handle live migration).
Elemental's upgrade hooks are likely to work; we are just looking for ways to load drivers at runtime.