
Support loading drivers/modules at runtime

Open bk201 opened this issue 2 years ago • 13 comments

Is your feature request related to a problem? Please describe.

In Harvester, there is a need to load custom modules/drivers to support more hardware (e.g., Nvidia GPU drivers and tools). The toolkit provides the ability to run arbitrary commands in the after-install-chroot stage, but users might install hardware on day 2 or upgrade a driver, which can lead to the need to install drivers at runtime.

Describe the solution you'd like

The ability to load drivers at runtime.

Describe alternatives you've considered

Additional context

An idea might be to ship the driver within a container image and load it onto the host.

Related issues: https://github.com/harvester/harvester/issues/2764

bk201 avatar Mar 08 '23 11:03 bk201

@pgonin @agracey This is the feature request I've talked with you about.

With the current build-time loading approach, we're facing two issues:

  1. Many vendor-supported kernel modules are behind a paywall (the Nvidia GPU driver, the Dell PowerFlex CSI driver), so we won't be able to ship their kernel modules in our ISO.
  2. In Harvester, we ship the same ISO to all customers. We can't foresee which drivers a user might need, so it's hard to include them at build time either.

yasker avatar Mar 08 '23 16:03 yasker

At a glance, a few strategies come to mind:

  1. Adding modules as part of, or using the same strategy as, an OS upgrade. That means using the after-install-hook, after-upgrade-hook and after-reset-hook stages.
  2. Adding additional paths with a persistent RW overlay; that way, under some constraints, extra files could be included.
  3. Directly remounting root in RW mode, installing whatever is needed and going back to RO.
  4. Installing into an ephemeral area on every boot by using cloud-init. That implies having further RW ephemeral paths.

All of these come with tradeoffs and constraints. So my first question would be: what does runtime mean in this context? Are reboots acceptable?

Let me quickly go through the tradeoffs and limitations of the four strategies I mentioned:

  1. I think this is the one I like the most. Why not treat these additions as OS upgrades? We would simply upgrade to the same OS but include additional after-upgrade-hook, after-reset-hook and after-install-hook steps. So basically adapt the hooks, call elemental upgrade and reboot. Note this can easily take advantage of the fallback partition in case of errors; we could even consider making the flow more flexible in that regard in case a few iterations are required (e.g. I don't want to keep a previous wrong iteration in fallback). I like this approach because it comes with a single, simple constraint: it requires a reboot.
  2. We can make some paths persistent and writable by tweaking the layout configuration. So to speak, we could make /lib RW and persistent (see the sketch after this list). I dislike the idea: /lib won't be enough, soon there will be some module that requires extra files in some other place, or someone will expect to simply add modules by installing an RPM (writing to the rpmdb, docs, licenses, etc.). Moreover, this defeats the immutability concept; if at runtime we are capable of messing with sensitive contents such as /lib or /usr/lib, what is the point of having RO paths?
  3. This option has caveats too. First, probably not relevant to us, it would not be possible if the installation is on top of a squash filesystem. Second, the OS runs on top of a loop device of limited size; in current head, by default, the loop device is created with an overhead of 256MB of free space on top of the OS tree, so adding heavy software here requires an installation that already accounts for it and sets the loop device to a suitable fixed size. The third problem with this approach is that the loop device at boot (on main elemental-cli) is managed by systemd, and systemd sets the loop device in RO mode; I am not confident this can be changed on the fly, so this approach probably requires a reboot too.
  4. This option is similar to option 2, but sets the RW paths on a tmpfs. The problem is that the installation would have to happen at every boot, which is probably not acceptable if it takes a while and significantly slows down the boot time. It also requires extra memory to hold all the binaries in tmpfs. It is likely to suffer from glitches here and there (e.g. missing some RW paths) and also makes configuration loss likely (e.g. it was working fine, the node reboots and it is not working anymore because the admin forgot to put in cloud-init some tweak done at runtime and no one remembers what it was).
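
To illustrate option 2, here is a minimal sketch of what the layout tweak could look like, assuming the immutable-rootfs module still reads layout overrides such as PERSISTENT_STATE_PATHS from /run/cos/cos-layout.env via a cloud-init rootfs stage (variable and file names here are from memory and may differ between toolkit versions):

```yaml
# /oem/91_layout.yaml -- hypothetical file name, sketch only.
# PERSISTENT_STATE_PATHS and /run/cos/cos-layout.env are assumed names,
# check the immutable-rootfs module documentation for the exact variables.
name: "Persistent module paths"
stages:
  rootfs:
    - name: "Keep /lib/modules and /lib/firmware persistent and RW"
      environment_file: /run/cos/cos-layout.env
      environment:
        PERSISTENT_STATE_PATHS: "/lib/modules /lib/firmware"
```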

So my personal take is that these additions should be driven as OS upgrades; other approaches feel hacky to me. How is Harvester delivering OS upgrades?

davidcassany avatar Mar 09 '23 06:03 davidcassany

> So my first question would be: what does runtime mean in this context? Are reboots acceptable?

I guess it's acceptable. One scenario is adding new hardware on day 2; the user should put the node into maintenance mode first.

bk201 avatar Mar 09 '23 09:03 bk201

I think it's best if we can avoid the reboot. It's going to mean draining and rebooting the whole cluster instead of one node, in case the new module is not for a piece of hardware but e.g. a CSI driver. Also, keeping a different state on one node (in case only one node has a GPU) is tricky.

yasker avatar Mar 09 '23 13:03 yasker

I wonder if including the driver in the base image but keeping it deactivated until the user agrees to the T&Cs would be possible (legally)? Then it would only take a modprobe to enable it.
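
A rough sketch of how that gating could look with a cloud-init boot stage; the flag file path and module name are made up for illustration:

```yaml
# Hypothetical sketch: only load the bundled module once the admin has
# recorded acceptance of the vendor terms (the flag file path is made up).
name: "Conditional driver load"
stages:
  boot:
    - name: "Load vendor module if terms were accepted"
      commands:
        - |
          if [ -f /oem/vendor-terms-accepted ]; then
            modprobe nvidia
          fi
```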

agracey avatar Mar 09 '23 18:03 agracey

My concern with that approach is whether we are allowed to bundle e.g. the Nvidia GPU driver or Dell PowerFlex client modules in our ISO. Would that be in violation of their T&Cs? Those drivers are normally behind a paywall.

yasker avatar Mar 09 '23 19:03 yasker

I agree that trying to include everything within the ISO won't be an appropriate solution. Even if all drivers/modules were accessible and without licensing issues, we just can't pretend to match all use cases. There will always be some specific use case we did not think about.

The strategies I mentioned above do not imply new features, but rather a sophisticated or complex use of current options. I'd say we could eventually try to properly list the steps for options 1 & 4 and see how they fit within the Harvester use cases. I'll briefly write a step-by-step for both here within the issue. These should be the possible off-the-shelf solutions, ready to be used.

On a longer view I am foreseeing additional immutable-rootfs options. Let's imagine we set a persistent overlay on top of a currently RO path, apply changes, then turn it back into RO mode and make that RO layer part of the default setup for any follow-up reboot (see the sketch below). This, IMO, has a few engineering challenges, mostly around how to describe and implement it from a UX point of view. Also, for more sophisticated approaches like this, I'd say we have to go beyond the current bash implementation and environment-variables setup; some more structured configuration syntax would be required. The current immutable-rootfs configuration, IMHO, is already non-obvious and rough. To expand its features we should code it better, with proper testing and so on. This dracut module is a very sensitive piece of code and it is currently hard to debug, test and follow its logic.
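
Conceptually, the overlay idea boils down to something like the following. This is not a current toolkit feature, just a sketch of the bare overlayfs mechanics with made-up paths:

```sh
# Not a toolkit feature, just the overlayfs mechanics the idea builds on.
# Upper/work dirs would live on a persistent partition (paths are made up).
mkdir -p /usr/local/.drivers/upper /usr/local/.drivers/work
mount -t overlay overlay \
  -o lowerdir=/lib/modules,upperdir=/usr/local/.drivers/upper,workdir=/usr/local/.drivers/work \
  /lib/modules

# ... install the extra modules into the (now writable) /lib/modules ...
depmod -a

# Seal it again; on every following boot the same upperdir would have to be
# re-stacked, which is the part that needs proper immutable-rootfs support.
mount -o remount,ro /lib/modules
```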

In short, I'll try to elaborate on the current options here in this issue; meanwhile, on the other side, I'd be happy to try to outline/design which features are missing within the immutable-rootfs module/code. However, I don't see it as something achievable in the short term.

davidcassany avatar Mar 09 '23 19:03 davidcassany

> 2. We can make some paths persistent and writable by tweaking the layout configuration. So to speak, we could make /lib RW and persistent. I dislike the idea: /lib won't be enough, soon there will be some module that requires extra files in some other place, or someone will expect to simply add modules by installing an RPM (writing to the rpmdb, docs, licenses, etc.). Moreover, this defeats the immutability concept; if at runtime we are capable of messing with sensitive contents such as /lib or /usr/lib, what is the point of having RO paths?

We are seeing another challenge with this approach. If the user installs RPMs that have dependencies, how can they access the repo in IBS? I think this is a quite common case for kmod or dkms RPMs. Can the user just use the SLES 15.4 repo? I feel it's different from the SLE Micro Rancher 5.3 repo.

> So my personal take is that these additions should be driven as OS upgrades; other approaches feel hacky to me. How is Harvester delivering OS upgrades?

This also means that for each new release, we need to make sure the shipped drivers match the new kernel version too.

bk201 avatar Mar 16 '23 08:03 bk201

> If the user installs RPMs

This is not supported. It totally defeats the purpose of an immutable, image-based operating system.

> This also means that for each new release, we need to make sure the shipped drivers match the new kernel version too.

Right. That's how it is with Linux kernel drivers.

However, this is only required when Elemental moves to a new SLE Micro release. Like from SLE Micro 5.3 to SLE Micro 5.4

Normal (maintenance) kernel updates would keep existing drivers.

kkaempf avatar Mar 16 '23 12:03 kkaempf

@ibrokethecloud mentioned Nvidia has a way to ship the driver in a container image: https://gitlab.com/nvidia/container-images/driver/-/tree/main/sle15
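
For context, the driver container pattern roughly comes down to running a privileged container that builds/loads the kernel module against the running host kernel, so nothing has to be written into the RO OS image. A very rough sketch of the idea; the image name and volumes are illustrative, not the exact Nvidia invocation (see the linked repository for the real one):

```sh
# Illustrative only: a privileged container sharing the host PID namespace
# that compiles/loads the module for the running kernel.
docker run -d --name vendor-driver \
  --privileged --pid=host \
  -v /run/vendor:/run/vendor:shared \
  -v /var/log:/var/log \
  registry.example.com/vendor/driver:latest-sle15
```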

bk201 avatar Mar 17 '23 01:03 bk201

> If the user installs RPMs that have dependencies, how can they access the repo in IBS? I think this is a quite common case for kmod or dkms RPMs. Can the user just use the SLES 15.4 repo? I feel it's different from the SLE Micro Rancher 5.3 repo.

I'd say this is out of the scope of the elemental-toolkit. How the user actually accesses extra software is not under our control. The options evaluated here are based on the assumption that the user has access to the additional required software.

How are OS upgrades managed and delivered in Harvester across the cluster nodes? Are you coupling OS upgrades with new Harvester releases/versions?

Elemental does not perform any versioning logic when applying elemental upgrade, so it is possible to upgrade to the exact same system, and in that upgrade you could add upgrade hooks to include extra software. The new image after rebooting then includes all the needed stuff. In this process only the RO OS image is changed; there are no changes to persistent data.

Installing extra software within a node OS upgrade

  1. Get the software into an accessible location (locally under a persistent path, or reachable remotely).

  2. Create upgrade hooks in /oem, something similar to:

     ```yaml
     stages:
       after-upgrade-chroot:
         - name: "Installing drivers"
           commands:
             - |
               # Installation script goes here
               # for instance:
               # zypper addrepo ... && zypper in ...
               # wget -O - <script_url> | bash
     ```

  3. Run an elemental upgrade.

     Assuming the OS image is in a public registry: `elemental --debug upgrade --system.uri docker:<image_reg>:<image_tag>`

     Assuming we want to reuse our current active image: `elemental --debug upgrade --system.uri file:///run/initramfs/cos-state/cOS/active.img`

  4. Reboot

Then, in Rancher, this procedure could be distributed as a plan to clusters and their nodes in an orchestrated fashion by using Fleet, rancher-system-agent and system-upgrade-controller.
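
For reference, a hypothetical system-upgrade-controller Plan wrapping the steps above; the names, label and images are made up, and the exact wiring depends on how the upgrade image invokes elemental:

```yaml
# Hypothetical Plan: runs the elemental upgrade (with the driver hooks in
# /oem) node by node, draining each node first. All names are illustrative.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: install-extra-drivers
  namespace: system-upgrade
spec:
  concurrency: 1
  version: v1.0.0
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: needs-extra-drivers, operator: Exists}
  drain:
    force: true
  upgrade:
    image: registry.example.com/os-with-drivers
    command: ["elemental"]
    args: ["--debug", "upgrade", "--system.uri", "docker:registry.example.com/os-with-drivers:v1.0.0"]
```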

davidcassany avatar Mar 17 '23 13:03 davidcassany

> https://gitlab.com/nvidia/container-images/driver/-/tree/main/sle15

IMO, this should be the preferred way, with adding to the host allowed as a fallback.

Is the method used by the nvidia container something that can be generalized to all drivers?

agracey avatar Mar 18 '23 17:03 agracey

> How are OS upgrades managed and delivered in Harvester across the cluster nodes? Are you coupling OS upgrades with new Harvester releases/versions?

Yes, currently an OS version needs to go with a Harvester version, and we support upgrades to exactly the same version too. We can't use SUC to upgrade the system; the underlying RKE2 is managed by Rancher, and we need to use provisioningv2 with hooks (see phase 4) to upgrade each node (and also to handle live migration).

Elemental's upgrade hooks are likely to work; it's just that we are looking for ways to load drivers at runtime.

bk201 avatar Mar 20 '23 01:03 bk201