dracut-ng
feat(50-dracut.install): skip installation if zfs required but missing
On some distributions, external kernel modules are built after the kernel package has been installed. This is usually required because building external kernel modules requires part of the sources for that kernel version.
This presents us with the following problem: after the kernel package is installed, /sbin/installkernel is run, but dracut cannot yet create a functional initramfs if the root is zfs, since the zfs module has not been built. Dracut, however, does not treat this as a fatal failure and creates an initramfs anyway.
The package manager should schedule zfs for (re-)installation for the new kernel version and then (re-)trigger /sbin/installkernel for another run. On this second run a functional initramfs is installed.
However, if there is some interruption between the first and second run of /sbin/installkernel then we are left with a broken kernel+initramfs that is installed and presumably registered with the bootloader. Rebooting the system in this state would create a mess.
I therefore propose the following solution: we check here whether the root is zfs and whether we can find the zfs module for this kernel version. If not, we exit with the special exit code 77. This exit code causes kernel-install to skip all remaining plugins and hence effectively prevents the (broken) kernel from being installed and registered with the bootloader. Exit code 77 is not fatal, so the process calling /sbin/installkernel (i.e. make or the package manager) will continue the update process and update the external kernel modules.
On the second call to /sbin/installkernel by the package manager the check for zfs will now pass and the working kernel+initramfs is installed.
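A minimal sketch of such a check, for illustration only. The helper names (`root_fstype`, `zfs_module_present`, `check_zfs_root`) and the `find`-based module lookup are assumptions and are not taken from the actual patch, which may detect the module differently.

```shell
#!/bin/sh
# Hypothetical helpers; the names and the lookup strategy are illustrative only.

# Print the filesystem type mounted at "/" according to a mounts table
# (defaults to /proc/mounts). The last matching entry wins.
root_fstype() {
    awk '$2 == "/" { fstype = $3 } END { if (fstype != "") print fstype }' \
        "${1:-/proc/mounts}"
}

# Succeed if a zfs module file (optionally compressed) exists for the
# given kernel version below the module directory.
zfs_module_present() {
    kver=$1
    moddir=${2:-/lib/modules}
    [ -n "$(find "$moddir/$kver" -name 'zfs.ko*' -print 2>/dev/null | head -n 1)" ]
}

# Return 77 if the root is zfs but no zfs module exists for this kernel,
# and 0 otherwise.
check_zfs_root() {
    kver=$1
    mounts=${2:-/proc/mounts}
    moddir=${3:-/lib/modules}
    if [ "$(root_fstype "$mounts")" = "zfs" ] \
        && ! zfs_module_present "$kver" "$moddir"; then
        echo "root is zfs but no zfs module found for $kver, skipping" >&2
        return 77
    fi
    return 0
}
```

In the hook this could be called as something like `check_zfs_root "$KERNEL_VERSION" || exit $?`, assuming the kernel version is available in a variable such as `KERNEL_VERSION`, and placed after the other checks that decide whether dracut should run.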
In distributions that somehow install the zfs module before calling /sbin/installkernel, this check is redundant and does nothing.
If the zfs module is managed via dkms, the dkms hook is run before the dracut hook (number 45 versus 50) and therefore the zfs module should already be present at this stage and the added check does nothing.
We do not have to duplicate this check in the rescue hook since the special exit code 77 would cause it to be skipped anyway if the new check is triggered.
This pull request changes...
Changes
Checklist
- [x] I have tested it locally
- [x] I have reviewed and updated any documentation if relevant
- [ ] I am providing new code and test(s) for it
Fixes #
Wouldn't this make sense to be present in the zfs-dracut package in OpenZFS upstream?
> Wouldn't this make sense to be present in the zfs-dracut package in OpenZFS upstream?
Putting this check in the module itself does not help, since the whole point of this is raising the 77 exit code to kernel-install. In other words, if we put the check in the 90-zfs dracut module, we would have to change that module, dracut itself, and this kernel-install hook in order to propagate this exit code. I think this would be far more cumbersome than just doing the check where the exit code should be raised, i.e. here in the kernel-install hook.
I mean, couldn't you add either a separate module or a hook there to deal with it? After all, kernel-install has a hooks directory for that.
I'm pretty reluctant to include code for out of tree drivers in here.
> I mean, couldn't you add either a separate module or a hook there to deal with it? After all, `kernel-install` has a hooks directory for that.
Well yes, but that would change the behaviour of everything and not just dracut. I'm concerned this might have unintended side effects. Note that here I intentionally put the new check after all the other checks that determine if dracut should run.
possibly related - https://forums.gentoo.org/viewtopic-p-8862496.html?sid=b21ca6c9faaffc06d5fad3ac38bae05b
> possibly related - https://forums.gentoo.org/viewtopic-p-8862496.html?sid=b21ca6c9faaffc06d5fad3ac38bae05b
Yes! This is exactly the problem I am trying to solve here.
Please let me know if an upstream solution such as this is acceptable.
If not, that's also fine, but then I'll start work on a downstream solution.
Perhaps we should be more ambitious and less specific with this PR.
What do I mean?
Instead of handling the zfs module, could we somehow handle all dkms kernel modules that might be needed to boot (e.g. would the v4l2loopback kernel module be a similar scenario)?
Could we check all loaded kernel modules, instead of "just" /proc/mounts?
Anyway, issues like this Fedora issue are related: https://discussion.fedoraproject.org/t/fedora-dead-after-kernel-6-15-3-200-update-no-initramfs-kernel-panic-unable-to-mount-root-fs-on-unknown-block-0-0/156457/13
> Instead of handling the zfs module, could we somehow handle all dkms kernel modules that might be needed to boot (e.g. would the v4l2loopback kernel module be a similar scenario)?
Can we make a list of these modules? I can't think of any other than zfs. V4l2loopback definitely does not belong on this list; in fact, it does not need to be in the initramfs at all. The only kernel modules we absolutely need in the initramfs are those needed for finding and mounting the root file system, which means file system and block device drivers, and possibly networking drivers. As far as I know, zfs is the only big one in this category.
As a side note, recent versions of dkms already fail fatally if any kernel module build fails during kernel installation. Therefore, as I mentioned in my first comment, we don't need this check for the dkms use-case, but we do need it if the zfs module installation is handled by the package manager.
> Anyway, issues like this Fedora issue are related: https://discussion.fedoraproject.org/t/fedora-dead-after-kernel-6-15-3-200-update-no-initramfs-kernel-panic-unable-to-mount-root-fs-on-unknown-block-0-0/156457/13
This is something else. Last I checked, Fedora was using systemd kernel-install in a somewhat unconventional way where the kernel is installed before the initramfs is generated, leading to issues such as this one. Kernel-install is instead designed to use a staging area, which ensures that things are only installed after the initramfs etc. have been successfully generated. I left a comment about this on some bug report many months ago, after my changes in the kernel-install hook here broke their kernel installation workflow.
I have received yet another issue report about dracut installing a broken initramfs via kernel-install due to zfs being missing.
If there is no progress on this (and #1226) I am going to fork the kernel-install script and maintain it for Gentoo downstream. Note that this also means that I will stop contributing to this upstream script. This is a serious issue, with a simple solution, and I am tired of it being held up.
> On some distributions external kernel modules are built after the kernel package has been installed
Perhaps it would be useful to understand what percentage of dracut users using zfs are at risk of running into this problem. Do Fedora or Debian/Ubuntu have the same issue here as Gentoo?
I am sympathetic to supporting zfs the same way we support, e.g., btrfs, and to not treating it differently just because it is an out-of-tree dracut module. For reference, dracut is carrying all this zfs-only code for similar reasons: https://github.com/dracutdevs/dracut/pull/1711
> On some distributions external kernel modules are built after the kernel package has been installed
>
> Perhaps it would be useful to understand what percentage of dracut users using zfs are at risk of running into this problem. Do Fedora or Debian/Ubuntu have the same issue here as Gentoo?
If they are not using DKMS, then probably yes. As far as I know, it is pretty common to have external kernel module packages depend on the kernel itself and to re-trigger initramfs generation for a kernel that is already installed. In fact, I suspect that this is the original reason why Dracut does not treat missing external kernel modules as fatal.
Debian - https://packages.debian.org/sid/zfs-dkms
Ubuntu - https://packages.ubuntu.com/plucky/zfs-dkms
> Debian - https://packages.debian.org/sid/zfs-dkms
> Ubuntu - https://packages.ubuntu.com/plucky/zfs-dkms
As far as I know, neither Ubuntu nor Debian actually uses this script. See my other Pull Request, which adds a script for the Debian-based installkernel. Gentoo, on the other hand, is a consumer of this script, and Gentoo (meaning me) has also actually contributed to this script. I am getting a bit tired of debating hypotheticals when this works around a real issue for Gentoo users. I have already explained why I expect this to be a general problem for the non-DKMS use case.
Let me actually flip this around and ask you to explain to me why Dracut should install an initramfs that we know will not boot because it is missing the driver for the root partition file system? Who are we helping with this? What problem are we solving/preventing? If you do not wish to change the current behaviour, please defend why it is correct.
And no we don't have to make this more general because other drivers that may be missing are usually not fatal for the boot. The driver for the root partition on the other hand, that I can 100% guarantee you will cause boot failure if it is missing...
I truly do not understand why this is so controversial.
> If they are not using DKMS, then probably yes.
I failed to parse this sentence and I failed to understand that you meant "As far as I know, neither Ubuntu, nor Debian actually use this script." (I understood the opposite).
> Let me actually flip this around and ask you to explain to me why Dracut should install an initramfs that we know will not boot because it is missing the driver for the root partition file system?
Clearly I failed to communicate my intention. I have over 10 PRs waiting for feedback, and my real motivation is just to try to help out with reviews. Especially on this one: https://github.com/dracut-ng/dracut-ng/pull/1348
My reviews and feedback do not really count toward the official two-review requirement, so I just wanted to seek out more information to make it readily available for other reviewers with more karma.
> I truly do not understand why this is so controversial.
I think the fact that neither Debian nor Fedora (nor, it seems, Arch) is hitting this code path is critical information for reviews (sadly).
Ubuntu doesn't hit this in the default path because ZFS is bundled in their kernel packages. Debian and Fedora don't hit this because they use DKMS. RHEL and SUSE distributions use kmod packaging, which works around this problem too.
There's a much more straightforward way to solve this: kernel-install should be checking that all filesystems have matching kernel functionality (kmod or built-in) or bail out.
I will not accept a ZFS-specific solution. I will accept a generic one.
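As a rough illustration of what a generic check could look like, the sketch below tests whether a given kernel version provides a driver for a filesystem type, either as a loadable module (listed in `modules.dep`) or built in (listed in `modules.builtin`). The helper name `fs_driver_available` is hypothetical, and the sketch assumes the module is named after the filesystem type, which does not hold for every filesystem.

```shell
#!/bin/sh
# Hypothetical helper, not part of kernel-install or dracut.
# Usage: fs_driver_available KERNEL_VERSION FSTYPE [MODULE_DIR]
# Succeeds if MODULE_DIR/KERNEL_VERSION lists FSTYPE as a module or built-in.
# Assumption: the module file is named <fstype>.ko (optionally compressed).
fs_driver_available() {
    kver=$1
    fstype=$2
    moddir=${3:-/lib/modules}
    for index in modules.dep modules.builtin; do
        [ -r "$moddir/$kver/$index" ] || continue
        # Match "<fstype>.ko" as a path component, followed by ':'
        # (modules.dep) or by the end of the line (modules.builtin).
        if grep -Eq "(^|/)${fstype}\.ko(\.[a-z0-9]+)?(:|\$)" \
            "$moddir/$kver/$index"; then
            return 0
        fi
    done
    return 1
}
```

A hook could combine this with the root filesystem type from /proc/mounts and exit with 77 when the driver is missing. Pseudo filesystems such as proc or tmpfs appear in neither index, so a real implementation would need to skip them.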