Security/Critical bug: "Error on update: Failed to install initrd ///usr/lib/initrd.d/00-early-ucode.cpio"

Open GabeAl opened this issue 1 year ago • 7 comments

A few months ago, Clear Linux's default update process started to load some large files onto the tiny boot partition it created at install time (including the i915 firmware set, which for some users is totally useless and unwanted, but that's another issue).

Servers that are powered down for a while and then updated naturally go through several update cycles in one sitting.

The update system is completely, absolutely broken for these cases.

  1. It does not clean up after itself between update cycles initiated in the same sitting. Multiple copies of the i915 firmware and microcode etc are piled next to each other into the boot partition. Naturally, the space is exhausted.
  2. The installer cannot detect or handle this case or any case where the partition fills up. It doesn't check free space. Doesn't check how much it needs to copy. Doesn't check what's already in there. Just blindly dumps things in and chokes.
  3. The installer cannot detect failure from the previous step. It says fatal, and the installer proceeds and says the update was successful, but the kernel is terribly out of date and the boot partition has 0b free.

This is also a critical security flaw: a system that believes it is up to date lulls the user into a false sense of security, while in reality the uCode, kernel, and i915 firmware (if used) are months out of date, and users cannot easily detect or fix it.

This is a big one, which to me seems like a straightforward thing to fix (clean up initrd.d/boot partition in the update and between passes of an update, check for free space and delete old things as needed, and properly propagate critical errors through the installer, which should report it to the user).
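
As a rough illustration of the free-space check requested above, here is a minimal shell sketch. The helper names has_free_kib and file_kib are hypothetical and are not part of clr-boot-manager or swupd; the sketch only shows that comparing a file's size against df output before copying is cheap.

```shell
#!/bin/sh
# Hypothetical helpers, not part of clr-boot-manager or swupd.

# Succeed only if the filesystem holding directory $1 has at least $2 KiB free.
has_free_kib() {
    dir=$1
    needed_kib=$2
    # df -P prints a header plus one data line; column 4 is available 1K blocks.
    avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge "$needed_kib" ]
}

# Print the size of file $1 in KiB, rounded up.
file_kib() {
    bytes=$(wc -c < "$1")
    echo $(( (bytes + 1023) / 1024 ))
}

# Example use (commented out so that sourcing this file has no side effects):
# src=/usr/lib/initrd.d/00-early-ucode.cpio
# has_free_kib /boot "$(file_kib "$src")" || { echo "not enough space on /boot" >&2; exit 1; }
```

A real check would also have to account for the temporary copy that exists alongside the old file during replacement, so the required space is closer to the size of the new file than to the size difference.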

image (5)

More details here: https://community.clearlinux.org/t/error-on-update-failed-to-install-initrd-usr-lib-initrd-d-00-early-ucode-cpio/9796

GabeAl avatar Oct 10 '24 20:10 GabeAl

This is an excellent first step.

Can you help with any of the existing systems that are already installed and are suffering from this bug? What can I do to clean up and nudge the update forward?

GabeAl avatar Oct 11 '24 18:10 GabeAl

Can you help with any of the existing systems that are already installed and are suffering from this bug? What can I do to clean up and nudge the update forward?

Root cause is you're out of space in your EFI partition. clr-boot-manager attempts to copy the new files in as tempfiles, then mv them to the final destination atomically, so your /boot is never incomplete. In the short term, you need to make some room in /boot long enough for clr-boot-manager to do that. In your case specifically, I'd temporarily move the Microsoft directory out of /boot, run sudo clr-boot-manager update, then copy the Microsoft directory back in. That 28 MB should be enough to copy a file at a time, I think.
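
The copy-then-rename pattern described above can be sketched in shell roughly as follows. This is an illustration of the general technique only, not clr-boot-manager's actual code; the function name atomic_install is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of "copy to a temp file, then rename into place".
atomic_install() {
    src=$1
    dest=$2
    # The temp file must live in the destination directory, because a rename
    # can only replace the target in one step within a single filesystem.
    tmp=$(mktemp "$(dirname "$dest")/.tmp.XXXXXX") || return 1
    if cp "$src" "$tmp" && mv -f "$tmp" "$dest"; then
        return 0
    fi
    # The copy failed (for example, no space left): remove the partial temp
    # file and leave the previous destination file untouched.
    rm -f "$tmp"
    return 1
}
```

The pattern also shows why a nearly full partition fails: for a moment both the old file and the complete new copy have to fit at the same time.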

If you don't want the i915 drivers at all, you can mask that cpio completely using these instructions from man clr-boot-manager:

/etc/kernel/initrd.d/* A set of files that will be used as additional user’s freestanding initrd files. Additional initrd arguments will be added to the kernel argument list, if desired the user may mask out system installed initrd files by creating symbolic links within /etc/kernel/initrd.d pointing to /dev/null with same name of system installed files.

This step doesn't currently work -- I found and am fixing a bug in clr-boot-manager

For example: sudo mkdir -p /etc/kernel/initrd.d && sudo ln -s /dev/null /etc/kernel/initrd.d/i915-firmware.cpio -- then delete the i915 cpio from /boot manually, and when you re-run clr-boot-manager update it won't put it back.
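
To see which system initrds are currently masked this way, a small sketch like the following could help. The function name list_masked_initrds is hypothetical; it only reports the state of the symlinks and, given the bug mentioned above, says nothing about whether clr-boot-manager honors them.

```shell
#!/bin/sh
# Hypothetical helper: print the names of system initrds that have a
# same-named symlink to /dev/null in the user directory.
# $1: system directory (e.g. /usr/lib/initrd.d)
# $2: user directory   (e.g. /etc/kernel/initrd.d)
list_masked_initrds() {
    sys_dir=$1
    user_dir=$2
    for f in "$sys_dir"/*; do
        [ -e "$f" ] || continue
        name=$(basename "$f")
        link="$user_dir/$name"
        if [ -L "$link" ] && [ "$(readlink "$link")" = /dev/null ]; then
            echo "$name"
        fi
    done
}

# Example use (commented out):
# list_masked_initrds /usr/lib/initrd.d /etc/kernel/initrd.d
```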

We're looking at ways to improve compression on those initrds, or other ways to shrink them, but ultimately, you're going to need a bigger EFI partition, especially to support dual boot. Hopefully our new 512 MB default will be much friendlier to that going forward. We're also considering ideas for safely cleaning up older files that are currently left behind.

bwarden avatar Oct 11 '24 21:10 bwarden

Interesting but I think the problem goes deeper still.

It's not checking for used space or cleaning up after itself.

I don't remember how to mount /boot from wherever it's hidden and figure out what files to delete, it took me 3 hours to figure it out last time. Is there an easier way?

How can I resize my boot partition?

Another part of the problem is it doesn't properly defer upgrades when multiple firmware releases are made in one upgrade cycle without a reboot in between. It never deletes the old or intermediate versions -- old is currently running (because no reboot) and intermediate is still not "current", so it just keeps writing crap.

It also doesn't clean broken files before trying to put new ones in.

It also doesn't pass the error onto the actual updater (swupd) so there is no flag that any error took place.

Yes please also fix the bug where users can't mask out the unused firmware.

I think the underlying algorithm needs a bit of an overhaul -- even a gigabyte of space won't prevent this from happening after a couple of "multi-upgrade" cycles (without rebooting).

GabeAl avatar Oct 11 '24 22:10 GabeAl

We'll be looking at size checks and error reporting too. There's not a good mechanism to bubble that up through swupd, so it won't be a quick thing.

Another part of the problem is it doesn't properly defer upgrades when multiple firmware releases are made in one upgrade cycle without a reboot in between. It never deletes the old or intermediate versions -- old is currently running (because no reboot) and intermediate is still not "current", so it just keeps writing crap.

I'm really curious about this one -- in your examples I don't see multiple versions of things other than the kernel itself, and we keep only up to 3 copies (which certainly add up) of the kernel to make sure you always have at least one that has been proven to boot on your system. And that later gets pruned down to 2 in most cases. Can you show an example of extra or broken files you saw before cleaning them up manually? The cpio files don't have versioned names, so they should be fully replaced every time, not accumulating.

I don't remember how to mount /boot

sudo systemctl start boot.mount; sudo -s

How can I resize my boot partition?

You'd need to use something like GParted, booting from an external device so you can shrink and move the root partition to make room to expand the EFI partition.

bwarden avatar Oct 11 '24 22:10 bwarden

It looks like even more junk is being copied to the boot partition now. Things are really getting out of control. It now requires manually scouring older systems every update or else it breaks.

Also a new problem: now the commandline for kernel boot is specifically referring to the intel i915 firmware bundle EVEN IF THE SYSTEM IS AMD. So the system literally will not boot unless I manually hit 'e' during boot and erase that reference to the firmware file I just spent time erasing each cycle.
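
A dangling reference like this can be detected before rebooting. The sketch below assumes the entries under loader/entries use Boot Loader Specification style "initrd <path>" lines with paths relative to the root of the boot partition; if the reference lives in the options line instead, the awk filter would need adjusting. The function name find_dangling_initrds is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: for every boot entry under $1/loader/entries, print
# "<entry file>: <path>" for each initrd line whose file is missing under $1.
find_dangling_initrds() {
    boot=$1
    for entry in "$boot"/loader/entries/*.conf; do
        [ -e "$entry" ] || continue
        # Assumed entry format: "initrd /EFI/org.clearlinux/<file>".
        awk '$1 == "initrd" { print $2 }' "$entry" | while read -r path; do
            [ -e "$boot/${path#/}" ] || echo "$(basename "$entry"): $path"
        done
    done
}

# Example use, as root with the EFI partition mounted (commented out):
# find_dangling_initrds /boot
```

Any line it prints names an entry that points at a file which is no longer on the partition.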

This really needs a proper fix. Please revisit the logic of 'adding things to initrd during boot'. I don't think any presumed benefits have actually been tested to be worthwhile especially at the cost of constantly corrupting the boot partition, leaving older systems silently running vulnerable kernels, and the inflexibility of not being able to turn this broken "feature" off.

GabeAl avatar Dec 25 '24 16:12 GabeAl

We'll be looking at size checks and error reporting too. There's not a good mechanism to bubble that up through swupd, so it won't be a quick thing.

Save the log during the "copy tons of junk to boot" operation. Add a task to read the log to the swupd update scripts. Return -1 if error. Have swupd read the return code.
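
Roughly what I mean, as a sketch. The wrapper name run_logged and the log path are made up, and I am assuming the update scripts can act on an exit status; how swupd actually invokes its post-update steps may differ.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, append its output to a log file,
# and hand the command's exit status back to the caller.
# $1: log file; remaining arguments: the command to run.
run_logged() {
    log=$1
    shift
    status=0
    "$@" >>"$log" 2>&1 || status=$?
    if [ "$status" -ne 0 ]; then
        echo "FAILED (exit $status): $*" >>"$log"
    fi
    return "$status"
}

# Example use (commented out; the log path is an assumption):
# run_logged /var/log/boot-update.log clr-boot-manager update || exit 1
```

One detail: shell exit statuses are limited to 0-255, so a literal -1 would show up as 255. Any non-zero value does the job.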

Another part of the problem is it doesn't properly defer upgrades when multiple firmware releases are made in one upgrade cycle without a reboot in between. It never deletes the old or intermediate versions -- old is currently running (because no reboot) and intermediate is still not "current", so it just keeps writing crap.

I'm really curious about this one -- in your examples I don't see multiple versions of things other than the kernel itself, and we keep only up to 3 copies (which certainly add up) of the kernel to make sure you always have at least one that has been proven to boot on your system. And that later gets pruned down to 2 in most cases. Can you show an example of extra or broken files you saw before cleaning them up manually? The cpio files don't have versioned names, so they should be fully replaced every time, not accumulating.

Not everyone can update their system one increment at a time. Not every system is constantly "live" and internet-connected. Whenever more than 1 kernel bump happens during a single update (spanning an 'intermediate' update), the pruning DOES NOT HAPPEN during that same update cycle. All files from all intermediate updates are copied all at once, and no pruning takes place. It craps the bed and reports success. Systems are left silently booting vulnerable kernels ready to be hacked.

image

There are many old kernel versions still in here (the three from before the update was started, plus all of the versions installed in the next updates), as well as both a .zst and a non-.zst version of the i915 firmware. It also seems to have tried adding a totally new file to the same place (adding insult to injury), but it scrolled by too fast to see what new "firmware" it tried to copy in, and no trace is left because the space was already 100% full by that point and nothing was created.

This is totally broken.

You'd need to use something like GParted, booting from an external device so you can shrink and move the root partition to make room to expand the EFI partition.

It fails to resize because the EFI partition comes before the root partition... and the root partition can be large (mine is 15 TB) and often fails to move. This is not great. Please simply add an option to disable this VERY unwanted 'additional firmware sideload' and patch the gaping security hole created by this ill-conceived feature.

GabeAl avatar Dec 25 '24 16:12 GabeAl

The root of the problems you're experiencing is that your /boot is full, and we don't handle that cleanly. You've identified an important issue that masking an initrd doesn't remove it from the kernel command line in the bootloader entries, which we'll need to fix before we look at making i915 firmware truly optional. And you're right that clr-boot-manager should be able to communicate failures up to swupd for the sake of reporting if not fixing failures like this. In the meantime, though, if we can clean up your /boot, we should be able to get you back to a configuration that works smoothly.

Checking out my EFI partition, as an example:

# Mount EFI partition at /boot
$ sudo systemctl start boot.mount
# Check my current running kernel version
$ uname -r
6.10.12-1467.native
# Check contents of /boot
$ sudo ls -lR /boot
/boot:
total 2
drwx------ 2 root root 512 Jun 27  2024 BIOS
drwx------ 4 root root 512 Jul 18  2019 EFI
drwx------ 3 root root 512 Dec 11 06:22 loader

/boot/BIOS:
total 0

/boot/EFI:
total 3
drwx------ 2 root root  512 Jul 18  2019 BOOT
drwx------ 2 root root 2560 Dec 11 06:22 org.clearlinux

/boot/EFI/BOOT:
total 96
-rwx------ 1 root root 97634 Jul 18  2019 BOOTX64.EFI

/boot/EFI/org.clearlinux:
total 95926
-rwx------ 1 root root   936619 Oct 16 12:14 bootloaderx64.efi
-rwx------ 1 root root 12962304 Nov  5 18:33 freestanding-00-early-ucode.cpio
-rwx------ 1 root root 38665256 Jun 25  2024 freestanding-clr-init.cpio.gz
-rwx------ 1 root root  8147789 Nov  5 18:33 freestanding-i915-firmware.cpio.zst
-rwx------ 1 root root   858624 Oct  3 11:40 initrd-org.clearlinux.native.6.10.12-1467
-rwx------ 1 root root   866816 Dec 11 06:22 initrd-org.clearlinux.native.6.12.4-1518
-rwx------ 1 root root 17203200 Oct  3 11:40 kernel-org.clearlinux.native.6.10.12-1467
-rwx------ 1 root root 17613312 Dec 11 06:22 kernel-org.clearlinux.native.6.12.4-1518
-rwx------ 1 root root   129536 Oct 16 12:14 loaderx64.efi
-rwx------ 1 root root   843105 Oct 16 12:14 mmx64.efi

/boot/loader:
total 2
drwx------ 2 root root 1536 Dec 11 06:22 entries
-rwx------ 1 root root   54 Dec 11 06:22 loader.conf

/boot/loader/entries:
total 2
-rwx------ 1 root root 719 Nov  5 18:33 Clear-linux-native-6.10.12-1467.conf
-rwx------ 1 root root 717 Dec 11 06:22 Clear-linux-native-6.12.4-1518.conf

You should at least have files for the current running kernel and the latest kernel. If you have additional kernel bundles installed you'll see more files, but for the sake of space, you'd probably want to pick a single kernel type.
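
To find candidates for manual pruning, something like the following sketch can list kernel images that are neither the newest nor the running one. It assumes a single kernel type, the file naming shown in the listing above, and GNU sort for the -V option; the function name list_prunable_kernels is hypothetical. It only prints names and deletes nothing.

```shell
#!/bin/sh
# Hypothetical helper: list kernel images in directory $1 that are neither
# the newest one nor the running one.
# $2 is the file-name suffix of the running kernel, e.g. native.6.10.12-1467
# (uname -r prints the same kernel as 6.10.12-1467.native).
list_prunable_kernels() {
    dir=$1
    running="kernel-org.clearlinux.$2"
    # sort -V (GNU coreutils) orders embedded version numbers naturally.
    all=$(for f in "$dir"/kernel-org.clearlinux.*; do
              if [ -e "$f" ]; then basename "$f"; fi
          done | sort -V)
    latest=$(printf '%s\n' "$all" | tail -n 1)
    # Drop the exact names of the latest and running kernels; print the rest.
    printf '%s\n' "$all" | grep -v -x -F -e "$latest" -e "$running" || :
}

# Example use, as root with the EFI partition mounted (commented out):
# list_prunable_kernels /boot/EFI/org.clearlinux native.6.10.12-1467
```

Each kernel listed has a matching initrd-org.clearlinux.* file and a loader entry that would need the same treatment.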

As for the initrds:

  • freestanding-00-early-ucode.cpio: microcode update -- you generally want this unless you and your motherboard vendor are really good at keeping your BIOS updated. This one can't be compressed, unfortunately.
  • freestanding-clr-init.cpio.gz: installed by the bootloader-extras bundle, which you likely only need if you have a complex root filesystem, such as LVM2, RAID, and/or encryption. Hopefully you don't have this one in the first place.
  • freestanding-i915-firmware.cpio.zst: Now compressed, I got it down from about 24 MB to 8 MB, so that should help. As you noticed, the old, non-compressed version wasn't deleted because other file writes failed first.
  • initrd-org.clearlinux.native.*: these are very small initrds that just provide keyboard drivers.

I would try mounting and pruning your /boot down to something similar (and moving any non-Clear Linux content somewhere safe temporarily) so you have enough free space for an update to complete properly.
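
When deciding what to prune, it can help to see which files take the most room. A minimal sketch, with the hypothetical helper name largest_files, that prints byte sizes and paths, largest first:

```shell
#!/bin/sh
# Hypothetical helper: print the N largest files under directory $1 as
# "<bytes> <path>", largest first. $2 is N and defaults to 10.
largest_files() {
    dir=$1
    count=${2:-10}
    find "$dir" -type f -exec sh -c '
        for f do
            printf "%s %s\n" "$(wc -c < "$f")" "$f"
        done' sh {} + | sort -rn | head -n "$count"
}

# Example use, as root with the EFI partition mounted (commented out):
# largest_files /boot 10
```

It reports apparent file sizes rather than allocated blocks, which is close enough for choosing what to move or delete.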

bwarden avatar Jan 03 '25 21:01 bwarden