osbuild-composer
rpm-ostree upgrade fails in edge-commit RHEL-9.6
Describe the bug
Our CI has detected RHEL-9.6 edge-commit fails after the changes introduced by PR https://github.com/osbuild/osbuild-composer/pull/4569
rpm-ostree upgrade fails to upgrade the system.
After ostree image/commit upgrade is built, the edge system detects there's an upgrade available, but after rpm-ostree upgrade and reboot, the system rolls back to the previous deployment and the update is not applied.
Environment
- OS version (/etc/os-release and /etc/redhat-release):
  source /etc/os-release
  NAME='Red Hat Enterprise Linux'
  VERSION='9.6 (Plow)'
  ID=rhel
  ID_LIKE=fedora
  VERSION_ID=9.6
  PLATFORM_ID=platform:el9
  PRETTY_NAME='Red Hat Enterprise Linux 9.6 Beta (Plow)'
  ANSI_COLOR='0;31'
  LOGO=fedora-logo-icon
  CPE_NAME=cpe:/o:redhat:enterprise_linux:9::baseos
  HOME_URL=https://www.redhat.com/
  DOCUMENTATION_URL=https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9
  BUG_REPORT_URL=https://issues.redhat.com/
  REDHAT_BUGZILLA_PRODUCT='Red Hat Enterprise Linux 9'
  REDHAT_BUGZILLA_PRODUCT_VERSION=9.6
  REDHAT_SUPPORT_PRODUCT='Red Hat Enterprise Linux'
  REDHAT_SUPPORT_PRODUCT_VERSION='9.6 Beta'
- osbuild-composer version (rpm -qi osbuild-composer):
  $ rpm -qa | grep osbuild
  osbuild-composer-debugsource-130-1.20250129git008b43e.el9.x86_64
  osbuild-composer-debuginfo-130-1.20250129git008b43e.el9.x86_64
  python3-osbuild-137-1.el9.noarch
  osbuild-selinux-137-1.el9.noarch
  osbuild-137-1.el9.noarch
  osbuild-depsolve-dnf-137-1.el9.noarch
  osbuild-composer-core-130-1.20250129git008b43e.el9.x86_64
  osbuild-luks2-137-1.el9.noarch
  osbuild-lvm2-137-1.el9.noarch
  osbuild-ostree-137-1.el9.noarch
  osbuild-composer-worker-130-1.20250129git008b43e.el9.x86_64
  osbuild-composer-130-1.20250129git008b43e.el9.x86_64
  osbuild-composer-tests-130-1.20250129git008b43e.el9.x86_64
  osbuild-composer-core-debuginfo-130-1.20250129git008b43e.el9.x86_64
  osbuild-composer-tests-debuginfo-130-1.20250129git008b43e.el9.x86_64
  osbuild-composer-worker-debuginfo-130-1.20250129git008b43e.el9.x86_64
To Reproduce
Steps to reproduce the behavior:
- Build edge-commit artifact in RHEL-9.6
- Build ostree image/commit upgrade artifact
- Apply the upgrade using rpm-ostree upgrade and reboot the system.
Expected behavior
The system is able to apply the upgrade commit.
Additional context
In this example the upgrade hash is:
$ curl http://192.168.100.1/repo/refs/heads/rhel/9/x86_64/edge
75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b
$ sudo virsh console osbuild-composer-ostree-test-4b6e4700-ce4b-48d7-8c25-811f4876b923
Connected to domain 'osbuild-composer-ostree-test-4b6e4700-ce4b-48d7-8c25-811f4876b923'
Escape character is ^] (Ctrl + ])
vm login: admin
Password:
Last login: Wed Jan 29 12:11:27 on ttyS0
[admin@vm ~]$ rpm-ostree status
State: idle
Deployments:
● edge-commit:rhel/9/x86_64/edge
Version: 9.6 (2025-01-29T11:34:37Z)
Commit: 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b
edge-commit:rhel/9/x86_64/edge
Version: 9.6 (2025-01-29T11:47:15Z)
Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b
The edge system detects there's an upgrade available, but rpm-ostree upgrade fails and the system rolls back to 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b:
[admin@vm ~]$ sudo rpm-ostree upgrade
1 metadata, 0 content objects fetched; 401 B transferred in 0 seconds; 0 bytes content written
Staging deployment... done
Freed: 7.8 kB (pkgcache branches: 1)
Added:
wget-1.21.1-8.el9_4.x86_64
Run "systemctl reboot" to start a reboot
[admin@vm ~]$ rpm-ostree status
State: idle
Deployments:
edge-commit:rhel/9/x86_64/edge
Version: 9.6 (2025-01-29T11:47:15Z)
Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b
Diff: 1 added
● edge-commit:rhel/9/x86_64/edge
Version: 9.6 (2025-01-29T11:34:37Z)
Commit: 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b
edge-commit:rhel/9/x86_64/edge
Version: 9.6 (2025-01-29T11:47:15Z)
Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b
Then the system fails to upgrade and rolls back to ostree:1
Red Hat Enterprise Linux 9.6 Beta (Plow)
Kernel 5.14.0-547.el9.x86_64 on an x86_64
vm login: admin
Password:
Last login: Wed Jan 29 12:20:35 on ttyS0
[admin@vm ~]$ rpm-ostree status
State: idle
Deployments:
● edge-commit:rhel/9/x86_64/edge
Version: 9.6 (2025-01-29T11:34:37Z)
Commit: 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b
edge-commit:rhel/9/x86_64/edge
Version: 9.6 (2025-01-29T11:47:15Z)
Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b
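To make the "did the upgrade stick?" check explicit, here is a minimal sketch that parses the text output of rpm-ostree status as pasted above and compares the booted commit with the expected upgrade hash. The helper names (parse_deployments, booted_commit, upgrade_applied) are made up for illustration, and the parser only assumes the layout shown in this issue.

```python
BOOTED_MARK = "●"  # rpm-ostree status marks the booted deployment with this bullet


def parse_deployments(status_text):
    """Parse the text output of `rpm-ostree status` into a list of dicts."""
    deployments = []
    for raw in status_text.splitlines():
        line = raw.strip()
        if not line or line.endswith(":"):
            continue  # blank line or section header such as "Deployments:"
        key, sep, value = line.partition(": ")
        if sep:
            # "Key: value" line; attach it to the most recent deployment, if any
            if deployments:
                deployments[-1][key.lower()] = value.strip()
            continue
        # Anything else is treated as a deployment header (remote:ref)
        deployments.append({
            "ref": line.lstrip(BOOTED_MARK).strip(),
            "booted": line.startswith(BOOTED_MARK),
        })
    return deployments


def booted_commit(status_text):
    """Return the commit hash of the booted deployment, or None."""
    for deployment in parse_deployments(status_text):
        if deployment["booted"]:
            return deployment.get("commit")
    return None


def upgrade_applied(status_text, expected_commit):
    """True only if the system is currently booted into expected_commit."""
    return booted_commit(status_text) == expected_commit
```

Fed the status output above and the upgrade hash 75d95ee9..., upgrade_applied returns False, which is the rollback described in this report.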
It seems the system is failing to remount the file system:
[admin@vm ~]$ sudo journalctl --no-pager --boot=-1 -xe | grep FAIL
Jan 29 12:25:35 localhost systemd[1]: systemd-remount-fs.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 12:25:38 localhost systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 12:25:38 localhost greenboot[736]: Script '02_watchdog.sh' FAILURE (exit code '4'). Continuing...
Jan 29 12:25:38 localhost greenboot[736]: Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Jan 29 12:25:38 localhost systemd[1]: greenboot-healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 12:25:38 localhost greenboot[801]: Boot Status is RED - Health Check FAILURE!
Jan 29 12:25:38 localhost greenboot-status[822]: Script '02_watchdog.sh' FAILURE (exit code '4'). Continuing...
Jan 29 12:25:38 localhost greenboot-status[822]: Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Jan 29 12:25:38 localhost greenboot-status[822]: Boot Status is RED - Health Check FAILURE!
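For triage, the grep output above can be summarized with a small sketch like the following. It only relies on the two message shapes visible in this log (systemd "Main process exited" lines and greenboot "Script ... FAILURE" lines); the function names are hypothetical.

```python
import re

# greenboot: Script '02_watchdog.sh' FAILURE (exit code '4'). Continuing...
SCRIPT_FAILURE = re.compile(
    r"Script '(?P<script>[^']+)' FAILURE \(exit code '(?P<code>\d+)'\)"
)
# systemd: foo.service: Main process exited, code=exited, status=1/FAILURE
UNIT_FAILURE = re.compile(
    r"(?P<unit>\S+\.service): Main process exited, code=exited, "
    r"status=(?P<status>\d+)/"
)


def failed_scripts(journal_text):
    """Map each failing greenboot script to its exit code."""
    return {
        m.group("script"): int(m.group("code"))
        for m in SCRIPT_FAILURE.finditer(journal_text)
    }


def failed_units(journal_text):
    """List (unit, exit status) pairs in the order they appear in the log."""
    return [
        (m.group("unit"), int(m.group("status")))
        for m in UNIT_FAILURE.finditer(journal_text)
    ]
```

On the lines above this yields three failed units (systemd-remount-fs.service, rpm-ostreed.service, greenboot-healthcheck.service) and two failed greenboot scripts, with 02_watchdog.sh exiting with code 4.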
I think this might be related to https://github.com/ostreedev/ostree/issues/3193
@runcom you were looking at systemd-remount-fs.service failures recently; maybe you have some insight
I’ll check it out, maybe composefs? ~~Although, what is greenboot exit code 4 too? @say-paul~~
This may be relevant https://github.com/ostreedev/ostree/issues/3193#issuecomment-2578264200
@mcattamoredhat do you know what exactly changed in the new snapshot? rpm-ostree? Just ostree? Can you print versions and also provide the content of /etc/fstab and /proc/cmdline
it seems that changes in https://github.com/osbuild/osbuild-composer/pull/4569/files are tests only, so how did you reproduce this @mcattamoredhat ? 🤔 I'm trying with 9.6 nightlies repo enabled, building a commit and upgrade (using a raw image to install)
since ostree.sh uses anaconda, this may be relevant https://bugzilla.redhat.com/show_bug.cgi?id=2332319 if we conclude that systemd-remount-fs.service is what's causing this issue (still not sure, and I say this because there's no bootc involved here, nor composefs enabled)
I think the remount service is a red herring tho - it seems it's greenboot that fails and triggers the rollback 🤔
This is what Mario has: the system is installed using Anaconda, but there's no bootc nor composefs (cc @cgwalters for the similar failure). We'll try w/o the / line in /etc/fstab. Also, it seems to me there's some sort of network failure in rpm-ostree/rpm-ostreed
[admin@vm ~]$ !2
cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/ostree/edge-commit-68c3fe04cf09b3082bbe68c4d771a4ec122ea9cba2c5c0ef850740a227691aaf/vmlinuz-5.14.0-547.el9.x86_64 net.ifnames=0 modprobe.blacklist=vc4 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M console=ttyS0,115200 root=UUID=0a3a800a-d0b7-4ab7-aaa3-a2c2af00bca0 rw ostree=/ostree/boot.1/edge-commit/68c3fe04cf09b3082bbe68c4d771a4ec122ea9cba2c5c0ef850740a227691aaf/1
[admin@vm ~]$ !3
cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Tue Feb 4 11:08:58 2025
#
# Accessible filesystems, by reference, are maintained under '/dev/disk/'.
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
#
# After editing this file, run 'systemctl daemon-reload' to update systemd
# units generated from this file.
#
UUID=0a3a800a-d0b7-4ab7-aaa3-a2c2af00bca0 / xfs defaults 0 0
UUID=d9ac8899-d8a4-43cc-93cf-290ed9892683 /boot xfs defaults 0 0
[admin@vm ~]$ sudo journalctl --boot=-1 --no-pager -eu systemd-remount-fs.service
Feb 04 11:22:49 localhost systemd-remount-fs[619]: mount: /: cannot remount /dev/vda2 read-write, is write-protected.
Feb 04 11:22:49 localhost systemd-remount-fs[617]: /usr/bin/mount for / exited with exit status 32.
Feb 04 11:22:49 localhost systemd[1]: systemd-remount-fs.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 11:22:49 localhost systemd[1]: systemd-remount-fs.service: Failed with result 'exit-code'.
Feb 04 11:22:49 localhost systemd[1]: Failed to start Remount Root and Kernel File Systems.
[admin@vm ~]$
[admin@vm ~]$ sudo journalctl --no-pager --boot=-1 -eu rpm-ostreed.service
Feb 04 11:22:52 localhost systemd[1]: Starting rpm-ostree System Management Daemon...
Feb 04 11:22:52 localhost rpm-ostree[772]: error: Error receiving data: Connection reset by peer
Feb 04 11:22:52 localhost systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 11:22:52 localhost systemd[1]: rpm-ostreed.service: Failed with result 'exit-code'.
Feb 04 11:22:52 localhost systemd[1]: Failed to start rpm-ostree System Management Daemon.
The watchdog check is a required one, so its failure means we roll back too (the update platforms check is only a wanted one, so it shouldn't cause the rollback).
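The required vs. wanted distinction can be sketched as below. This is not greenboot's actual code, only an illustration of the rule as described in this comment: a failing required check makes the boot unhealthy, a failing wanted check is merely reported. Which script belongs to which group is taken from the comment above, not from inspecting the image.

```python
def boot_is_healthy(results, required):
    """Return True when every required check exited with 0.

    results  -- dict mapping script name to exit code
    required -- set of script names treated as required checks
    Checks outside `required` (the "wanted" ones) never affect the result.
    """
    return all(code == 0 for name, code in results.items() if name in required)


# Exit codes as seen in the journal of the failed boot
results = {"02_watchdog.sh": 4, "01_update_platforms_check.sh": 1}
# Assumption from this thread: only the watchdog check is a required one
required = {"02_watchdog.sh"}
print(boot_is_healthy(results, required))
```

With the exit codes from the failed boot this evaluates to False because of 02_watchdog.sh alone; the failure of 01_update_platforms_check.sh does not change the outcome.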
After commenting out the / line in /etc/fstab, the system still rolls back:
[admin@vm ~]$ cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Tue Feb 4 11:08:58 2025
#
# Accessible filesystems, by reference, are maintained under '/dev/disk/'.
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
#
# After editing this file, run 'systemctl daemon-reload' to update systemd
# units generated from this file.
#
# UUID=0a3a800a-d0b7-4ab7-aaa3-a2c2af00bca0 / xfs defaults 0 0
UUID=d9ac8899-d8a4-43cc-93cf-290ed9892683 /boot xfs defaults 0 0
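Since systemd-remount-fs.service applies the mount options configured in /etc/fstab to the root file system, it helps to confirm whether an active / entry exists at all. A minimal sketch, with hypothetical helper names, that reads fstab content and ignores commented lines:

```python
def fstab_entries(fstab_text):
    """Return the active (non-comment) entries of an fstab as dicts."""
    entries = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out entry
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        entries.append({
            "spec": fields[0],
            "mountpoint": fields[1],
            "fstype": fields[2],
            "options": fields[3].split(","),
        })
    return entries


def root_entry(fstab_text):
    """Return the entry mounted at /, or None if there is no active one."""
    for entry in fstab_entries(fstab_text):
        if entry["mountpoint"] == "/":
            return entry
    return None
```

For the original Anaconda-generated file this returns the xfs entry with options ["defaults"]; for the edited file above it returns None, which confirms the root line is really inactive in this second attempt.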
Upgrade commit is 1b5ba91f75c7ab115882be3f83c3668f2c88b2f7850e92fc2355a361839f594d
[admin@vm ~]$ sudo rpm-ostree status
State: idle
Deployments:
● edge-commit:rhel/9/x86_64/edge
Version: 9.6 (2025-02-04T11:06:00Z)
Commit: f63986e3544dffa98c3b97de358b26de4499f7993096c49efbdd11041d347a13
edge-commit:rhel/9/x86_64/edge
Version: 9.6 (2025-02-04T11:17:59Z)
Commit: 1b5ba91f75c7ab115882be3f83c3668f2c88b2f7850e92fc2355a361839f594d
> Our CI has detected RHEL-9.6 edge-commit fails after the changes introduced by PR https://github.com/osbuild/osbuild-composer/pull/4569
I am not an expert in this repo but that would seem to be a surprising cause.
> Feb 04 11:22:49 localhost systemd-remount-fs[619]: mount: /: cannot remount /dev/vda2 read-write, is write-protected.
"is write-protected" here means we got EROFS from mount which usually means the physical block device is read-only. Is the CI system here only providing a read-only virtio device? I'd look for more logs related to that.
right, although I just think that the rollback isn't related - greenboot doesn't check for that so whatever happens it's greenboot @say-paul
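To rule the read-only block device theory in or out, the kernel's read-only flag for the device can be read from sysfs. A minimal sketch, assuming the usual /sys/class/block/<name>/ro attribute (it contains 1 for a read-only device and 0 otherwise); the function name and the sysfs_root parameter are only there for illustration and testing.

```python
from pathlib import Path


def block_device_is_read_only(device_name, sysfs_root="/sys/class/block"):
    """Return True/False from the sysfs `ro` flag, or None if it is missing.

    device_name is the kernel name without /dev/, for example "vda2".
    """
    ro_file = Path(sysfs_root) / device_name / "ro"
    try:
        return ro_file.read_text().strip() == "1"
    except FileNotFoundError:
        return None  # no such device (or no ro attribute) under sysfs_root
```

On the affected VM, block_device_is_read_only("vda2") returning False would suggest the EROFS is not coming from the virtio disk itself.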
so it seems that dbus can't start for some reason, which leaves rpm-ostreed non-functional:
Feb 04 11:19:56 localhost systemd[1]: Starting D-Bus System Message Bus...
Feb 04 11:19:56 localhost systemd[717]: dbus-broker.service: Failed to set up mount namespacing: /run/systemd/unit-root/dev: Read-only file system
Feb 04 11:19:56 localhost systemd[717]: dbus-broker.service: Failed at step NAMESPACE spawning /usr/bin/dbus-broker-launch: Read-only file system
Feb 04 11:19:56 localhost systemd[1]: dbus-broker.service: Main process exited, code=exited, status=226/NAMESPACE
Feb 04 11:19:56 localhost systemd[1]: dbus-broker.service: Failed with result 'exit-code'.
Feb 04 11:19:58 localhost systemd[1]: Listening on D-Bus System Message Bus Socket.
Feb 04 11:19:58 localhost systemd[1]: Starting rpm-ostree System Management Daemon...
Feb 04 11:19:58 localhost systemd[1]: dbus-broker.service: Start request repeated too quickly.
Feb 04 11:19:58 localhost systemd[1]: dbus-broker.service: Failed with result 'exit-code'.
Feb 04 11:19:58 localhost systemd[1]: Failed to start D-Bus System Message Bus.
Feb 04 11:19:58 localhost systemd[1]: dbus.socket: Failed with result 'service-start-limit-hit'.
Feb 04 11:19:58 localhost rpm-ostree[776]: error: Error receiving data: Connection reset by peer
Feb 04 11:19:58 localhost systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 11:19:58 localhost systemd[1]: rpm-ostreed.service: Failed with result 'exit-code'.
Feb 04 11:19:58 localhost systemd[1]: Failed to start rpm-ostree System Management Daemon.
Feb 04 11:19:58 localhost 02_watchdog.sh[775]: Job for rpm-ostreed.service failed because the control process exited with error code.
Feb 04 11:19:58 localhost 02_watchdog.sh[775]: See "systemctl status rpm-ostreed.service" and "journalctl -xeu rpm-ostreed.service" for details.
Feb 04 11:19:58 localhost 02_watchdog.sh[772]: parse error: Invalid numeric literal at line 1, column 3
Feb 04 11:19:58 localhost 02_watchdog.sh[771]: error: Loading sysroot: exit status: 1
Feb 04 11:19:58 localhost greenboot[738]: Script '02_watchdog.sh' FAILURE (exit code '4'). Continuing...
That looks like a symptom of missing /tmp as a tmpfs...which is part of the reference base image: https://gitlab.com/fedora/bootc/base-images/-/blame/main/tier-0/basic-fixes.yaml?ref_type=heads#L4
uhm, but this is not bootc 😄 I'm seeing a bunch of tmpfs issues indeed
Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.X11-unix": Read-only file system
Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.ICE-unix": Read-only file system
Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.XIM-unix": Read-only file system
Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.font-unix": Read-only file system
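A quick way to check the /tmp theory is to look at how /tmp is mounted on the failing boot. The sketch below parses text in the /proc/mounts format (device, mount point, type, options, ...); the function names are hypothetical and the sample data in the usage note is invented, not taken from the affected system.

```python
def mount_for(path, mounts_text):
    """Return (fstype, options) of the last mount listed for path, or None."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == path:
            # Later entries shadow earlier ones for the same mount point
            found = (fields[2], fields[3].split(","))
    return found


def tmp_is_writable_tmpfs(mounts_text):
    """True if /tmp is its own tmpfs mount and is mounted read-write."""
    entry = mount_for("/tmp", mounts_text)
    return entry is not None and entry[0] == "tmpfs" and "rw" in entry[1]
```

Usage would be along the lines of tmp_is_writable_tmpfs(open("/proc/mounts").read()). If /tmp has no mount entry of its own, it lives on the root file system, and a read-only root would explain the systemd-tmpfiles errors above.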
@thozza @achilleas-k do you know more here from the top of your heads? 👼
maybe slightly related as a change in ostree https://github.com/ostreedev/ostree/pull/3366 ?
This seems to be the case for us now https://github.com/ostreedev/ostree/pull/3366#issuecomment-2593935154 @cgwalters
https://github.com/ostreedev/ostree/pull/3353#issuecomment-2580590517
so ostree-2024.10 may be breaking for us as we upgrade from .9 to that w/o having a prepare-root.conf w/ composefs disabled.
Maybe upgrading straight to 2025.1 is gonna fix it?
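To see whether a given tree ships a prepare-root.conf that disables composefs, something like the following could be run against the file content. This is a sketch under assumptions: the [composefs] section and its enabled key follow the ostree-prepare-root documentation as I understand it, and the function name is made up. It returns None when the setting is absent rather than guessing at the default, since the default is exactly what seems to differ between ostree versions here.

```python
import configparser


def composefs_setting(conf_text):
    """Return the value of enabled in the [composefs] section, or None."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback covers both a missing section and a missing key
    return parser.get("composefs", "enabled", fallback=None)


# Hypothetical content of a prepare-root.conf that disables composefs
print(composefs_setting("[composefs]\nenabled = no\n"))
```

An empty or missing file yields None, meaning the behavior is left to whatever default the installed ostree version uses.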
Ah yes, sorry. We withdrew 2024.10 from Fedora bodhi, but not C{9,10}S as there's no real "undo" button there. In any case 2025.1 is already queued to ship in 9.6 and beyond.
Ugh, yeah 2025.1 is stuck in QE, will try to get that fixed
so we need the snapshots here to at least target 20250201 - that snapshot contains ostree-2025.1 cc @thozza
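When checking which ostree version a snapshot carries, note that these version strings must be compared as (year, release) integer pairs: as plain numbers, 2024.10 would sort below 2024.9. A small sketch with hypothetical helper names, which expects the bare upstream version (no RPM release suffix):

```python
def parse_ostree_version(version):
    """Split a version like "2024.10" into a (year, release) tuple of ints."""
    year, release = version.split(".", 1)
    return int(year), int(release)


def is_at_least(version, minimum):
    """True if version is the same as or newer than minimum."""
    return parse_ostree_version(version) >= parse_ostree_version(minimum)


# 2025.1 is the version this thread expects to contain the fix
print(is_at_least("2024.10", "2025.1"))
print(is_at_least("2025.1", "2025.1"))
```

The first call reports that 2024.10 does not satisfy the 2025.1 minimum, the second that 2025.1 does.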
We have snapshots from 20250201, the PR is still open though: https://github.com/osbuild/osbuild-composer/pull/4591
Quick look shows me that the rpm-ostree version there is 2025.4 (for RHEL 9.6).
It's ostree, not rpm-ostree at issue here
Right, my mistake. In that case it's 2025.1.
After the snapshot update to 20250201, the edge-commit test in RHEL-9.6 is still failing: https://artifacts.osci.redhat.com/testing-farm/8876a623-b410-499c-affd-727dbb89054f/work-edge-x86-commitqqviedpt/tmt/plans/edge-test/edge-x86-commit/execute/data/guest/default-0/tmt/tests/edge-test-1/output.txt
After reproducing this failure locally, it seems anaconda fails to install the bootloader:
Installing boot loader
..
Performing post-installation setup tasks
================================================================================
================================================================================
Question
The following error occurred while installing the boot loader. The system will
not be bootable. Would you like to ignore this and continue with installation?
failed to write boot loader configuration
An unknown error has occured, look at the /tmp/anaconda-tb* file(s) for more details
===============================================================================
ne 311, in start
item.start()
File "/usr/lib64/python3.9/site-packages/pyanaconda/installation_tasks.py", line 311, in start
item.start()
File "/usr/lib64/python3.9/site-packages/pyanaconda/installation.py", line 399, in run_installation
queue.start()
File "/usr/lib64/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib64/python3.9/site-packages/pyanaconda/threading.py", line 275, in run
threading.Thread.run(self)
pyanaconda.modules.common.errors.installation.BootloaderInstallationError: failed to write boot loader configuration
What do you want to do now?
1) Report Bug
2) Debug
3) Run shell
4) Quit
Do you have bootupd in your tree? See https://pagure.io/workstation-ostree-config/pull-request/600# that explicitly excludes it. If the Edge-9.6 setup isn't ready for bootupd then at the current time the package needs to be excluded.
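One way to answer the bootupd question is to scan a package listing of the deployed system (for example the output of rpm -qa) for that package name. A minimal sketch with made-up helper names; it assumes each line has the usual name-version-release.arch form, and the bootupd version in the usage note is invented for illustration.

```python
def package_name(nvra):
    """Extract the name from a name-version-release.arch string.

    The name is everything before the last two hyphens, so names that
    contain hyphens themselves (osbuild-composer-core) are handled.
    """
    return nvra.rsplit("-", 2)[0]


def has_package(rpm_qa_output, name):
    """True if a package with exactly this name appears in the listing."""
    return any(
        package_name(line.strip()) == name
        for line in rpm_qa_output.splitlines()
        if line.strip()
    )
```

For example, has_package("bootupd-0.0.0-1.el9.x86_64\nosbuild-137-1.el9.noarch", "bootupd") is True, while a listing without a bootupd line gives False.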
(All this pain will go away when we consolidate on a reference, tested base image defined as a container image going forward)
Right, we actually did that with fedora before we supported it afaict https://github.com/osbuild/images/pull/918
PR for rhel https://github.com/osbuild/images/pull/1195 PR here for integration https://github.com/osbuild/osbuild-composer/pull/4597
Excluding bootupd with https://github.com/osbuild/images/pull/1195 fixes the bootloader issue.
Nevertheless, edge-commit in RHEL-9.6 is still failing, we will continue debugging.
Seems like we're now hitting a greenboot issue somehow - @say-paul is on it (but excluding bootupd does fix the anaconda failure)
we have a bug in greenboot https://github.com/fedora-iot/greenboot/blob/main/usr/libexec/greenboot/greenboot-rpm-ostree-grub2-check-fallback#L8-L15 - fix in https://github.com/fedora-iot/greenboot/pull/199