osbuild-composer rpm-ostree upgrade fails in edge-commit RHEL-9.6

Describe the bug

Our CI has detected that the RHEL-9.6 edge-commit test fails after the changes introduced by PR https://github.com/osbuild/osbuild-composer/pull/4569: rpm-ostree upgrade fails to upgrade the system. After the ostree image/commit upgrade is built, the edge system detects that an upgrade is available, but after rpm-ostree upgrade and a reboot, the system rolls back to the previous deployment and the update is not applied.

Environment

OS version (/etc/os-release and /etc/redhat-release):

source /etc/os-release
NAME='Red Hat Enterprise Linux'
VERSION='9.6 (Plow)'
ID=rhel
ID_LIKE=fedora
VERSION_ID=9.6
PLATFORM_ID=platform:el9
PRETTY_NAME='Red Hat Enterprise Linux 9.6 Beta (Plow)'
ANSI_COLOR='0;31'
LOGO=fedora-logo-icon
CPE_NAME=cpe:/o:redhat:enterprise_linux:9::baseos
HOME_URL=https://www.redhat.com/
DOCUMENTATION_URL=https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9
BUG_REPORT_URL=https://issues.redhat.com/
REDHAT_BUGZILLA_PRODUCT='Red Hat Enterprise Linux 9'
REDHAT_BUGZILLA_PRODUCT_VERSION=9.6
REDHAT_SUPPORT_PRODUCT='Red Hat Enterprise Linux'
REDHAT_SUPPORT_PRODUCT_VERSION='9.6 Beta'

osbuild-composer version (rpm -qi osbuild-composer):

$ rpm -qa | grep osbuild
osbuild-composer-debugsource-130-1.20250129git008b43e.el9.x86_64
osbuild-composer-debuginfo-130-1.20250129git008b43e.el9.x86_64
python3-osbuild-137-1.el9.noarch
osbuild-selinux-137-1.el9.noarch
osbuild-137-1.el9.noarch
osbuild-depsolve-dnf-137-1.el9.noarch
osbuild-composer-core-130-1.20250129git008b43e.el9.x86_64
osbuild-luks2-137-1.el9.noarch
osbuild-lvm2-137-1.el9.noarch
osbuild-ostree-137-1.el9.noarch
osbuild-composer-worker-130-1.20250129git008b43e.el9.x86_64
osbuild-composer-130-1.20250129git008b43e.el9.x86_64
osbuild-composer-tests-130-1.20250129git008b43e.el9.x86_64
osbuild-composer-core-debuginfo-130-1.20250129git008b43e.el9.x86_64
osbuild-composer-tests-debuginfo-130-1.20250129git008b43e.el9.x86_64
osbuild-composer-worker-debuginfo-130-1.20250129git008b43e.el9.x86_64

To Reproduce

Steps to reproduce the behavior:

Build edge-commit artifact in RHEL-9.6
Build ostree image/commit upgrade artifact
Apply the upgrade using rpm-ostree upgrade and reboot the system.

Expected behavior

The system is able to apply the upgrade commit.

Additional context

In this example the upgrade hash is:

$ curl http://192.168.100.1/repo/refs/heads/rhel/9/x86_64/edge
75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b

$ sudo virsh console osbuild-composer-ostree-test-4b6e4700-ce4b-48d7-8c25-811f4876b923
Connected to domain 'osbuild-composer-ostree-test-4b6e4700-ce4b-48d7-8c25-811f4876b923'
Escape character is ^] (Ctrl + ])

vm login: admin
Password: 
Last login: Wed Jan 29 12:11:27 on ttyS0
[admin@vm ~]$ rpm-ostree status
State: idle
Deployments:
● edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:34:37Z)
                   Commit: 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b

  edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:47:15Z)
                   Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b

The edge system detects that an upgrade is available, but rpm-ostree upgrade fails and the system rolls back to 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b:

[admin@vm ~]$ sudo rpm-ostree upgrade
1 metadata, 0 content objects fetched; 401 B transferred in 0 seconds; 0 bytes content written
Staging deployment... done
Freed: 7.8 kB (pkgcache branches: 1)
Added:
  wget-1.21.1-8.el9_4.x86_64
Run "systemctl reboot" to start a reboot
[admin@vm ~]$ rpm-ostree status
State: idle
Deployments:
  edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:47:15Z)
                   Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b
                     Diff: 1 added

● edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:34:37Z)
                   Commit: 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b

  edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:47:15Z)
                   Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b

After the reboot, the system fails to upgrade and rolls back to ostree:1:

Red Hat Enterprise Linux 9.6 Beta (Plow)
Kernel 5.14.0-547.el9.x86_64 on an x86_64

vm login: admin
Password: 
Last login: Wed Jan 29 12:20:35 on ttyS0
[admin@vm ~]$ rpm-ostree status
State: idle
Deployments:
● edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:34:37Z)
                   Commit: 583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b

  edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-01-29T11:47:15Z)
                   Commit: 75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b
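
The rollback check above can be made scriptable by comparing the booted deployment's checksum with the commit published at the ref. The sketch below is only an illustration: it assumes the JSON shape printed by `rpm-ostree status --json` (a top-level `deployments` list whose entries carry `booted` and `checksum` keys), and the helper name `upgrade_applied` is hypothetical. The sample data mirrors the two deployments shown in this report.

```python
import json

def upgrade_applied(status_json: str, expected_commit: str) -> bool:
    """Return True if the booted deployment's checksum equals expected_commit."""
    status = json.loads(status_json)
    for deployment in status.get("deployments", []):
        if deployment.get("booted"):
            return deployment.get("checksum") == expected_commit
    # No booted deployment found in the status output.
    return False

# Commits taken from the status output in this report.
OLD_COMMIT = "583f1f500bb5ee3f858409203df2f1883e20cb4cee6a6a4149caafa197a1c95b"
UPGRADE_COMMIT = "75d95ee9dfd0f1e2ddf2e622293ba15ac5609077cd69271ee463b21954aeb31b"

# Assumed JSON shape: the old commit is booted, the upgrade is the other entry.
sample = json.dumps({
    "deployments": [
        {"booted": True, "checksum": OLD_COMMIT},
        {"booted": False, "checksum": UPGRADE_COMMIT},
    ]
})

print(upgrade_applied(sample, UPGRADE_COMMIT))  # False: the system rolled back
```

On a real system the input would come from running `rpm-ostree status --json` and the expected commit from the `curl` of the ref shown earlier.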

It seems the system is failing to remount the file system:

[admin@vm ~]$ sudo journalctl --no-pager --boot=-1 -xe | grep FAIL
Jan 29 12:25:35 localhost systemd[1]: systemd-remount-fs.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 12:25:38 localhost systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 12:25:38 localhost greenboot[736]: Script '02_watchdog.sh' FAILURE (exit code '4'). Continuing...
Jan 29 12:25:38 localhost greenboot[736]: Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Jan 29 12:25:38 localhost systemd[1]: greenboot-healthcheck.service: Main process exited, code=exited, status=1/FAILURE
Jan 29 12:25:38 localhost greenboot[801]: Boot Status is RED - Health Check FAILURE!
Jan 29 12:25:38 localhost greenboot-status[822]: Script '02_watchdog.sh' FAILURE (exit code '4'). Continuing...
Jan 29 12:25:38 localhost greenboot-status[822]: Script '01_update_platforms_check.sh' FAILURE (exit code '1'). Continuing...
Jan 29 12:25:38 localhost greenboot-status[822]: Boot Status is RED - Health Check FAILURE!

Feb 03 '25 16:02 mcattamoredhat

I think this might be related to https://github.com/ostreedev/ostree/issues/3193

@runcom you were looking at systemd-remount-fs.service failures recently; maybe you have some insight

Feb 03 '25 22:02 miabbott

I’ll check it out, maybe composefs? ~~Although, what is greenboot exit code 4 too? @say-paul~~

Feb 03 '25 22:02 runcom

This may be relevant https://github.com/ostreedev/ostree/issues/3193#issuecomment-2578264200

@mcattamoredhat do you know what exactly changed in the new snapshot? rpm-ostree? Just ostree? Can you print versions and also provide the content of /etc/fstab and /proc/cmdline

Feb 03 '25 22:02 runcom

it seems that changes in https://github.com/osbuild/osbuild-composer/pull/4569/files are tests only, so how did you reproduce this @mcattamoredhat ? 🤔 I'm trying with 9.6 nightlies repo enabled, building a commit and upgrade (using a raw image to install)

Feb 04 '25 10:02 runcom

since ostree.sh uses anaconda, this may be relevant https://bugzilla.redhat.com/show_bug.cgi?id=2332319 if we confirm it's systemd-remount-fs.service that's causing this issue (still not sure, and I say this because there's no bootc involved here...nor composefs enabled)

Feb 04 '25 10:02 runcom

I think the remount service is a red herring tho - it seems it's greenboot that fails and triggers the rollback 🤔

Feb 04 '25 11:02 runcom

This is what Mario has, the system is installed using Anaconda, but there's no bootc nor composefs (cc @cgwalters for the similar failure) - we'll try w/o the / line in /etc/fstab -- also, it seems there's some sort of network failure to me in rpm-ostree/rpm-ostreed

[admin@vm ~]$ !2
cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/ostree/edge-commit-68c3fe04cf09b3082bbe68c4d771a4ec122ea9cba2c5c0ef850740a227691aaf/vmlinuz-5.14.0-547.el9.x86_64 net.ifnames=0 modprobe.blacklist=vc4 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M console=ttyS0,115200 root=UUID=0a3a800a-d0b7-4ab7-aaa3-a2c2af00bca0 rw ostree=/ostree/boot.1/edge-commit/68c3fe04cf09b3082bbe68c4d771a4ec122ea9cba2c5c0ef850740a227691aaf/1
[admin@vm ~]$ !3
cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Tue Feb  4 11:08:58 2025
#
# Accessible filesystems, by reference, are maintained under '/dev/disk/'.
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
#
# After editing this file, run 'systemctl daemon-reload' to update systemd
# units generated from this file.
#
UUID=0a3a800a-d0b7-4ab7-aaa3-a2c2af00bca0 /                       xfs     defaults        0 0
UUID=d9ac8899-d8a4-43cc-93cf-290ed9892683 /boot                   xfs     defaults        0 0

[admin@vm ~]$ sudo journalctl --boot=-1 --no-pager -eu systemd-remount-fs.service
Feb 04 11:22:49 localhost systemd-remount-fs[619]: mount: /: cannot remount /dev/vda2 read-write, is write-protected.
Feb 04 11:22:49 localhost systemd-remount-fs[617]: /usr/bin/mount for / exited with exit status 32.
Feb 04 11:22:49 localhost systemd[1]: systemd-remount-fs.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 11:22:49 localhost systemd[1]: systemd-remount-fs.service: Failed with result 'exit-code'.
Feb 04 11:22:49 localhost systemd[1]: Failed to start Remount Root and Kernel File Systems.
[admin@vm ~]$ 

[admin@vm ~]$ sudo journalctl --no-pager --boot=-1 -eu rpm-ostreed.service
Feb 04 11:22:52 localhost systemd[1]: Starting rpm-ostree System Management Daemon...
Feb 04 11:22:52 localhost rpm-ostree[772]: error: Error receiving data: Connection reset by peer
Feb 04 11:22:52 localhost systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 11:22:52 localhost systemd[1]: rpm-ostreed.service: Failed with result 'exit-code'.
Feb 04 11:22:52 localhost systemd[1]: Failed to start rpm-ostree System Management Daemon.

Feb 04 '25 12:02 runcom

the watchdog check is a required one so that failing means we rollback too (update platforms checks instead is just wanted so shouldn't cause the rollback)
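
The required/wanted distinction described above can be modeled in a few lines. This is a simplified sketch of the decision rule only, not greenboot's actual implementation; the `CheckResult` type and `boot_status` function are hypothetical names, and the exit codes are the ones from the journal output in this thread.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CheckResult:
    name: str
    required: bool   # True for a required check, False for a wanted check
    exit_code: int

def boot_status(results: List[CheckResult]) -> str:
    """Return "RED" if any required check failed, otherwise "GREEN".

    Failing wanted checks are tolerated and do not change the result.
    """
    for result in results:
        if result.required and result.exit_code != 0:
            return "RED"
    return "GREEN"

# Results as reported in the journal: the watchdog check is required,
# the update platforms check is only wanted.
results = [
    CheckResult("02_watchdog.sh", required=True, exit_code=4),
    CheckResult("01_update_platforms_check.sh", required=False, exit_code=1),
]

print(boot_status(results))  # RED
```

Under this model the failing `01_update_platforms_check.sh` alone would leave the boot GREEN; it is the required `02_watchdog.sh` failure that turns it RED and leads to the rollback.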

Feb 04 '25 12:02 runcom

After commenting out the / line in /etc/fstab, the system still rolls back:

[admin@vm ~]$ cat /etc/fstab 

#
# /etc/fstab
# Created by anaconda on Tue Feb  4 11:08:58 2025
#
# Accessible filesystems, by reference, are maintained under '/dev/disk/'.
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info.
#
# After editing this file, run 'systemctl daemon-reload' to update systemd
# units generated from this file.
#
# UUID=0a3a800a-d0b7-4ab7-aaa3-a2c2af00bca0 /                       xfs     defaults        0 0
UUID=d9ac8899-d8a4-43cc-93cf-290ed9892683 /boot                   xfs     defaults        0 0

Upgrade commit is 1b5ba91f75c7ab115882be3f83c3668f2c88b2f7850e92fc2355a361839f594d

[admin@vm ~]$ sudo rpm-ostree status
State: idle
Deployments:
● edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-02-04T11:06:00Z)
                   Commit: f63986e3544dffa98c3b97de358b26de4499f7993096c49efbdd11041d347a13

  edge-commit:rhel/9/x86_64/edge
                  Version: 9.6 (2025-02-04T11:17:59Z)
                   Commit: 1b5ba91f75c7ab115882be3f83c3668f2c88b2f7850e92fc2355a361839f594d

Feb 04 '25 12:02 mcattamoredhat

> Our CI has detected RHEL-9.6 edge-commit fails after the changes introduced by PR https://github.com/osbuild/osbuild-composer/pull/4569

I am not an expert in this repo but that would seem to be a surprising cause.

> Feb 04 11:22:49 localhost systemd-remount-fs[619]: mount: /: cannot remount /dev/vda2 read-write, is write-protected.

"is write-protected" here means we got EROFS from mount which usually means the physical block device is read-only. Is the CI system here only providing a read-only virtio device? I'd look for more logs related to that.

Feb 04 '25 13:02 cgwalters
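
To follow up on the read-only block device question raised above, the kernel's view of a device can be read from sysfs. A minimal sketch, assuming the standard Linux `ro` attribute under `/sys/class/block/<name>/` (it contains `1` for a read-only device and `0` otherwise); the function name is hypothetical and the `sysfs_root` parameter exists only so the logic can be exercised without a real device.

```python
from pathlib import Path

def block_device_is_read_only(name: str, sysfs_root: str = "/sys/class/block") -> bool:
    """Return True if the kernel reports the block device as read-only.

    `name` is the kernel device name without the /dev/ prefix, e.g. "vda2".
    Raises FileNotFoundError if the device does not exist.
    """
    ro_file = Path(sysfs_root) / name / "ro"
    return ro_file.read_text().strip() == "1"

# Example (on the affected VM): block_device_is_read_only("vda2")
```

If this returns False for `vda2` while the remount still fails, the device itself is writable and the cause lies elsewhere in the mount setup.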

right, although I just think that the rollback isn't related - greenboot doesn't check for that so whatever happens it's greenboot @say-paul

Feb 04 '25 13:02 runcom

so it seems that dbus can't start for some reason, which makes rpm-ostreed non-functional:

Feb 04 11:19:56 localhost systemd[1]: Starting D-Bus System Message Bus...
Feb 04 11:19:56 localhost systemd[717]: dbus-broker.service: Failed to set up mount namespacing: /run/systemd/unit-root/dev: Read-only file system
Feb 04 11:19:56 localhost systemd[717]: dbus-broker.service: Failed at step NAMESPACE spawning /usr/bin/dbus-broker-launch: Read-only file system
Feb 04 11:19:56 localhost systemd[1]: dbus-broker.service: Main process exited, code=exited, status=226/NAMESPACE
Feb 04 11:19:56 localhost systemd[1]: dbus-broker.service: Failed with result 'exit-code'.

Feb 04 11:19:58 localhost systemd[1]: Listening on D-Bus System Message Bus Socket.
Feb 04 11:19:58 localhost systemd[1]: Starting rpm-ostree System Management Daemon...
Feb 04 11:19:58 localhost systemd[1]: dbus-broker.service: Start request repeated too quickly.
Feb 04 11:19:58 localhost systemd[1]: dbus-broker.service: Failed with result 'exit-code'.
Feb 04 11:19:58 localhost systemd[1]: Failed to start D-Bus System Message Bus.
Feb 04 11:19:58 localhost systemd[1]: dbus.socket: Failed with result 'service-start-limit-hit'.
Feb 04 11:19:58 localhost rpm-ostree[776]: error: Error receiving data: Connection reset by peer
Feb 04 11:19:58 localhost systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE
Feb 04 11:19:58 localhost systemd[1]: rpm-ostreed.service: Failed with result 'exit-code'.
Feb 04 11:19:58 localhost systemd[1]: Failed to start rpm-ostree System Management Daemon.
Feb 04 11:19:58 localhost 02_watchdog.sh[775]: Job for rpm-ostreed.service failed because the control process exited with error code.
Feb 04 11:19:58 localhost 02_watchdog.sh[775]: See "systemctl status rpm-ostreed.service" and "journalctl -xeu rpm-ostreed.service" for details.
Feb 04 11:19:58 localhost 02_watchdog.sh[772]: parse error: Invalid numeric literal at line 1, column 3
Feb 04 11:19:58 localhost 02_watchdog.sh[771]: error: Loading sysroot: exit status: 1
Feb 04 11:19:58 localhost greenboot[738]: Script '02_watchdog.sh' FAILURE (exit code '4'). Continuing...

Feb 04 '25 14:02 runcom

That looks like a symptom of missing /tmp as a tmpfs...which is part of the reference base image: https://gitlab.com/fedora/bootc/base-images/-/blame/main/tier-0/basic-fixes.yaml?ref_type=heads#L4

Feb 04 '25 14:02 cgwalters

> That looks like a symptom of missing /tmp as a tmpfs...which is part of the reference base image: https://gitlab.com/fedora/bootc/base-images/-/blame/main/tier-0/basic-fixes.yaml?ref_type=heads#L4

uhm, but this is not bootc 😄 I'm seeing a bunch of tmpfs issues indeed

Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.X11-unix": Read-only file system
Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.ICE-unix": Read-only file system
Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.XIM-unix": Read-only file system
Feb 04 11:19:55 localhost systemd-tmpfiles[673]: Failed to create directory or subvolume "/tmp/.font-unix": Read-only file system

Feb 04 '25 14:02 runcom

@thozza @achilleas-k do you know more here off the top of your heads? 👼

Feb 04 '25 14:02 runcom

maybe slightly related as a change in ostree https://github.com/ostreedev/ostree/pull/3366 ?

Feb 04 '25 15:02 runcom

This seems to be the case for us now https://github.com/ostreedev/ostree/pull/3366#issuecomment-2593935154 @cgwalters

Feb 04 '25 15:02 runcom

https://github.com/ostreedev/ostree/pull/3353#issuecomment-2580590517

so ostree-2024.10 may be breaking for us as we upgrade from .9 to that w/o having a prepare-root.conf w/ composefs disabled. Maybe upgrading straight to 2025.1 is gonna fix it?
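
When scripting a check for the affected ostree release, note that versions are `year.release` pairs, so comparing them as strings would sort 2024.10 before 2024.9. A minimal sketch, assuming a bare version string without the RPM release suffix; the names `parse_ostree_version`, `WITHDRAWN_RELEASE`, and `is_withdrawn` are hypothetical, and only the 2024.10 release discussed in this thread is flagged.

```python
from typing import Tuple

def parse_ostree_version(version: str) -> Tuple[int, int]:
    """Split a version like "2024.10" into a (year, release) integer pair."""
    year, release = version.split(".")
    return int(year), int(release)

# The release this thread identifies as problematic (assumption: exact match only).
WITHDRAWN_RELEASE = (2024, 10)

def is_withdrawn(version: str) -> bool:
    """Return True if the version is the release flagged in this thread."""
    return parse_ostree_version(version) == WITHDRAWN_RELEASE

versions = ["2025.1", "2024.10", "2024.9"]

# Numeric ordering puts 2024.9 before 2024.10, unlike plain string sorting.
print(sorted(versions, key=parse_ostree_version))  # ['2024.9', '2024.10', '2025.1']
print([v for v in versions if is_withdrawn(v)])    # ['2024.10']
```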

Feb 04 '25 15:02 runcom

Ah yes, sorry. We withdrew 2024.10 from Fedora bodhi, but not C{9,10}S as there's no real "undo" button there. In any case 2025.1 is already queued to ship in 9.6 and beyond.

Feb 04 '25 16:02 cgwalters

Ugh, yeah 2025.1 is stuck in QE, will try to get that fixed

Feb 04 '25 17:02 cgwalters

so we need the snapshots here to at least target 20250201 - that snapshot contains ostree-2025.1 cc @thozza

Feb 04 '25 17:02 runcom

We have snapshots from 20250201, the PR is still open though: https://github.com/osbuild/osbuild-composer/pull/4591

Quick look shows me that the rpm-ostree version there is 2025.4 (for RHEL 9.6).

Feb 04 '25 18:02 achilleas-k

> Quick look shows me that the rpm-ostree version there is 2025.4 (for RHEL 9.6).

It's ostree, not rpm-ostree at issue here

Feb 04 '25 22:02 cgwalters

Right, my mistake. In that case it's 2025.1.

Feb 05 '25 00:02 achilleas-k

After the snapshot update to 20250201, the edge-commit test in RHEL-9.6 is still failing: https://artifacts.osci.redhat.com/testing-farm/8876a623-b410-499c-affd-727dbb89054f/work-edge-x86-commitqqviedpt/tmt/plans/edge-test/edge-x86-commit/execute/data/guest/default-0/tmt/tests/edge-test-1/output.txt

After reproducing this failure locally, it seems anaconda fails to install the bootloader:

Installing boot loader
..
Performing post-installation setup tasks
================================================================================
================================================================================
Question

 The following error occurred while installing the boot loader. The system will
 not be bootable. Would you like to ignore this and continue with installation?
 
 failed to write boot loader configuration

An unknown error has occured, look at the /tmp/anaconda-tb* file(s) for more details


===============================================================================

ne 311, in start
    item.start()
  File "/usr/lib64/python3.9/site-packages/pyanaconda/installation_tasks.py", line 311, in start
    item.start()
  File "/usr/lib64/python3.9/site-packages/pyanaconda/installation.py", line 399, in run_installation
    queue.start()
  File "/usr/lib64/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/site-packages/pyanaconda/threading.py", line 275, in run
    threading.Thread.run(self)
pyanaconda.modules.common.errors.installation.BootloaderInstallationError: failed to write boot loader configuration

What do you want to do now?
1) Report Bug
2) Debug
3) Run shell
4) Quit

Feb 05 '25 13:02 mcattamoredhat

Do you have bootupd in your tree? See https://pagure.io/workstation-ostree-config/pull-request/600# that explicitly excludes it. If the Edge-9.6 setup isn't ready for bootupd then at the current time the package needs to be excluded.

(All this pain will go away when we consolidate on a reference, tested base image defined as a container image going forward)

Feb 05 '25 13:02 cgwalters

Right, we actually did that with fedora before we supported it afaict https://github.com/osbuild/images/pull/918

PR for rhel https://github.com/osbuild/images/pull/1195 PR here for integration https://github.com/osbuild/osbuild-composer/pull/4597

Feb 05 '25 13:02 runcom

Excluding bootupd with https://github.com/osbuild/images/pull/1195 fixes the bootloader issue.

Nevertheless, edge-commit in RHEL-9.6 is still failing, we will continue debugging.

Feb 05 '25 16:02 mcattamoredhat

Seems like we're now hitting a greenboot issue somehow - @say-paul is on it (but excluding bootupd actually fixes the anaconda failure)

Feb 06 '25 08:02 runcom

we have a bug in greenboot https://github.com/fedora-iot/greenboot/blob/main/usr/libexec/greenboot/greenboot-rpm-ostree-grub2-check-fallback#L8-L15 - fix in https://github.com/fedora-iot/greenboot/pull/199

Feb 06 '25 09:02 runcom
