elemental-toolkit
upgrade failed when upgrading from passive
cos-toolkit version: master (built probably June 14/15 whenever the initramfs "cut" problem was introduced)
CPU architecture, OS, and Version: amd64 Leap 15.3
Describe the bug This is a potential bug.
elemental upgrade --reboot failed to remount /run/initramfs/cos-state read-only and, as a result, never rebooted (the elemental CLI never finished the Cleanup step).
I wanted to upgrade to pick up the "cut" fix. My system was booted to passive because active had the missing "cut". Ran "elemental upgrade --reboot" configured to upgrade from my local registry. (Note: references to k3os have nothing to do with rancher os2. I have basically built my own flavor of k3os v1 with elemental toolkit.)
/etc/luet/luet.yaml:
logging:
  color: false
  enable_emoji: false
general:
  debug: false
  spinner_charset: 9
repositories:
- name: local
  description: local
  type: docker
  enable: true
  cached: true
  priority: 1
  verify: false
  urls:
  - 192.168.0.33:5000/k3os
/etc/elemental/config.yaml:
upgrade:
  system:
    uri: docker:192.168.0.33:5000/k3os:staging
  recovery-system:
    uri: channel:recovery/cos
reboot: true
Ran "elemental upgrade --reboot" as root.
To Reproduce Boot to passive and attempt an upgrade? (Not 100% sure.)
Expected behavior Expected the upgrade to complete and reboot.
Logs
INFO[2022-06-17T10:43:14Z] Upgrade completed
E0617 10:43:14.216350 19578 mount_linux.go:195] Mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /run/initramfs/cos-state --scope -- mount -t auto -o remount,ro /dev/sda4 /run/initramfs/cos-state
Output: Running scope as unit: run-r72f77cfd7d8145ac9d1808cf43912cc5.scope
mount: /run/initramfs/cos-state: mount point is busy.
ERRO[2022-06-17T10:43:14Z] Failed mounting device /dev/sda4 with label COS_STATE
mount:
/dev/sda4 on /run/initramfs/cos-state type ext4 (rw,relatime)
Clearly, the intention is for COS_STATE to be remounted as read-only.
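To confirm whether the remount took effect, the mount options can be read back from the kernel's mount table. The following is a minimal sketch, not part of elemental: `mount_mode` is a hypothetical helper name, and it assumes the standard `/proc/mounts` column layout (device, mount point, filesystem type, options).

```shell
# Hypothetical helper (not part of elemental): print "rw" or "ro" for a mount point.
# $1: mount point to look up
# $2: mounts table to read (assumed /proc/mounts format; defaults to /proc/mounts)
mount_mode() {
  awk -v mp="$1" '
    $2 == mp {
      # Options are comma-separated in column 4; pick out the rw/ro flag.
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++)
        if (opts[i] == "ro" || opts[i] == "rw") mode = opts[i]
    }
    END {
      # Fail if the mount point was not found; with stacked mounts the last entry wins.
      if (mode == "") exit 1
      print mode
    }
  ' "${2:-/proc/mounts}"
}

# Example on a live system:
#   mount_mode /run/initramfs/cos-state
```

For the state shown in the log above, this would be expected to report `rw`, since the remount failed.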
I ran this directly:
systemd-run --description="Kubernetes transient mount for /run/initramfs/cos-state" --scope -- mount -t auto -o remount,ro /dev/sda4 /run/initramfs/cos-state
Running scope as unit: run-r16db1092507044c5b1eb2947d629aeb3.scope
mount: /run/initramfs/cos-state: mount point is busy.
Installed lsof (via scp); it showed no open files on the mount point.
Discovered that the backing file of the loopback device had been deleted:
k3os-staging:/var/log # losetup -a
/dev/loop0: [2052]:262147 (/run/initramfs/cos-state/cOS/passive.img (deleted))
Even just a simple umount of /run/initramfs/cos-state reported target busy.
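The `losetup -a` output above marks the backing file as `(deleted)`. A small filter over that output can list every loop device in this state, which is a quick way to spot the condition on other machines. This is only a sketch: `deleted_loops` is a hypothetical helper name, and it assumes the `losetup -a` line format shown above.

```shell
# Hypothetical helper: print loop devices whose backing file has been deleted.
# Reads `losetup -a` style output on stdin, one device per line, e.g.
#   /dev/loop0: [2052]:262147 (/path/to/image.img (deleted))
deleted_loops() {
  # The device name is everything before the first colon;
  # a deleted backing file makes the line end in "(deleted))".
  awk -F: '/\(deleted\)\)$/ { print $1 }'
}

# Example on a live system (losetup typically needs root):
#   losetup -a | deleted_loops
```

A loop device listed here still pins the filesystem that held the deleted image, which is consistent with the "mount point is busy" errors even when lsof reports nothing.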
I tried everything I could think of to make the device not busy, to no avail. I ended up rebooting, since the upgrade itself had succeeded; it only died while unwinding in Cleanup() in elemental-cli.
Additional context k3s WAS running at the time of the upgrade, but I have done this many times now, so I don't suspect it affected anything. However, I am using the embedded etcd for the first time, and I'm not sure whether that contributed to this.
I realize now that loop0 is mounted on /, so any file opened while the filesystem was mounted rw could have caused this. I ran k3s-killall.sh before attempting all of the diagnostics above, but not before the initial upgrade attempt. I bet that's what it was: something opened a file while the upgrade had / mounted rw, and by the time the upgrade completed, / was busy.
Thanks for the report @kcburge !
When implementing the upgrade I did hit this on occasion: the cos-state partition could not be remounted RO after the upgrade because it was busy. I could never reproduce it consistently enough to find out what was keeping it from being remounted.
In the end, it should not affect much, as it's done after the upgrade is finished, so the system is correct and a reboot should be the next step. Still, we would really like to leave the system in the same state as when we started.
I guess this kind of makes sense when upgrading from passive: we move the old active into passive before upgrading the active, so we are removing the image the current system is running from.
Do you have your custom k3os and the cloud config at hand so we can try to reproduce it? Were you able to reproduce this, or was it a one-time issue?
Thanks!
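The explanation above (the running passive image gets replaced while it is still in use) matches general Unix behavior: a deleted file stays alive for as long as something holds it open, and the filesystem containing it stays busy. The sketch below illustrates that behavior with an ordinary temporary file and a file descriptor. It is an analogy only, not what elemental does internally, and the variable names are illustrative.

```shell
# Create a file and keep it open on file descriptor 3.
tmp=$(mktemp)
echo "still here" > "$tmp"
exec 3< "$tmp"

# Remove the directory entry; the data survives because fd 3 still references it.
rm "$tmp"

# The path is gone, but the contents can still be read through the open descriptor.
content=$(cat <&3)
echo "read after delete: $content"

# Only closing the last reference actually releases the file.
exec 3<&-
```

A loop device attached to passive.img plays the role of the open descriptor here, so the image and the partition holding it stay in use until the device is detached or the system reboots.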
See https://github.com/rancher/elemental-cli/issues/431 for more info. Closing this; feel free to reopen if it is still an issue.