elemental-toolkit upgrade failed when upgrading from passive

cos-toolkit version: master (built probably June 14/15 whenever the initramfs "cut" problem was introduced)

CPU architecture, OS, and Version: amd64 Leap 15.3

Describe the bug This is a potential bug.

"elemental upgrade --reboot" failed to remount /run/initramfs/cos-state read-only and ultimately never rebooted (the elemental CLI never finished the Cleanup step).

I wanted to upgrade to pick up the "cut" fix. My system was booted to passive because active had the missing "cut". Ran "elemental upgrade --reboot" configured to upgrade from my local registry. (Note: references to k3os have nothing to do with rancher os2. I have basically built my own flavor of k3os v1 with elemental toolkit.)

/etc/luet/luet.yaml:

logging:
  color: false
  enable_emoji: false
general:
  debug: false
  spinner_charset: 9
repositories:
  - name: local
    description: local
    type: docker
    enable: true
    cached: true
    priority: 1
    verify: false
    urls:
      - 192.168.0.33:5000/k3os

/etc/elemental/config.yaml:

upgrade:
  system:
    uri: docker:192.168.0.33:5000/k3os:staging
  recovery-system:
    uri: channel:recovery/cos
reboot: true

Ran "elemental upgrade --reboot" as root.

To Reproduce Boot to passive and attempt an upgrade? (Not 100% sure.)

Expected behavior Expected the upgrade to complete and reboot.

Logs

INFO[2022-06-17T10:43:14Z] Upgrade completed
E0617 10:43:14.216350   19578 mount_linux.go:195] Mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /run/initramfs/cos-state --scope -- mount -t auto -o remount,ro /dev/sda4 /run/initramfs/cos-state
Output: Running scope as unit: run-r72f77cfd7d8145ac9d1808cf43912cc5.scope
mount: /run/initramfs/cos-state: mount point is busy.

ERRO[2022-06-17T10:43:14Z] Failed mounting device /dev/sda4 with label COS_STATE

mount:

/dev/sda4 on /run/initramfs/cos-state type ext4 (rw,relatime)

Clearly, the intention is for COS_STATE to be remounted as read-only.

I ran this directly:

systemd-run --description="Kubernetes transient mount for /run/initramfs/cos-state" --scope -- mount -t auto -o remount,ro /dev/sda4 /run/initramfs/cos-state
Running scope as unit: run-r16db1092507044c5b1eb2947d629aeb3.scope
mount: /run/initramfs/cos-state: mount point is busy.

Installed lsof (via scp); it reported no open files on the mount.

Discovered that the backing file of the loop device had been deleted.

k3os-staging:/var/log # losetup -a
/dev/loop0: [2052]:262147 (/run/initramfs/cos-state/cOS/passive.img (deleted))
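
As a minimal sketch of how to spot this condition quickly, the helper below filters `losetup -a` style output for loop devices whose backing file is marked `(deleted)`. The function name `deleted_loop_devices` is hypothetical, and it assumes the output format shown above (device name followed by a colon, with the `(deleted)` marker on the same line).

```shell
# Hypothetical helper: read `losetup -a` style lines on stdin and print
# the loop devices whose backing file has been unlinked.
deleted_loop_devices() {
  # Keep only lines carrying the "(deleted)" marker, then strip the
  # trailing colon from the first field to get the bare device path.
  awk '/\(deleted\)/ { sub(/:$/, "", $1); print $1 }'
}
```

On an affected system it could be used as `losetup -a | deleted_loop_devices`.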

Even just a simple umount of /run/initramfs/cos-state reported target busy.

I tried everything I could think of to make the device not busy, to no avail. I ended up rebooting, since the upgrade had actually succeeded; it just died while unwinding in Cleanup() in elemental-cli.

Additional context k3s WAS running at the time of the upgrade, but, I have done this many times now, so I don't suspect this affected anything. However, I am using the embedded etcd for the first time. Not sure if that might have contributed to this.

Jun 17 '22 12:06 kcburge

I realize now that loop0 is mounted on /, so any file opened while the fs was mounted rw could have caused this. I ran k3s-killall.sh before attempting all of the above diagnostics, but not before the initial upgrade attempt. I bet that's what it was: something opened a file while the upgrade had / mounted rw, and by the time the upgrade completed, / was busy.
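
A minimal sketch for confirming where a loop device is mounted, assuming `/proc/mounts` style input (device in the first field, mount point in the second). The function name `mounts_of_device` is hypothetical and not part of elemental-cli.

```shell
# Hypothetical helper: print every mount point whose source device
# matches $1, reading /proc/mounts style lines from stdin.
mounts_of_device() {
  awk -v dev="$1" '$1 == dev { print $2 }'
}
```

For example, `mounts_of_device /dev/loop0 < /proc/mounts` would print `/` on a system like the one described here, which shows that the root filesystem itself keeps the image on COS_STATE in use.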

Jun 17 '22 12:06 kcburge

Thanks for the report @kcburge !

When implementing the upgrade I did hit this on occasion: the cos-state partition could not be remounted RO after the upgrade because it was busy. I could never reproduce it consistently enough to find out what was keeping it from being remounted.

In the end, it should not affect much, as it's done after the upgrade is finished, so the system is correct and a reboot should be the next step. Still, we would really like to leave the system in the same state as when we started.

I guess this kind of makes sense if upgrading from passive as we move the old active into passive before upgrading the active, thus we are removing the current system.

Do you have your custom k3os and the cloud config you used at hand, so we can try to reproduce it? Were you able to reproduce this, or was it a one-time issue?

Thanks!

Jun 30 '22 08:06 Itxaka

See https://github.com/rancher/elemental-cli/issues/431 for more info, closing this, feel free to reopen if it is still an issue

May 11 '23 14:05 frelon
