
Add pre-upgrade checks for kairos upgrades

Open kpiyush17 opened this issue 1 year ago • 3 comments

Kairos version:

NAME="openSUSE Leap"
VERSION="15.5"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.5"
PRETTY_NAME="openSUSE Leap 15.5"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.5"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Leap"
LOGO="distributor-logo-Leap"
KAIROS_NAME="kairos-core-opensuse-leap"
KAIROS_VERSION="v2.4.3"
KAIROS_ID="opensuse-leap"
KAIROS_ID_LIKE="kairos-core-opensuse-leap"
KAIROS_VERSION_ID="v2.4.3"
KAIROS_PRETTY_NAME="kairos-core-opensuse-leap v2.4.3"
KAIROS_BUG_REPORT_URL="https://github.com/spectrocloud/CanvOS/issues"
KAIROS_HOME_URL="https://github.com/spectrocloud/CanvOS"
KAIROS_IMAGE_REPO="spectrocloud/CanvOS"
KAIROS_IMAGE_LABEL="latest"
KAIROS_GITHUB_REPO=""
KAIROS_VARIANT="opensuse-leap"
KAIROS_FLAVOR="opensuse-leap"
KAIROS_ARTIFACT="kairos-core-opensuse-leap-v2.4.3"

CPU architecture, OS, and Version:

Linux edge-3d9738427b5d0ea3da3ed0e9d3aa0894 5.14.21-150500.55.36-default #1 SMP PREEMPT_DYNAMIC Tue Oct 31 08:37:43 UTC 2023 (e7a2e23) x86_64 x86_64 x86_64 GNU/Linux

Describe the bug

When running kairos-agent upgrade inside a Kubernetes pod, if the pod gets terminated partway through the upgrade, we observed that on retry the upgrade never recovers from the errors below.

We usually get two types of errors in this scenario:

Type 1:

DEBU[2024-02-27T18:23:13Z] Running cmd: 'mkfs.ext2 -L COS_ACTIVE /host/run/initramfs/cos-state/cOS/transition.img'
ERRO[2024-02-27T18:23:13Z] Failed deploying image to file '/host/run/initramfs/cos-state/cOS/transition.img': exit status 1
DEBU[2024-02-27T18:23:13Z] Unmounting partition COS_RECOVERY
DEBU[2024-02-27T18:23:13Z] Mounting partition COS_STATE
E0227 18:23:13.333424  114923 mount_linux.go:232] Mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /host/run/initramfs/cos-state --scope -- mount -t auto -o remount,ro /dev/sda4 /host/run/initramfs/cos-state
Output: Running scope as unit: run-r9563c116de21466ba9d99b061aeb72bc.scope
mount: /host/run/initramfs/cos-state: mount point is busy.

ERRO[2024-02-27T18:23:13Z] Failed mounting device /dev/sda4 with label COS_STATE
2 errors occurred:
        * exit status 1
        * mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /host/run/initramfs/cos-state --scope -- mount -t auto -o remount,ro /dev/sda4 /host/run/initramfs/cos-state
Output: Running scope as unit: run-r9563c116de21466ba9d99b061aeb72bc.scope
mount: /host/run/initramfs/cos-state: mount point is busy.

Type 2:

INFO[2024-03-07T09:52:48Z] Moving /host/run/initramfs/cos-state/cOS/active.img to /host/run/initramfs/cos-state/cOS/passive.img
DEBU[2024-03-07T09:52:48Z] Running cmd: 'mv -f /host/run/initramfs/cos-state/cOS/active.img /host/run/initramfs/cos-state/cOS/passive.img'
ERRO[2024-03-07T09:52:48Z] Failed to move /host/run/initramfs/cos-state/cOS/active.img to /host/run/initramfs/cos-state/cOS/passive.img: exit status 1
DEBU[2024-03-07T09:52:48Z] Not unmounting image, /run/cos/transition doesn't look like mountpoint
DEBU[2024-03-07T09:52:48Z] [Cleanup] Removing /host/run/initramfs/cos-state/cOS/transition.img
DEBU[2024-03-07T09:52:49Z] Unmounting partition COS_RECOVERY
DEBU[2024-03-07T09:52:49Z] Mounting partition COS_STATE
1 error occurred:
        * exit status 1

In both cases, the system is left in an inconsistent state unless we manually reboot the node.

Expected behavior

Add pre-upgrade checks so that every time an upgrade starts, the partitions and other prerequisites are verified to be in a usable state.
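For illustration, a minimal sketch of what such a pre-check could look like, assuming the mount points and image paths from the logs above (the script and its steps are hypothetical, not part of kairos-agent):

#!/bin/sh
# Hypothetical pre-upgrade check: clean up leftovers from an
# interrupted upgrade before kairos-agent runs again.
set -e

STATE_MNT=/host/run/initramfs/cos-state
TRANSITION_IMG="$STATE_MNT/cOS/transition.img"
TRANSITION_MNT=/run/cos/transition

# If a previous run left the transition image mounted, unmount it
# (lazily as a fallback) so the state partition can be remounted.
if mountpoint -q "$TRANSITION_MNT"; then
    umount "$TRANSITION_MNT" || umount -l "$TRANSITION_MNT"
fi

# Remove a stale transition image left behind by the interrupted run.
rm -f "$TRANSITION_IMG"

# Report anything that still holds the state partition busy.
fuser -vm "$STATE_MNT" || true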

kpiyush17 avatar Mar 11 '24 06:03 kpiyush17

This would only happen if the upgrade is killed in the middle of it. How you continue and recover from this state depends on where it was stopped. In the example above, you'd probably have to unmount things manually and try again. There can't be a generic check that covers every possible dirty state.
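For example, for the Type 1 failure above, a manual recovery might look like this (a sketch only; the exact steps depend on where the upgrade died):

# See what still holds the state partition busy (run on the node).
fuser -vm /host/run/initramfs/cos-state

# Lazily unmount the leftover transition mount, then retry the upgrade.
umount -l /run/cos/transition
kairos-agent upgrade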

Are you seeing that a lot? Any specific reason why the upgrades keep dying in the middle of the process?

jimmykarily avatar Mar 19 '24 09:03 jimmykarily

@jimmykarily We are running the kairos upgrade in a pod, and sometimes, if the pod gets killed partway through and a new pod restarts the upgrade, we see this mount-point-busy error.

How you continue and recover from this state

We usually reboot the node.

There can't be a generic check that checks every possible dirty state.

Can we add a pre-check that verifies none of the required mount points are busy, and if one is, tries to unmount it?

kpiyush17 avatar Jun 21 '24 06:06 kpiyush17

I assume you are using a Plan. Then you can run whatever commands you need before running the kairos-agent upgrade.
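For instance, assuming a system-upgrade-controller Plan, the upgrade container's command could be a small wrapper like this (a sketch; the pre-check script is the hypothetical one outlined earlier in this issue):

#!/bin/sh
# Hypothetical wrapper used as the Plan's upgrade command:
# run the cleanup pre-check first, then the actual upgrade.
set -e

/opt/upgrade-pre-check.sh   # hypothetical script, see the sketch above
kairos-agent upgrade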

jimmykarily avatar Jun 21 '24 07:06 jimmykarily