Improve `talosctl upgrade` command
Feature Request
Description
Context: https://github.com/siderolabs/talos/discussions/12132
TL;DR:
`talosctl upgrade` does not validate whether the chosen image is compatible with the node. To make matters worse, the `--image` option has a default value, so omitting it upgrades the node to a "default" image.
From the node's point of view the "default" image is effectively arbitrary: it is unlikely to have the same schematic ID as the image on the node, and it might be an older version than what is currently installed.
An implicit upgrade to a "default" image would be unexpected for a Talos administrator, but it should generally be safe, since Talos' A/B upgrades and the `talosctl rollback` command are assumed to make it reversible.
However, as shown in the discussion linked above, there is an edge case: for Talos nodes with secure boot enabled, upgrading to a non-secure boot image completely bricks the node, as systemd-boot gets replaced.
Recovering from such a situation requires, at the very least, physical access and a live USB. However, if combined with TPM-encrypted partitions/disks, booting from a USB causes the node to get stuck in a "booting" state, as the TPM will refuse to unseal the keys for the partitions/disks. Wiping the STATE and EPHEMERAL partitions and starting fresh then becomes the only option, leading to data loss.
Feature Request
With that in mind, I would like to suggest a few ideas on how to improve the upgrade command, so that you cannot accidentally shoot yourself in the foot:
- Remove the default value for the `--image` option
- Perform validation checks when upgrading and require `--force` (or something to that effect) if upgrading to an older image, from a secure boot to a non-secure boot image, or if changing image arch.
- Change the default `--image` value to be "what is currently specified in the machine config" or require `--image` to be provided if `--insecure` is used
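
To make the second suggestion more concrete, here is a rough sketch of what such a pre-upgrade check could look like. This is not how talosctl is implemented today: `ImageInfo`, `ValidateUpgrade` and `ErrNeedsForce` are hypothetical names made up for illustration, and how the node's and the image's properties would actually be discovered is left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ImageInfo is a hypothetical summary of either the running node or the
// target installer image. It is not an existing Talos type.
type ImageInfo struct {
	Major, Minor, Patch int
	SecureBoot          bool
	Arch                string
}

// olderThan reports whether a is a lower version than b.
func (a ImageInfo) olderThan(b ImageInfo) bool {
	if a.Major != b.Major {
		return a.Major < b.Major
	}
	if a.Minor != b.Minor {
		return a.Minor < b.Minor
	}
	return a.Patch < b.Patch
}

// ErrNeedsForce is returned when the upgrade looks risky and force is not set.
var ErrNeedsForce = errors.New("refusing upgrade without --force")

// ValidateUpgrade applies the three checks proposed in this issue:
// downgrade, secure boot to non-secure boot, and architecture change.
func ValidateUpgrade(current, target ImageInfo, force bool) error {
	var reasons []string
	if target.olderThan(current) {
		reasons = append(reasons, "target image is older than the installed version")
	}
	if current.SecureBoot && !target.SecureBoot {
		reasons = append(reasons, "node uses secure boot but target image does not")
	}
	if current.Arch != target.Arch {
		reasons = append(reasons, "target image architecture differs from the node")
	}
	if len(reasons) == 0 || force {
		return nil
	}
	return fmt.Errorf("%w: %s", ErrNeedsForce, strings.Join(reasons, "; "))
}

func main() {
	// Illustrative values only.
	node := ImageInfo{Major: 1, Minor: 11, Patch: 2, SecureBoot: true, Arch: "amd64"}
	image := ImageInfo{Major: 1, Minor: 11, Patch: 3, SecureBoot: false, Arch: "amd64"}
	fmt.Println(ValidateUpgrade(node, image, false))
}
```

Collecting all reasons before returning means the administrator sees every problem at once, rather than fixing one and hitting the next.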
Bonus: Add a "boot into maintenance mode without wiping the system disk" boot entry to the Talos ISO
I want to rephrase the goal.
I think we should drop the default `--image`, and even hide the flag completely as a legacy option.
Instead, we should have:
- Image Factory host, defaults to `factory.talos.dev`
- schematic ID, defaults to the machine's schematic
- Talos version, defaults to `talosctl` version
- SecureBoot/non-SecureBoot - defaults to machine's status
- Platform, defaults to the machine's platform
With that in hand, talosctl can build the correct upgrade URL, and most of the time specifying just one flag should produce a correct image reference.
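
As a rough sketch of how the components listed above could be resolved into an image reference, something like the following might work. All type and function names (`NodeState`, `Overrides`, `Resolve`, `ImageRef`) are hypothetical. The path layout `<host>/<platform>-installer[-secureboot]/<schematic>:<version>` is an assumption about Image Factory naming and should be checked against the Image Factory documentation.

```go
package main

import "fmt"

// NodeState holds the values that would be read from the machine.
// Hypothetical type; field names are illustrative only.
type NodeState struct {
	SchematicID string
	SecureBoot  bool
	Platform    string
}

// Overrides holds optional flag values. Empty or nil means "use the default".
type Overrides struct {
	FactoryHost string
	SchematicID string
	Version     string
	SecureBoot  *bool // nil keeps the machine's current status
	Platform    string
}

// Target is the fully resolved set of components.
type Target struct {
	FactoryHost, SchematicID, Version, Platform string
	SecureBoot                                  bool
}

// pick returns the override if it is set, otherwise the fallback.
func pick(override, fallback string) string {
	if override != "" {
		return override
	}
	return fallback
}

// Resolve applies the defaults from the list above: factory host, the
// machine's schematic, the client version, the machine's secure boot
// status, and the machine's platform.
func Resolve(node NodeState, clientVersion string, o Overrides) Target {
	t := Target{
		FactoryHost: pick(o.FactoryHost, "factory.talos.dev"),
		SchematicID: pick(o.SchematicID, node.SchematicID),
		Version:     pick(o.Version, clientVersion),
		Platform:    pick(o.Platform, node.Platform),
		SecureBoot:  node.SecureBoot,
	}
	if o.SecureBoot != nil {
		t.SecureBoot = *o.SecureBoot
	}
	return t
}

// ImageRef builds the reference using the ASSUMED path layout
// <host>/<platform>-installer[-secureboot]/<schematic>:<version>.
func (t Target) ImageRef() string {
	name := t.Platform + "-installer"
	if t.SecureBoot {
		name += "-secureboot"
	}
	return fmt.Sprintf("%s/%s/%s:%s", t.FactoryHost, name, t.SchematicID, t.Version)
}

func main() {
	// Placeholder schematic ID and versions, for illustration only.
	node := NodeState{SchematicID: "<schematic-id>", SecureBoot: true, Platform: "metal"}
	fmt.Println(Resolve(node, "v1.11.0", Overrides{}).ImageRef())
	// Overriding a single component, here the version, is enough.
	fmt.Println(Resolve(node, "v1.11.0", Overrides{Version: "v1.11.1"}).ImageRef())
}
```

Using a pointer for the secure boot override distinguishes "not specified" from an explicit `false`, so the machine's status is never silently replaced.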