
Improve `talos upgrade` command

Open astro-stan opened this issue 1 month ago • 1 comments

Feature Request

Description

Context: https://github.com/siderolabs/talos/discussions/12132

TL;DR:

talosctl upgrade does not validate that the chosen image is compatible with the node. To make matters worse, the --image option has a default value, so omitting it upgrades the node to a "default" image.

From the node's point of view the "default" image is effectively arbitrary: it is unlikely to have the same schematic ID as the image on the node, and it might be an older version than what is currently installed.

An implicit upgrade to a "default" image would be unexpected for a Talos administrator, but it should generally be safe, since Talos' A/B upgrades and the talosctl rollback command are assumed to make it reversible.

However, as shown in the discussion linked above, there is an edge case. For Talos nodes with secure boot enabled, upgrading to a non-secure boot image completely bricks the node, as systemd-boot gets replaced.

Recovering from this situation requires, at the very least, physical access and a live USB. However, if the node also uses TPM-encrypted partitions/disks, booting from a USB leaves the node stuck in a "booting" state, because the TPM will refuse to unseal the keys for the partitions/disks. Wiping the STATE and EPHEMERAL partitions and starting fresh then becomes the only option, which leads to data loss.

Feature Request

With that in mind, I would like to suggest a few ideas on how to improve the upgrade command, so that you cannot accidentally shoot yourself in the foot:

  • Remove the default value for the --image option
  • Perform validation checks when upgrading and require --force (or something to that effect) if upgrading to an older image, from a secure boot to a non-secure boot image, or if changing image arch.
  • Change the default --image value to be "what is currently specified in the machine config" or require --image to be provided if --insecure is used
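
To make the second suggestion concrete, here is a minimal Go sketch of such a pre-upgrade check. It is not how talosctl works today: the ImageInfo type, CheckUpgrade function, and ErrNeedsForce error are hypothetical names made up for illustration, and the version parsing only handles plain "vMAJOR.MINOR.PATCH" strings.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// ImageInfo is a hypothetical summary of either the running node or the
// target installer image. How these fields would be discovered is out of scope.
type ImageInfo struct {
	Version    string // e.g. "v1.11.3"
	Arch       string // e.g. "amd64"
	SecureBoot bool
}

// ErrNeedsForce is a hypothetical sentinel returned when a risky upgrade
// is attempted without an explicit override.
var ErrNeedsForce = errors.New("risky upgrade, re-run with --force to proceed")

// parseVersion splits "v1.11.3" into [1 11 3]. Pre-release suffixes are
// deliberately not handled in this sketch.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("unexpected version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// isOlder reports whether target sorts strictly before current.
func isOlder(target, current [3]int) bool {
	for i := range target {
		if target[i] != current[i] {
			return target[i] < current[i]
		}
	}
	return false
}

// CheckUpgrade collects the three risks named in the list above and refuses
// the upgrade unless force is set.
func CheckUpgrade(current, target ImageInfo, force bool) error {
	cur, err := parseVersion(current.Version)
	if err != nil {
		return err
	}
	tgt, err := parseVersion(target.Version)
	if err != nil {
		return err
	}
	var risks []string
	if isOlder(tgt, cur) {
		risks = append(risks, fmt.Sprintf("downgrade from %s to %s", current.Version, target.Version))
	}
	if current.SecureBoot && !target.SecureBoot {
		risks = append(risks, "secure boot node, non-secure boot image")
	}
	if current.Arch != target.Arch {
		risks = append(risks, fmt.Sprintf("arch change from %s to %s", current.Arch, target.Arch))
	}
	if len(risks) == 0 || force {
		return nil
	}
	return fmt.Errorf("%w: %s", ErrNeedsForce, strings.Join(risks, "; "))
}

func main() {
	current := ImageInfo{Version: "v1.11.3", Arch: "amd64", SecureBoot: true}
	target := ImageInfo{Version: "v1.11.3", Arch: "amd64", SecureBoot: false}
	if err := CheckUpgrade(current, target, false); err != nil {
		fmt.Println("refused:", err)
	}
}
```

The point of collecting all risks before returning is that the administrator sees every reason the upgrade was refused in one message, rather than fixing them one at a time.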

Bonus: Add a "boot into maintenance mode without wiping the system disk" boot entry to the Talos ISO

astro-stan, Nov 06 '25 11:11

I want to rephrase the goal.

I think we should drop the default for --image, and even hide the flag completely as a legacy option.

Instead, we should have:

  • Image Factory host, defaults to factory.talos.dev
  • schematic ID, defaults to the machine's schematic
  • Talos version, defaults to talosctl version
  • SecureBoot/non-SecureBoot - defaults to machine's status
  • Platform, defaults to the machine's platform

With that in hand, talosctl can build the correct upgrade URL; most of the time, specifying just one flag should be enough to produce a correct image reference.
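
As a rough illustration of that idea, the Go sketch below assembles an installer reference from the five inputs listed above. The UpgradeTarget type and InstallerRef method are hypothetical, and the reference shape host/<platform>-installer[-secureboot]/<schematic>:<version> is an assumption about how Image Factory names installer images, not something confirmed in this thread. The schematic ID is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
)

// UpgradeTarget is a hypothetical bundle of the inputs from the list above.
// In the proposal, every field except FactoryHost and Version would default
// to values read from the machine itself.
type UpgradeTarget struct {
	FactoryHost string // empty means factory.talos.dev
	SchematicID string // would default to the machine's schematic
	Version     string // would default to the talosctl version
	SecureBoot  bool   // would default to the machine's status
	Platform    string // would default to the machine's platform, e.g. "metal"
}

// InstallerRef builds an image reference using the assumed naming scheme
// host/<platform>-installer[-secureboot]/<schematic>:<version>.
func (t UpgradeTarget) InstallerRef() (string, error) {
	if t.SchematicID == "" || t.Version == "" || t.Platform == "" {
		return "", errors.New("schematic ID, version and platform are required")
	}
	host := t.FactoryHost
	if host == "" {
		host = "factory.talos.dev"
	}
	name := t.Platform + "-installer"
	if t.SecureBoot {
		name += "-secureboot"
	}
	return fmt.Sprintf("%s/%s/%s:%s", host, name, t.SchematicID, t.Version), nil
}

func main() {
	target := UpgradeTarget{
		SchematicID: "0123abcd", // placeholder, not a real schematic ID
		Version:     "v1.11.3",
		SecureBoot:  true,
		Platform:    "metal",
	}
	ref, err := target.InstallerRef()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ref)
}
```

Because the secure boot variant is derived from the machine's own status rather than typed by hand, a secure boot node cannot silently end up with a non-secure boot installer unless the administrator overrides that field explicitly.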

smira, Nov 06 '25 15:11