
`talosctl upgrade --image some:image` does not re-pull the image

Open • utkuozdemir opened this issue 3 years ago • 1 comment

Bug Report

Description

  • Run `talosctl upgrade --image some:image` with an invalid installer image.
  • Fix the image and push it under the same tag: `docker push some:image`.
  • Run `talosctl upgrade --image some:image` again. The image is not re-pulled, so the upgrade keeps failing.

We could add a flag to the upgrade command, such as `--force-pull`, to force the image to be pulled again.

Logs

172.20.0.2: [talos] upgrade request received: preserve true, staged false, force false
172.20.0.2: [talos] validating "ghcr.io/utkuozdemir/talos-installer:test-break"
172.20.0.2: machined Unknown [/machine.MachineService/Upgrade] 2.473929476s unary error validating installer image "ghcr.io/utkuozdemir/talos-installer:test-break": failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/bin/installer": stat /bin/installer: no such file or directory: unknown (:authority=localhost;content-type=application/grpc;proxyfrom=172.20.0.2,172.20.0.3,172.20.0.4;talos-role=os:admin;user-agent=grpc-go/1.47.0)
....
....
....
172.20.0.2: [talos] upgrade request received: preserve true, staged false, force false
172.20.0.2: [talos] validating "ghcr.io/utkuozdemir/talos-installer:test-break"
172.20.0.2: machined Unknown [/machine.MachineService/Upgrade] 63.348966ms unary error validating installer image "ghcr.io/utkuozdemir/talos-installer:test-break": failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/bin/installer": stat /bin/installer: no such file or directory: unknown (:authority=localhost;content-type=application/grpc;proxyfrom=172.20.0.2,172.20.0.3,172.20.0.4;talos-role=os:admin;user-agent=grpc-go/1.47.0)

utkuozdemir, Jun 15 '22 18:06

The root cause is that the image is pulled and cached by the system containerd in memory (on tmpfs), so a later upgrade request with the same tag reuses the cached copy.

Rebooting the node clears that cache, so a reboot is enough as a workaround.

The proper fix is to always pull the image while processing the upgrade API request, but to use the cached image when running the actual upgrade.
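
To make the difference between the two behaviors concrete, here is a minimal Go sketch. It is not Talos or containerd code: `Registry`, `Store`, `PullIfMissing`, `PullAlways`, and `Cached` are hypothetical names invented for illustration, and the registry is modeled as a plain map from tag to digest.

```go
package main

import "fmt"

// Registry is a hypothetical stand-in for a remote registry:
// it maps a tag to the digest that tag currently points at.
type Registry map[string]string

// Store is a hypothetical stand-in for the in-memory image cache.
type Store struct {
	registry Registry
	cache    map[string]string // tag -> digest recorded by the last pull
}

// NewStore returns a Store with an empty cache.
func NewStore(r Registry) *Store {
	return &Store{registry: r, cache: map[string]string{}}
}

// PullIfMissing models the reported behavior: once a tag is cached,
// the registry is never consulted again for that tag.
func (s *Store) PullIfMissing(tag string) (string, error) {
	if digest, ok := s.cache[tag]; ok {
		return digest, nil
	}
	return s.PullAlways(tag)
}

// PullAlways models the proposed fix for the API request path:
// re-resolve the tag every time and refresh the cache.
func (s *Store) PullAlways(tag string) (string, error) {
	digest, ok := s.registry[tag]
	if !ok {
		return "", fmt.Errorf("image %q not found in registry", tag)
	}
	s.cache[tag] = digest
	return digest, nil
}

// Cached returns the digest from the last pull without contacting
// the registry, as the actual upgrade step would.
func (s *Store) Cached(tag string) (string, bool) {
	digest, ok := s.cache[tag]
	return digest, ok
}

func main() {
	reg := Registry{"some:image": "sha256:broken"}
	store := NewStore(reg)

	first, _ := store.PullIfMissing("some:image")
	reg["some:image"] = "sha256:fixed" // the fixed image is pushed under the same tag

	stale, _ := store.PullIfMissing("some:image") // still the broken digest
	fresh, _ := store.PullAlways("some:image")    // picks up the fixed digest
	used, _ := store.Cached("some:image")         // what the upgrade step would run

	fmt.Println(first, stale, fresh, used)
}
```

In this model the upgrade request handler would call `PullAlways`, and the later upgrade step would call `Cached`, so a re-pushed tag is picked up without a reboot while the upgrade itself still runs from the local copy.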

smira, Jun 21 '22 15:06