
Fixable/transient failures during node install require reprovisioning

Open magicite opened this issue 3 years ago • 1 comments

When using Sidero Metal to provision clusters, I've noticed that if there's an issue during node install (e.g., the node PXE boots but the install fails due to a syntax error in configPatches for the corresponding Server object), then on the next reboot the PXE boot directs the node to boot from disk rather than retry the install. Since the disk is empty, it either sits there with an error saying no boot device or enters a reboot loop.

It feels like a node should continue to pxe boot from the network until an install is successful, rather than be directed to boot from disk after it has pxe booted once. The node installer could phone home after a successful install, allowing the tftp server to only then direct the node to boot from disk.

magicite avatar Oct 05 '22 19:10 magicite

This is exactly how things should work in recent Sidero. Talos reports installed status via SideroLink back to Sidero, so Sidero will keep booting from network as long as the node hasn't finished the install.

Moreover, install/config errors should be presented as part of the Machine and MetalMachine statuses.

smira avatar Oct 10 '22 10:10 smira
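
The behaviour described in the comment above can be sketched as a small decision function. This is a hypothetical model, not Sidero's actual implementation: the `Server` class, its `installed` flag, `report_installed`, and the contents of the network boot script are illustrative assumptions. Only the `#!ipxe` / `exit` response is taken from this thread.

```python
from dataclasses import dataclass

# Response that tells iPXE to leave the network boot and fall through to disk
# (this is the script quoted later in the thread).
DISK_BOOT_SCRIPT = "#!ipxe\nexit\n"

# Placeholder for the script that boots the installer over the network.
# The real script served by Sidero is not shown in this thread.
NETWORK_BOOT_SCRIPT = "#!ipxe\n# placeholder: kernel/initrd/boot lines for the installer\n"


@dataclass
class Server:
    """Hypothetical record of one bare-metal node."""
    uuid: str
    installed: bool = False  # assumed to flip only when the node reports a finished install


def report_installed(server: Server) -> None:
    """Stand-in for the node phoning home after a successful install."""
    server.installed = True


def ipxe_script_for(server: Server) -> str:
    """Keep network booting until the install has been reported as done."""
    return DISK_BOOT_SCRIPT if server.installed else NETWORK_BOOT_SCRIPT
```

With this model, a node whose install fails never calls `report_installed`, so every reboot gets the network boot script again, and only a completed install switches it to `exit`.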

I am using the latest sidero, and for this example talos 1.1.1:

[root@dill04 demo]# clusterctl upgrade plan
Checking cert-manager version...
Cert-Manager is already up to date

Checking new release availability...

Latest release available for the v1beta1 API Version of Cluster API (contract):

NAME                    NAMESPACE       TYPE                     CURRENT VERSION   NEXT VERSION
bootstrap-talos         cabpt-system    BootstrapProvider        v0.5.5            Already up to date
control-plane-talos     cacppt-system   ControlPlaneProvider     v0.4.10           Already up to date
cluster-api             capi-system     CoreProvider             v1.2.4            Already up to date
infrastructure-sidero   sidero-system   InfrastructureProvider   v0.5.5            Already up to date

You are already up to date!

I purposefully put a bogus value into my "any" ServerClass so that it chooses /dev/sdx as the install disk. On one of the nodes at boot we get

[   65.139702] [talos] task loadConfig (1/1): failed: failed to validate config: 1 error occurred:
[   65.243967]  * specified install disk does not exist: "/dev/sdx"
[   65.315959]
[   65.333869] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "networ}
[   65.506865] [talos] phase config (5/5): failed
[   65.560155] [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KmsgLogD}
[   65.801751] [talos] initialize sequence: failed
[   67.620190] [talos] error running phase 5 in initialize sequence: task 1/1: failed, failed to validate co:
[   67.758830]  * specified install disk does not exist: "/dev/sdx"
[   67.830837]
[   68.022460] [talos] failed to open meta: file does not exist
[   68.090354] [talos] rebooting in 10 seconds

and peeking at one of the MetalMachines we get

status:
  addresses:
  - address: talos-172-30-223-152
    type: Hostname
  conditions:
  - lastTransitionTime: "2022-10-24T21:24:16Z"
    message: 'Get "https://172.30.223.252:6443/api/v1/nodes?labelSelector=metal.sidero.dev%!F(MISSING)uuid%!D(MISSING)4c4c4544-0054-5a10-804a-c7c04f515631":
      dial tcp 172.30.223.252:6443: connect: no route to host'
    reason: ProviderUpdateFailed
    severity: Warning
    status: "False"
    type: ProviderSet
  - lastTransitionTime: "2022-10-24T21:27:34Z"
    message: "failed to validate config: 1 error occurred:\n\t* specified install
      disk does not exist: \"/dev/sdx\"\n\n"
    reason: TalosConfigLoadFailed
    severity: Error
    status: "False"
    type: TalosConfigLoaded
  ready: true

and yet hitting the ipxe endpoint says to exit / boot from disk:

curl http://172.30.223.27:8081/ipxe?uuid=4c4c4544-0054-5a10-804a-c7c04f515631
#!ipxe
exit

This is confirmed by what's on the node's console, which shows this text (scraped from an image, so there may be typos, but they aren't important):

iPXE1.21.1+git+9062544+sidero
- Open Source Network Boot Firmware
--
http://ipxe.org
Features: DNS HTTP HTTPS iSCSI TFTP AoE ELF MBOOT PXE bzImage Menu PXEXT
net0: d4:ae:52:aa:93:8d using undionly on 0000:01:00.0 (Ethernet) [closed]
  [Link:up, TX:0 TXE:1 RX:0 RXE:0]
  [TXE: 1 x "Network unreachable (http://ipxe.org/28086011)"]
Waiting for link-up on net0... ok
Configuring (net0 d4:ae:52:aa:93:8d)... ok
net0: 172.30.223.122/255.255.255.0 gw 172.30.223.1
http://172.30.223.27:8081/ipxe... ok

No boot device available
Current boot mode is set to BIOS
Please ensure compatible bootable media is available.
Use the system setup program to change the boot mode as needed.
Strike F1 to retry boot, F2 for system setup, F11 for BIOS boot manager.

magicite avatar Oct 24 '22 21:10 magicite
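
The mismatch reported in the comment above (a failed `TalosConfigLoaded` condition while the iPXE endpoint answers `exit`) can be checked mechanically. The following is a minimal sketch, not part of Sidero: the helper names are made up for illustration, the status dictionary is an abbreviated copy of the MetalMachine status quoted above, and treating any script whose only command is `exit` as "boot from disk" is an assumption.

```python
def failed_conditions(status: dict) -> list:
    """Return conditions that are False with severity Error."""
    return [
        c for c in status.get("conditions", [])
        if c.get("status") == "False" and c.get("severity") == "Error"
    ]


def boots_from_disk(ipxe_script: str) -> bool:
    """True if the script contains nothing but `exit` (ignoring `#` lines)."""
    commands = [
        line.strip() for line in ipxe_script.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    return commands == ["exit"]


def is_stuck(status: dict, ipxe_script: str) -> bool:
    """Install-related error reported, yet the node is told to boot from disk."""
    return bool(failed_conditions(status)) and boots_from_disk(ipxe_script)


# Abbreviated from the MetalMachine status quoted in this thread.
status = {
    "conditions": [
        {"type": "ProviderSet", "status": "False", "severity": "Warning",
         "reason": "ProviderUpdateFailed"},
        {"type": "TalosConfigLoaded", "status": "False", "severity": "Error",
         "reason": "TalosConfigLoadFailed"},
    ],
    "ready": True,
}

if __name__ == "__main__":
    print(is_stuck(status, "#!ipxe\nexit\n"))
```

In practice the status would come from the MetalMachine object (for example as JSON output from kubectl) and the script from the same request to the `/ipxe?uuid=...` endpoint shown above.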

In recent versions of sidero it's working as expected. @magicite time to close this one?

linuxmaniac avatar Sep 19 '23 09:09 linuxmaniac

If you've confirmed it's fixed, then go ahead and close it. I haven't tried this in a few months, but might be able to try again sometime in October.

magicite avatar Sep 22 '23 14:09 magicite

That is actually fixed, to some extent at least: e.g. an invalid machine config will keep the node in a PXE boot loop until the machine config is fixed.

smira avatar Sep 23 '23 07:09 smira