sidero
Fixable/transient failures during node install require reprovisioning
When using Sidero Metal to provision clusters, I've noticed that if there's an issue during node install (e.g., the node PXE boots but the install fails due to a syntax error in configPatches for the corresponding Server object), then on the next reboot the node is directed at PXE boot to boot from disk rather than retry the install. Since the disk is empty, it either sits there with an error saying there is no boot device or enters a reboot loop.
It feels like a node should continue to pxe boot from the network until an install is successful, rather than be directed to boot from disk after it has pxe booted once. The node installer could phone home after a successful install, allowing the tftp server to only then direct the node to boot from disk.
This is exactly how things should work in recent Sidero. Talos reports its installed status back to Sidero via SideroLink, so Sidero will keep booting the node from the network as long as it hasn't finished the install.
Moreover, install/config errors should be presented as part of the Machine and MetalMachine statuses.
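As a rough illustration of the rule described above (keep serving a network boot until the install is reported as finished), here is a minimal sketch. It is not Sidero's actual code: `ServerState`, `render_ipxe`, and `BOOT_BASE_URL` are hypothetical names, and the boot asset paths are placeholders.

```python
from dataclasses import dataclass

# Hypothetical placeholder; a real setup would point at its own boot assets.
BOOT_BASE_URL = "http://boot.example.invalid"


@dataclass
class ServerState:
    """Illustrative per-server record (not a Sidero type)."""
    uuid: str
    # Flipped to True only after the node reports a finished install.
    installed: bool = False


def render_ipxe(state: ServerState) -> str:
    """Return the iPXE script to serve for this server."""
    if state.installed:
        # Install confirmed: leave iPXE so the firmware falls through to disk.
        return "#!ipxe\nexit\n"
    # Not installed yet (or the install failed): keep network booting.
    return (
        "#!ipxe\n"
        f"kernel {BOOT_BASE_URL}/vmlinuz\n"
        f"initrd {BOOT_BASE_URL}/initramfs.xz\n"
        "boot\n"
    )
```

With this rule, a node whose install failed never receives `exit`, so it retries the install on every reboot until the config is fixed.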
I am using the latest sidero, and for this example talos 1.1.1:
[root@dill04 demo]# clusterctl upgrade plan
Checking cert-manager version...
Cert-Manager is already up to date
Checking new release availability...
Latest release available for the v1beta1 API Version of Cluster API (contract):
NAME                    NAMESPACE       TYPE                     CURRENT VERSION   NEXT VERSION
bootstrap-talos         cabpt-system    BootstrapProvider        v0.5.5            Already up to date
control-plane-talos     cacppt-system   ControlPlaneProvider     v0.4.10           Already up to date
cluster-api             capi-system     CoreProvider             v1.2.4            Already up to date
infrastructure-sidero   sidero-system   InfrastructureProvider   v0.5.5            Already up to date
You are already up to date!
I purposefully put a bogus value in my "any" ServerClass so that it chooses /dev/sdx as the install disk. On one of the nodes, we get this at boot:
[ 65.139702] [talos] task loadConfig (1/1): failed: failed to validate config: 1 error occurred:
[ 65.243967] * specified install disk does not exist: "/dev/sdx"
[ 65.315959]
[ 65.333869] [talos] reconfigured wireguard link {"component": "controller-runtime", "controller": "networ}
[ 65.506865] [talos] phase config (5/5): failed
[ 65.560155] [talos] controller failed {"component": "controller-runtime", "controller": "runtime.KmsgLogD}
[ 65.801751] [talos] initialize sequence: failed
[ 67.620190] [talos] error running phase 5 in initialize sequence: task 1/1: failed, failed to validate co:
[ 67.758830] * specified install disk does not exist: "/dev/sdx"
[ 67.830837]
[ 68.022460] [talos] failed to open meta: file does not exist
[ 68.090354] [talos] rebooting in 10 seconds
and peeking at one of the MetalMachines we get
status:
  addresses:
  - address: talos-172-30-223-152
    type: Hostname
  conditions:
  - lastTransitionTime: "2022-10-24T21:24:16Z"
    message: 'Get "https://172.30.223.252:6443/api/v1/nodes?labelSelector=metal.sidero.dev%!F(MISSING)uuid%!D(MISSING)4c4c4544-0054-5a10-804a-c7c04f515631":
      dial tcp 172.30.223.252:6443: connect: no route to host'
    reason: ProviderUpdateFailed
    severity: Warning
    status: "False"
    type: ProviderSet
  - lastTransitionTime: "2022-10-24T21:27:34Z"
    message: "failed to validate config: 1 error occurred:\n\t* specified install
      disk does not exist: \"/dev/sdx\"\n\n"
    reason: TalosConfigLoadFailed
    severity: Error
    status: "False"
    type: TalosConfigLoaded
  ready: true
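To pull out the failing conditions from a status like the one above, a small filter is enough. This is only a sketch: it assumes the status has already been loaded into a Python dict (for example from `kubectl get metalmachine -o json`), and `failed_conditions` is a hypothetical helper name. The sample data is abbreviated from the output above.

```python
from typing import Any, Optional


def failed_conditions(
    status: dict[str, Any], severity: Optional[str] = None
) -> list[dict[str, Any]]:
    """Return conditions whose status is "False", optionally filtered by severity."""
    failed = [c for c in status.get("conditions", []) if c.get("status") == "False"]
    if severity is not None:
        failed = [c for c in failed if c.get("severity") == severity]
    return failed


# Abbreviated copy of the MetalMachine status shown above.
status = {
    "conditions": [
        {"type": "ProviderSet", "status": "False",
         "severity": "Warning", "reason": "ProviderUpdateFailed"},
        {"type": "TalosConfigLoaded", "status": "False",
         "severity": "Error", "reason": "TalosConfigLoadFailed"},
    ],
    "ready": True,
}

for cond in failed_conditions(status, severity="Error"):
    print(f'{cond["type"]}: {cond["reason"]}')
```

Note that `ready` is true in this status even though the TalosConfigLoaded condition reports an error.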
and yet hitting the ipxe endpoint says to exit / boot from disk:
curl http://172.30.223.27:8081/ipxe?uuid=4c4c4544-0054-5a10-804a-c7c04f515631
#!ipxe
exit
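The same check can be scripted. The sketch below fetches the iPXE script for a UUID and reports whether it tells the node to leave iPXE (boot from disk) or to do something else. `fetch_ipxe_script` and `boots_from_disk` are hypothetical helper names, and the only assumption about the endpoint is the `/ipxe?uuid=...` form used in the curl command above.

```python
import urllib.parse
import urllib.request


def fetch_ipxe_script(base_url: str, uuid: str, timeout: float = 5.0) -> str:
    """Fetch the iPXE script served for the given server UUID."""
    query = urllib.parse.urlencode({"uuid": uuid})
    with urllib.request.urlopen(f"{base_url}/ipxe?{query}", timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def boots_from_disk(script: str) -> bool:
    """True if the first real command in the script is `exit`."""
    for line in script.splitlines():
        line = line.strip()
        # Skip blank lines, the "#!ipxe" shebang, and comments.
        if not line or line.startswith("#"):
            continue
        return line.split()[0] == "exit"
    return False


if __name__ == "__main__":
    script = fetch_ipxe_script(
        "http://172.30.223.27:8081", "4c4c4544-0054-5a10-804a-c7c04f515631"
    )
    print("boot from disk" if boots_from_disk(script) else "network boot")
```

For the response shown above, `boots_from_disk` returns True, which matches the behaviour seen on the console.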
This is confirmed by what's in my console on the node, which has this text (scraped from an image so typos but they aren't important):
iPXE 1.21.1+git+9062544+sidero -- Open Source Network Boot Firmware -- http://ipxe.org
Features: DNS HTTP HTTPS iSCSI TFTP AoE ELF MBOOT PXE bzImage Menu PXEXT
net0: d4:ae:52:aa:93:8d using undionly on 0000:01:00.0 (Ethernet) [closed]
[Link:up, TX:0 TXE:1 RX:0 RXE:0]
[TXE: 1 x "Network unreachable (http://ipxe.org/28086011)"]
Waiting for link-up on net0... ok
Configuring (net0 d4:ae:52:aa:93:8d)... ok
net0: 172.30.223.122/255.255.255.0 gw 172.30.223.1
http://172.30.223.27:8081/ipxe... ok
No boot device available
Current boot mode is set to BIOS
Please ensure compatible bootable media is available.
Use the system setup program to change the boot mode as needed.
Strike F1 to retry boot, F2 for system setup, F11 for BIOS boot manager.
In recent versions of sidero it's working as expected. @magicite time to close this one?
If you've confirmed it's fixed, then go ahead and close it. I haven't tried this in a few months, but might be able to try again sometime in October.
That is actually fixed to some extent at least, e.g. invalid machine config will keep the node in a PXE boot loop until the machine config is fixed.