[bug] Failover from node conditions seemingly not possible. Have to delete the cluster
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
It seems you can destroy an Omni-provisioned Talos cluster when a single node enters the `diskpressure=true` state:
1. create a cluster
2. deploy openebs-localpv (right now there is a bug where the provisioner fails to delete the PV when the disk is under pressure)
3. wait for the node to become unschedulable (`diskpressure=true`)
4. now try to recover from that "easily":
   - `talosctl list ...` lets you see the data that wasn't garbage-collected and that you have to delete, but there is no option to delete it
   - `node-shell` or any equivalent container is unavailable because the node is unschedulable
   - `talosctl node reset` returns `PermissionDenied`
   - deleting the node from the GUI cluster menu gives `failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2) is owned by "MachineSetNodeController"`
   - `omnictl delete machinesetnodes.omni.sidero.dev <MACHINEID>` returns `failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2) is owned by "MachineSetNodeController"`
   - resetting the machine from the ISO puts both the cluster and the machine in an inconsistent state: the machine has status `unknown` in the Omni "cluster" menu and goes into a `streaming success` loop as described in #180
   - deleting the machine from Omni and then resetting the machine from the ISO puts the machine in an inconsistent state: the machine has status `unknown` in the Omni "cluster" menu, doesn't rejoin in the Omni "machines" menu, and goes into `{component: controller-runtime, controller: siderolink.ManagerController, error: error provisioning: rpc error: code = Unknown desc = resource Links.omni.sideo.dev(default/MACHINEID) is not in phase running}`, never rejoining the Omni instance.

You are now stuck with a 2-node cluster and a node that cannot rejoin Omni until you delete the cluster.
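For quick reference, here are the two CLI-based recovery attempts above as a shell transcript (a reconstruction from the errors quoted, not verbatim output; `<NODE-IP>` and `<MACHINEID>` are placeholders):

```shell
# Attempted recoveries against the diskpressure node (reconstruction).

# Reset via the Talos API: rejected on Omni-provisioned clusters.
talosctl -n <NODE-IP> reset
# -> PermissionDenied

# Delete the MachineSetNode resource directly: blocked by the owning controller.
omnictl delete machinesetnodes.omni.sidero.dev <MACHINEID>
# -> failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2)
#    is owned by "MachineSetNodeController"
```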
Expected Behavior
Any/all of the strategies described in step 4 above should work and allow easy failover from node-pressure issues on a single node in an Omni-provisioned cluster.
Steps To Reproduce
See above
What browsers are you seeing the problem on?
No response
Anything else?
Talos 1.7.0, Omni 0.34, Kubernetes 1.29.3
For this I think we need to wait for Talos 1.8. If we reset the whole system disk, it will confuse the Omni etcd audit. That's the main reason we didn't enable EPHEMERAL partition reset yet.
1.8 will allow us to partially reset EPHEMERAL without touching etcd.
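For context, `talosctl reset` already accepts a partition selector via `--system-labels-to-wipe`; the open question in this issue is whether Omni will permit it. A sketch of what a partial reset might look like (`<NODE-IP>` is a placeholder, and flag behavior on Omni-managed nodes is the assumption here):

```shell
# Sketch only: wipe just the EPHEMERAL partition, leaving STATE
# (and therefore the etcd member data Omni audits) intact.
talosctl -n <NODE-IP> reset \
  --system-labels-to-wipe EPHEMERAL \
  --reboot
```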
Hi @Unix4ever
Can you please expose `talosctl node reset` to Omni-provisioned clusters in 1.8? Right now the CLI returns `PermissionDenied`.
Debugging notwithstanding, I think that would be the most consistent way to perform failover quickly in production when any kind of node-related issue arises, particularly those due to node-state decay over time.
BTW I did not check the code for `talosctl node reset`, but I assume it should do something like:
- cordon the node
- evict the workloads
- reset the node to its initial state (at cluster init, not at machine init)
- uncordon the node after the reset is over
If this is not what `talosctl node reset` does, there should be a CLI command that does the above, IMO. Otherwise failover is a true PITA... which doesn't make sense for an immutable OS.
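The steps above can be sketched with existing `kubectl` and `talosctl` commands (a sketch, assuming a hypothetical node named `node-1` reachable at `<NODE-IP>`; the reset step is exactly the part that is currently blocked):

```shell
# Proposed failover flow for a single decayed node.

# 1. Cordon the node so no new pods land on it.
kubectl cordon node-1

# 2. Evict the workloads (drain also cordons, shown separately for clarity).
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# 3. Reset the node to its post-cluster-init state. This is the step that
#    currently returns PermissionDenied on Omni-provisioned clusters.
talosctl -n <NODE-IP> reset --system-labels-to-wipe EPHEMERAL --reboot

# 4. Once the node is back and Ready, uncordon it.
kubectl uncordon node-1
```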
Thanks B./
We can expose reset as soon as it can do a partial reset, without touching etcd state.
Partial reset is planned for 1.8.
I guess we can also run an experiment to see how Omni handles a full EPHEMERAL partition reset.