[bug] Failover from node conditions seemingly not possible. Have to delete the cluster
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
It seems you can destroy an Omni-provisioned Talos cluster with a single node in the `DiskPressure=true` state:

1. Create a cluster.
2. Deploy openebs-localpv (right now there is a bug where the provisioner fails to delete the PV when the disk is under pressure).
3. Wait for the node to become unschedulable (`DiskPressure=true`).
4. Now try to recover from that "easily" (a hedged sketch of these attempts follows the list):
   - `talosctl list ...` lets you see the data that wasn't garbage-collected and that you have to delete, but there is no option to delete it.
   - `node-shell` or any equivalent container is unavailable because the node is unschedulable.
   - `talosctl reset` on the node returns `PermissionDenied`.
   - Deleting the node from the GUI cluster menu gives you `failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2) is owned by "MachineSetNodeController"`.
   - `omnictl delete machinesetnodes.omni.sidero.dev <MACHINEID>` returns the same `failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2) is owned by "MachineSetNodeController"` error.
   - Resetting the machine from the ISO puts both the cluster and the machine in an inconsistent state: the machine shows status `unknown` in the Omni "cluster" menu and goes into the `streaming success` loop described in #180.
   - Deleting the machine from Omni and then resetting it from the ISO also leaves the machine in an inconsistent state: it shows status `unknown` in the Omni "cluster" menu, doesn't reappear in the Omni "machines" menu, logs `{component: controller-runtime, controller: siderolink.ManagerController, error: error provisioning: rpc error: code = Unknown desc = resource Links.omni.sidero.dev(default/MACHINEID) is not in phase running}`, and never rejoins the Omni instance.

You are now stuck with a 2-node cluster and a node that cannot rejoin Omni until you delete the cluster.
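For concreteness, this is roughly the sequence from a workstation; the node name, node IP, machine ID, and the openebs data path are placeholders/assumptions, not values from an actual run:

```sh
# Confirm the node condition that starts the problem:
kubectl get node <NODE> \
  -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
# -> True once the node is under disk pressure

# Inspect the data that was never garbage-collected (path assumes the
# default openebs-localpv hostpath); talosctl offers no matching delete:
talosctl -n <NODE-IP> list /var/openebs/local

# Both recovery paths fail as described above:
talosctl -n <NODE-IP> reset
# -> rpc error: code = PermissionDenied
omnictl delete machinesetnodes.omni.sidero.dev <MACHINEID>
# -> failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2) is owned by "MachineSetNodeController"
```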
Expected Behavior
Any/all of the strategies described in step 4 above should work and allow an easy failover from `DiskPressure` issues on a single node in an Omni-provisioned cluster.
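For illustration, a sketch of what "easy failover" could look like; both commands exist in talosctl/omnictl today, but whether they should succeed in this state is exactly what this issue is about:

```sh
# Expected: a forced wipe should work even with an unhealthy kubelet...
talosctl -n <NODE-IP> reset --graceful=false

# ...or Omni should detach the stuck node instead of refusing because
# the resource is owned by "MachineSetNodeController":
omnictl delete machinesetnodes.omni.sidero.dev <MACHINEID>
```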
Steps To Reproduce
See above
What browsers are you seeing the problem on?
No response
Anything else?
- Talos 1.7.0
- Omni 0.34
- Kubernetes 1.29.3