omni [bug] Failover from node conditions seemingly not possible. Have to delete the cluster

[bug] Failover from node conditions seemingly not possible. Have to delete the cluster

Open bernardgut opened this issue 9 months ago • 3 comments

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

It seems You can destroy an Omni-provisioned Talos cluster with a single node in the diskpressure=true state:

create a cluster
deploy openebs-localpv (right now there is a bug where the provisioner fails to delete the pv when the disk is under pressure)
wait for the node to become unschedulable (diskpressure=true)
Now try to recover from that "easily":

talosctl list ... allows you to see the data that wasnt garbage-collected that you have to delete but there is no option to delete
node-shell or any equivalent container is unavailable because the node is unschedulable
talosctl node reset returns PermissionDenied
Deleting the node from the GUI cluster menu gives you failed to update: resource [MachineSetNodes.omni.sidero.dev](http://machinesetnodes.omni.sidero.dev/)(default/<MACHINEID>@2) is owned by "MachineSetNodeController"
omnictl delete machinesetnodes.omni.sidero.dev <MACHINEID> returns failed to update: resource [MachineSetNodes.omni.sidero.dev](http://machinesetnodes.omni.sidero.dev/)(default/<MACHINEID>@2) is owned by "MachineSetNodeController"
Resetting the machine from the ISO puts both the Cluster and machine in an inconsistent state where the machine has status unknown in Omni "cluster" menu and the machine goes into streaming success loop as described in #180
Deleting the machine from Omni and resetting the machine from the ISO puts the machine in an inconsistent state where the machine has status unknown in Omni "cluster" menu, doesn't rejoin in Omni "machines" menu and the machine goes into {component : controller-runtime, controller: siderolink.ManagerController, error: error provisioning : rpc error: code = Unknown desc = resource Links.omni.sideo.dev(default/MACHINEID) is not in phase running} and never rejoin the Omni instance. You are not stuck with a 2-node cluster and a node that cannot rejoin omni until you delete the cluster.

Expected Behavior

Any/All of the strategies describe in 4. above should work and allow for an easy failover from nodePressure issues in a single node in an Omni-provisionned cluster.

Steps To Reproduce

See above

What browsers are you seeing the problem on?

No response

Anything else?

Talos 1.7.0 Omni 0.34 Kubernetes 1.29.3

Apr 30 '24 11:04 bernardgut

omni omni copied to clipboard

[bug] Failover from node conditions seemingly not possible. Have to delete the cluster

Is there an existing issue for this?

Current Behavior

Expected Behavior

Steps To Reproduce

What browsers are you seeing the problem on?

Anything else?

omni
omni copied to clipboard