
[bug] Failover from node conditions seemingly not possible. Have to delete the cluster

Open bernardgut opened this issue 9 months ago • 3 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

It seems a single node stuck in the DiskPressure=True condition can effectively destroy an Omni-provisioned Talos cluster:

  1. create a cluster
  2. deploy openebs-localpv (right now there is a bug where the provisioner fails to delete the PVs when the disk is under pressure)
  3. wait for the node to become unschedulable (DiskPressure=True)
  4. Now try to recover from that "easily" (a sketch of the commands tried follows this list):
  • talosctl list ... lets you see the data that wasn't garbage-collected and needs to be deleted, but there is no option to delete it
  • node-shell or any equivalent debug container is unavailable because the node is unschedulable
  • talosctl reset returns PermissionDenied
  • Deleting the node from the GUI cluster menu gives you failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2) is owned by "MachineSetNodeController"
  • omnictl delete machinesetnodes.omni.sidero.dev <MACHINEID> returns the same failed to update: resource MachineSetNodes.omni.sidero.dev(default/<MACHINEID>@2) is owned by "MachineSetNodeController"
  • Resetting the machine from the ISO puts both the cluster and the machine in an inconsistent state: the machine shows status unknown in the Omni "cluster" menu and goes into the streaming success loop described in #180
  • Deleting the machine from Omni and then resetting the machine from the ISO puts the machine in an inconsistent state: it shows status unknown in the Omni "cluster" menu, doesn't rejoin in the Omni "machines" menu, and loops on {component: controller-runtime, controller: siderolink.ManagerController, error: error provisioning: rpc error: code = Unknown desc = resource Links.omni.sidero.dev(default/MACHINEID) is not in phase running}, never rejoining the Omni instance. You are now stuck with a 2-node cluster and a node that cannot rejoin Omni until you delete the cluster.
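
For reference, a minimal sketch of the recovery attempts above as shell commands. `<NODE_NAME>`, `<NODE_IP>`, and `<MACHINEID>` are placeholders, and the inspected path is illustrative:

```sh
# Confirm the node condition that makes it unschedulable
kubectl describe node <NODE_NAME> | grep DiskPressure

# Inspect the leftover data over the Talos API (read-only; there is no delete)
talosctl -n <NODE_IP> list /var

# Try to reset the node over the Talos API (returns PermissionDenied here)
talosctl -n <NODE_IP> reset --graceful=false --reboot

# Try to remove the member directly via omnictl
# (fails: resource is owned by "MachineSetNodeController")
omnictl delete machinesetnodes.omni.sidero.dev <MACHINEID>
```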

Expected Behavior

Any/all of the strategies described in step 4 above should work and allow for an easy failover from node-pressure issues on a single node in an Omni-provisioned cluster.
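
As a sketch of what an easy failover could look like, assuming Omni either allowed the reset over the Talos API or released the MachineSetNode resource (the flags are standard talosctl reset options; the exact recovery path Omni should take is an assumption on my part):

```sh
# Wipe the EPHEMERAL partition and reboot, letting the node rejoin the cluster
talosctl -n <NODE_IP> reset --graceful=false --reboot --system-labels-to-wipe EPHEMERAL
```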

Steps To Reproduce

See above

What browsers are you seeing the problem on?

No response

Anything else?

Talos 1.7.0, Omni 0.34, Kubernetes 1.29.3

bernardgut · Apr 30 '24 11:04