[feature] Automatic replacement of dead machine
Problem Description
I have defined a cluster using a template and machine classes. The cluster needs 3 control plane machines and 3 workers, and I have more than 3 machines available for each role. The cluster is created and healthy. Then one of my machines goes kaboom and is no longer available:
Cluster "test" RUNNING Not Ready (5/6) (healthy/total)
├── Kubernetes Upgrade Done
├── Talos Upgrade Done
├── Control Plane "test-control-planes" Running Not Ready (2/3)
│   ├── Load Balancer Ready
│   ├── Status Checks OK
│   ├── Machine "8b934aff-edd4-4737-96f1-5fbdcf0c2d45" RUNNING Not Ready Unreachable
│   ├── Machine "a92d30cb-e2f3-4394-92c8-d5b3a9fac006" RUNNING Ready
│   └── Machine "b3ea639e-0585-4efb-bf34-2c4fc7471ad4" RUNNING Ready
└── Workers "test-workers" Running Ready (3/3)
    ├── Machine "4c4c4544-0059-4810-8032-b8c04f484232" RUNNING Ready
    ├── Machine "4c4c4544-0059-4c10-8030-b8c04f484232" RUNNING Ready
    └── Machine "4c4c4544-0059-4c10-8032-b8c04f484232" RUNNING Ready
The cluster will stay in this state, and not try to repair itself.
Solution
If a node has been in the "Not Ready Unreachable" state for an extended time (ideally configurable), it should be dropped from the cluster and replaced with a new machine from the appropriate machine class. No interaction should be required from the user. If the dead machine comes back, it should be wiped and returned to the pool for future use.
Alternative Solutions
I tried to remove the machine using the interface, but destroying it from the cluster did not work. The cluster is now left with a dead machine that I cannot remove. I also tried syncing the cluster definition, but that did not change anything.
Notes
This is using the self-hosted version of Omni v0.31.0.