terraform-provider-k0s

Node removal not supported, however it "works" in an unexpected manner

Open danielskowronski opened this issue 2 years ago • 1 comments

Some time ago, k0sctl added support for node removal.

This provider calls the phase needed to reset controllers, but it doesn't prepare the hosts list so that hosts can actually be removed. The ClusterResourceModelHost data structure lacks a Reset field, and there's no logic to translate a host's removal from state into a flag update that the phase manager can pick up.
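A minimal sketch of the missing translation step, assuming the IP address is the host's identity. The `Host` type here is a simplified, hypothetical stand-in for ClusterResourceModelHost, and `MarkRemovedHosts` is an illustrative name, not something that exists in the provider:

```go
package main

// Host is a hypothetical, simplified stand-in for ClusterResourceModelHost.
// The real structure has more fields; only what the sketch needs is shown.
type Host struct {
	Address string // IP address, assumed to be the host's unique ID
	Role    string // e.g. "controller" or "worker"
	Reset   bool   // the flag the issue describes as missing
}

// MarkRemovedHosts returns the hosts list to hand to the phase manager:
// every planned host unchanged, followed by every host that exists in
// state but not in the plan, with Reset set to true.
func MarkRemovedHosts(state, plan []Host) []Host {
	planned := make(map[string]bool, len(plan))
	for _, h := range plan {
		planned[h.Address] = true
	}

	out := make([]Host, 0, len(state)+len(plan))
	out = append(out, plan...)

	for _, h := range state {
		if !planned[h.Address] {
			h.Reset = true // h is a copy, so the caller's state slice is untouched
			out = append(out, h)
		}
	}
	return out
}
```

With state holding 10.0.0.1 and 10.0.0.2 and the plan holding only 10.0.0.1, the result keeps 10.0.0.1 as is and appends 10.0.0.2 with Reset set. Note that this does not detect a replaced VM that reuses the same IP, since the address alone is used as the key.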

This is quite problematic when, after removal, a new host is added with the same IP, because the IP is the unique ID for many k0s structures: the result is a split-brain. The cluster still tries to connect to the new VM using the IP that was never removed (mainly from etcd), while the new VM is stuck in the cluster init phase but serves requests immediately. Control-plane HA requires a load balancer, so without sophisticated checks it can easily end up serving two clusters at the same time.

As per the docs, the workaround seems to be to manually run `k0s etcd leave --peer-address IP_ADDR` on a live node. In most cases that is the node we want to delete, but it gets tricky when we're rebuilding a crashed VM. Even more so because destroy-time provisioners in TF only run on a clean destroy, not even with taint.
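If the provider were to automate this workaround, one small piece would be building the command line for the departed peer. The sketch below only assembles and validates the arguments for the command quoted above; `EtcdLeaveArgs` is a hypothetical helper name, and actually running the command on a live controller (for example over SSH) is left out:

```go
package main

import (
	"fmt"
	"net"
)

// EtcdLeaveArgs builds the argv for the manual workaround command
// "k0s etcd leave --peer-address IP_ADDR". It rejects anything that
// does not parse as an IP address, so a typo cannot reach the cluster.
func EtcdLeaveArgs(peerAddr string) ([]string, error) {
	if net.ParseIP(peerAddr) == nil {
		return nil, fmt.Errorf("invalid peer address %q", peerAddr)
	}
	return []string{"k0s", "etcd", "leave", "--peer-address", peerAddr}, nil
}
```

The arguments are returned as a slice instead of a single string so that a caller can pass them to a process runner without shell quoting concerns.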

danielskowronski avatar Nov 13 '23 10:11 danielskowronski

This is not yet supported in k0sctl itself - https://github.com/k0sproject/k0sctl/issues/603

danielskowronski avatar Dec 04 '23 17:12 danielskowronski