
metropolis: implement removing nodes

Open fionera opened this issue 2 years ago • 8 comments

Currently there is no way to remove a node from a cluster

fionera avatar Aug 31 '23 12:08 fionera

There are two things here:

  1. A proper, safe decommissioning flow, including draining, Kubernetes state sync, etc.
  2. A way to just remove dead nodes and generally 'clean up'.

We mostly need the latter, so that's what we should probably implement first. Then the decommissioning flow will include the latter, either automatically or as part of the process to be done by operators.

q3k avatar Sep 05 '23 12:09 q3k

(good first issue for the 'clean up' part)

q3k avatar Sep 05 '23 12:09 q3k

Isn't this already implemented?

fionera avatar Jun 21 '24 21:06 fionera

Partially - we can remove everything but control plane nodes.

q3k avatar Jul 02 '24 12:07 q3k

We need this to be done by mid-September.

fionera avatar Jul 04 '24 12:07 fionera

With https://review.monogon.dev/c/monogon/+/3343 merged, we will be able to safely and fully remove nodes from consensus.

A full decommissioning flow is blocked by SPIFFE support and a general auth refactor, as that work mostly revolves around distributing revocation lists to make sure a removed node can't reconnect to the cluster. However, we can still remove non-decommissioned nodes by setting DeleteNodeRequest.SafetyBypassNotDecommissioned.
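
For illustration, a minimal sketch of what such a call could look like from a Go management client follows. Apart from DeleteNodeRequest.SafetyBypassNotDecommissioned, which is mentioned above, every identifier here (import path, client type, RPC and field names, field shapes) is an assumption and may not match the actual Metropolis API.

// Hypothetical sketch only: except for SafetyBypassNotDecommissioned,
// the identifiers below are placeholders and may not match the real
// Metropolis management API.
package nodecleanup

import (
	"context"

	apb "source.monogon.dev/metropolis/proto/api" // assumed import path
)

// forceDeleteNode removes a node that was never properly decommissioned
// by bypassing the decommissioning safety check.
func forceDeleteNode(ctx context.Context, mgmt apb.ManagementClient, nodeID string) error {
	_, err := mgmt.DeleteNode(ctx, &apb.DeleteNodeRequest{
		Node:                          nodeID, // assumed node selector field
		SafetyBypassNotDecommissioned: true,   // shown as a bool purely for illustration
	})
	return err
}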

q3k avatar Aug 21 '24 14:08 q3k

After removing a node's etcd membership, etcd on that node panics, which takes down the entire node. The panic is in IsLocalMemberLearner; stack trace:

go.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).IsLocalMemberLearner(0xc000879380)
	external/gazelle~~go_deps~io_etcd_go_etcd_server_v3/etcdserver/api/membership/cluster.go:859 +0x26d
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).IsLearner(0xc001020c18?)
	external/gazelle~~go_deps~io_etcd_go_etcd_server_v3/etcdserver/server.go:2813 +0x1a
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*maintenanceServer).Status(0xc001020bb0, {0xc001d38638?, 0xc001d38638?}, 0xc001d38690?)
	external/gazelle~~go_deps~io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/maintenance.go:228 +0x137
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*authMaintenanceServer).Status(0x4a5c340?, {0x3122cf8?, 0xc0026f2f00?}, 0x2d0ebd8?)
	external/gazelle~~go_deps~io_etcd_go_etcd_server_v3/etcdserver/api/v3rpc/maintenance.go:306 +0x25

I'm not sure how to fix the panic.

  • We could just return false in IsLocalMemberLearner instead of panicking. This is technically correct in that a node that is no longer a member is also not a learner anymore, but clients might not expect this.
  • We could change the function to return (isLearner bool, ok bool) or similar to force clients to handle the case where the local node is not a member. But this is a breaking change in a public interface and thus also not ideal.
  • Some existing callers of IsLearner first check IsMemberExist to avoid the panic, but that does not completely solve the problem because a member can still be removed between the two calls (see the sketch after this list).
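
To make the race in the last point concrete, here is a small sketch (not etcd's actual code, and not a proposed patch) of a guard built from the existing RaftCluster methods; it narrows the window but cannot close it:

// Sketch: wrapping the existing etcd membership APIs to illustrate why
// an IsMemberExist pre-check does not fully prevent the panic.
package learnercheck

import (
	"go.etcd.io/etcd/client/pkg/v3/types"
	"go.etcd.io/etcd/server/v3/etcdserver/api/membership"
)

// isLocalLearnerNoPanic treats a removed local member as "not a learner".
func isLocalLearnerNoPanic(cl *membership.RaftCluster, localID types.ID) bool {
	if !cl.IsMemberExist(localID) {
		// The local node is no longer a member, so it is not a learner.
		return false
	}
	// The member can still be removed right here, between the check and
	// the call below, in which case IsLocalMemberLearner panics anyway.
	return cl.IsLocalMemberLearner()
}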

Another way to improve the situation is to move etcd into a separate process such that a panic in etcd does not restart the entire machine. This is #349.

jscissr avatar Sep 10 '24 15:09 jscissr

Removing consensus nodes is now possible with these changes merged:

https://review.monogon.dev/c/monogon/+/3437
https://review.monogon.dev/c/monogon/+/3438
https://review.monogon.dev/c/monogon/+/3454

There are still some rough edges that could be improved:

Bootstrap node does not stop consensus

The bootstrap node will not stop consensus if it has never rebooted and the consensus role is removed, because the bootstrap data takes priority over the role. https://github.com/monogon-dev/monogon/blob/d5538b52d7a8739f7123458c10973be36b27b9ff/metropolis/node/core/roleserve/worker_controlplane.go#L134-L136
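
Illustrative only: the effective decision roughly amounts to something like the sketch below (hypothetical names, not the actual roleserver code), which is why removing the role alone has no effect until the bootstrap node reboots.

// Hypothetical simplification of the priority described above; names and
// structure do not match the real worker_controlplane.go code.
func wantsConsensus(haveBootstrapData, hasConsensusRole bool) bool {
	if haveBootstrapData {
		// Bootstrap data takes priority over the node's role set, so a
		// removed consensus role only takes effect after a reboot.
		return true
	}
	return hasConsensusRole
}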

Move leadership

When removing the consensus role from a node which is currently the etcd or curator leader, there is some downtime until a new leader is elected. It would be nicer to first move leadership to another node in this case. The challenge is that maintenance.MoveLeader can only be called on the etcd leader node, and etcd leadership is currently independent of curator leadership. We could solve that by making curator leadership follow etcd leadership, which might have performance benefits as well. See the discussion in https://review.monogon.dev/c/monogon/+/3437
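
As a sketch of what the handover could look like using the etcd client API, assuming we already have a clientv3.Client connected to the current leader and know the member ID of another voting member to transfer to:

// Sketch: hand etcd leadership to another member before removing the
// consensus role, avoiding the leader-election downtime described above.
package leadermove

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func moveLeadershipAway(ctx context.Context, cli *clientv3.Client, transfereeID uint64) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	// MoveLeader must be issued against the current leader; a non-leader
	// member rejects the request.
	_, err := cli.MoveLeader(ctx, transfereeID)
	return err
}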

jscissr avatar Sep 26 '24 14:09 jscissr