Removing nodes that can't join the cluster

Open barkbay opened this issue 1 year ago • 1 comment

It is not possible for the operator to remove nodes which either:

  • Are never going to join the cluster (for example because of a capacity issue)
  • Used to be part of the cluster, but for some reason can no longer start or be created.

The reason is that the operator must first retrieve the node id in order to call the shutdown API:

	// For each node leaving the cluster, the operator first resolves the
	// Elasticsearch node ID so it can call the shutdown API for that node.
	for _, node := range leavingNodes {
		nodeID, err := ns.lookupNodeID(node)
		if err != nil {
			return err
		}
		// ... the node shutdown API is then called with nodeID ...
	}

But if the node cannot join the cluster, that ID cannot be retrieved, and consequently the operator cannot use the shutdown API.
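
To make the constraint concrete, here is a rough sketch of what such a lookup could look like (illustrative only, not the actual ECK implementation; the helper name, URL handling and error message are assumptions). The `_nodes` API only lists nodes that are members of the cluster, so a Pod whose node never joined simply does not appear in the response:

	// Illustrative sketch only, not the actual ECK implementation.
	package sketch

	import (
		"encoding/json"
		"fmt"
		"net/http"
	)

	// nodesResponse mirrors the part of the _nodes API response we care about:
	// cluster members keyed by their node ID.
	type nodesResponse struct {
		Nodes map[string]struct {
			Name string `json:"name"`
		} `json:"nodes"`
	}

	// lookupNodeID resolves a node name (the Pod name) to an Elasticsearch node ID
	// by querying the _nodes API of the cluster reachable at esURL.
	func lookupNodeID(esURL, nodeName string) (string, error) {
		resp, err := http.Get(esURL + "/_nodes")
		if err != nil {
			return "", err
		}
		defer resp.Body.Close()

		var nodes nodesResponse
		if err := json.NewDecoder(resp.Body).Decode(&nodes); err != nil {
			return "", err
		}
		for id, node := range nodes.Nodes {
			if node.Name == nodeName {
				return id, nil
			}
		}
		// A Pod that is Pending, or whose node never joined, always ends up here.
		return "", fmt.Errorf("node %s currently not member of the cluster", nodeName)
	}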

Symptoms of this issue are:

  • The Pod cannot be scheduled (Pending), start, or join the cluster.
  • The following error is printed endlessly while the user wants to downscale or remove the nodeSet:
node xxxx-es-xxxx-0 currently not member of the cluster

This situation was already discussed in this thread, but we never concluded on the most appropriate behaviour in that case.

How to solve this?

Data integrity should be one of the operator's top priorities; therefore we should skip the shutdown API and remove the node if and only if we are confident that this will not result in data loss.

There are a few situations where I think this can be done (see the sketch after this list):

  • The PVC is not bound (though we may hit a race condition with the client cache: the PV is actually bound, but this is not yet reflected in the cache).
  • The cluster is green and all shards are allocated, so it should be safe to downscale. This is maybe the most promising solution?
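
A minimal sketch of that decision, assuming we have the cached PVC and the result of the `_cluster/health` API at hand (the types, field names and the `canSkipShutdownAPI` helper are illustrative, not actual ECK code):

	// Sketch of the decision proposed above, not actual ECK code: bypass the node
	// shutdown API only when we are reasonably confident no data can be lost.
	package sketch

	import corev1 "k8s.io/api/core/v1"

	// ClusterHealth mirrors the relevant fields of the _cluster/health response.
	type ClusterHealth struct {
		Status           string // "green", "yellow" or "red"
		UnassignedShards int
		RelocatingShards int
	}

	// canSkipShutdownAPI returns true when deleting the node without calling the
	// shutdown API should not result in data loss.
	func canSkipShutdownAPI(pvc *corev1.PersistentVolumeClaim, health ClusterHealth) bool {
		// Case 1: the PVC never reached the Bound phase, so the node cannot hold any
		// data. Beware of the client cache race mentioned above: the PV may already
		// be bound even if the cached PVC object does not reflect it yet.
		if pvc != nil && pvc.Status.Phase == corev1.ClaimPending {
			return true
		}
		// Case 2: the cluster is green and every shard is allocated elsewhere, so
		// removing this (absent) node cannot leave a shard without a copy.
		if health.Status == "green" && health.UnassignedShards == 0 && health.RelocatingShards == 0 {
			return true
		}
		return false
	}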

An alternative would be to improve the shutdown API so we can use the node external id: https://github.com/elastic/elasticsearch/issues/88222

Workaround

In the meantime, I think the only workaround is to manually and gradually downscale the nodeSet by reducing the underlying StatefulSet size until the operator can recover. Note that this solution is not ideal, as we want StatefulSets to remain an implementation detail, not something directly handled by the user.
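
For illustration, assuming an Elasticsearch cluster named `quickstart` with a nodeSet named `default` (so a StatefulSet named `quickstart-es-default`; names and the target size are hypothetical), the manual downscale could look roughly like this:

	# Reduce the StatefulSet size by one; repeat gradually until the operator recovers.
	kubectl scale statefulset quickstart-es-default --replicas=2

	# Check that the stuck Pod is gone and the remaining Pods are Running.
	kubectl get pods -l elasticsearch.k8s.elastic.co/cluster-name=quickstart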

barkbay avatar Jun 17 '24 07:06 barkbay

@barkbay I agree with all of your points above. While browsing our documentation recently I found these instructions:

https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-common-problems.html#k8s-common-problems-scale-down

They currently target one specific problem, but the workaround goes in the same direction as a potential workaround for the problem discussed here. I wonder if we should add similar instructions for this problem?

pebrc avatar Jun 17 '24 13:06 pebrc