Node shutdown API should handle waiting for ILM to run before marking shutdown as stalled

Open jaymode opened this issue 2 years ago • 2 comments

When using the node shutdown API, we have observed that the API will indicate that shutdown is stalled due to a shard not being able to move. This can happen after several hours of progress on a busy cluster.

Currently, the expectation is that the caller observes the stalled status and then waits some reasonable amount of time before giving up on the shutdown making progress. One reason for waiting is to give ILM enough time to potentially address any issues that would prevent shards from migrating. This has the unfortunate side effect of pushing more logic onto the callers of the API. If Elasticsearch handled this itself, we would be much more likely to know whether the shard migration will ever be un-stalled, and could make better decisions about how long to wait.
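To make the caller-side burden concrete, here is a minimal sketch of the kind of wait loop described above. It assumes a hypothetical `fetch_status` callable that wraps `GET _nodes/<node_id>/shutdown` and returns the reported `status` string; the grace period and poll interval are illustrative values, not recommendations, and the status names are assumed to be `IN_PROGRESS`, `STALLED`, and `COMPLETE`.

```python
import time
from typing import Callable, Optional


def wait_for_shutdown(
    fetch_status: Callable[[], str],
    stall_grace_seconds: float = 1800.0,
    poll_interval_seconds: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a node's shutdown status until it completes or stays stalled too long.

    fetch_status is a hypothetical helper supplied by the caller; it is assumed
    to return the shutdown status string for one node.
    Returns True on COMPLETE, False if STALLED persists for the whole grace period.
    """
    stalled_since: Optional[float] = None
    while True:
        status = fetch_status()
        if status == "COMPLETE":
            return True
        if status == "STALLED":
            now = clock()
            if stalled_since is None:
                # First observation of this stall: start the grace timer.
                stalled_since = now
            elif now - stalled_since >= stall_grace_seconds:
                # Still stalled after the grace period: give up.
                return False
        else:
            # Any other status counts as progress (for example after ILM has
            # acted), so the grace timer is reset.
            stalled_since = None
        sleep(poll_interval_seconds)
```

Every caller has to pick its own grace period and reset rule in a loop like this, which is the logic this issue proposes moving into Elasticsearch.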

jaymode avatar Aug 19 '22 14:08 jaymode

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine avatar Aug 19 '22 14:08 elasticsearchmachine

We'll bring this up in the fix-it meeting.

grcevski avatar Sep 21 '22 15:09 grcevski

We encountered another situation in which ES reported a shutdown as STALLED prematurely. In this case, a brief network outage caused the replacement target node to leave the cluster. It rejoined the cluster a few seconds later, but while it was gone there was nowhere to allocate the shards on the replacement source node, which resulted in a STALLED state.

DaveCTurner avatar Sep 30 '22 07:09 DaveCTurner

Digging into this, I think the behavior @DaveCTurner describes above should be a separate issue, and is possibly expected behavior. If a node is actually offline, then the migration is indeed stalled: it can't happen while the target node is offline, which is outside of the cluster's control and may well need the intervention of another system or a human, which is how we originally defined STALLED. We would need to decide what criteria we'd want if we redefine STALLED for that situation, since there's no way to know up front what's a brief network outage vs. an actual node or connection failure that needs to be repaired.

gwbrown avatar Aug 02 '23 21:08 gwbrown

> We would need to decide what criteria we'd want if we redefine STALLED for that situation, since there's no way to know up front what's a brief network outage vs. an actual node or connection failure that needs to be repaired.

Agreed. My point is that we need that logic somewhere (in ES, in the control plane, or in some SRE runbook), and we should generally prefer to put it in ES. We already have various other opinions about different lengths of network outage (30s timeout on health checks, 10s timeout on handshakes, etc.).

DaveCTurner avatar Aug 03 '23 07:08 DaveCTurner