More discriminating `RESTART` shutdown logic

Open DaveCTurner opened this issue 10 months ago • 3 comments

In a rolling restart we recommend users wait for the cluster health to reach green in between node restarts, and some users will also wait for rebalancing to complete each time. This is unnecessarily conservative: it's safe to restart a node while the cluster health is still yellow after the previous restart as long as the initializing shards are unrelated to the shards on the node that is to be restarted next.

It's not reasonable to ask users to compute when it's safe to restart a node themselves, but nor is it especially reasonable to wait for green health after each node since this may extend the restart time by hours or even days in a large cluster. I believe the shutdown API should be able to solve this by reporting `shardMigrationStatus == COMPLETE` on a `RESTART` shutdown when all the shards on the target node are fully replicated. That's different from today's behaviour in which a `RESTART` shutdown has `shardMigrationStatus == COMPLETE` immediately, forcing users to use other APIs (e.g. cluster health) to wait as necessary.
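As a rough illustration of the condition described above, here is a minimal, self-contained sketch. It is not Elasticsearch code: `RestartSafety`, `ShardCopy`, `CopyState`, `MigrationStatus`, and both methods are hypothetical names, and it assumes that "fully replicated" means every copy of every shard hosted on the target node is started. Reporting `IN_PROGRESS` while waiting is also an assumption; the issue does not say which status would be used.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class RestartSafety {

    /** Hypothetical, simplified shard-copy states; not the real Elasticsearch enum. */
    enum CopyState { STARTED, INITIALIZING, UNASSIGNED }

    /** Hypothetical status values borrowing the names used in this discussion. */
    enum MigrationStatus { IN_PROGRESS, COMPLETE }

    /** One copy (primary or replica) of a shard; nodeId is null when unassigned. */
    static final class ShardCopy {
        final String shardKey; // e.g. "logs[0]"
        final String nodeId;
        final CopyState state;

        ShardCopy(String shardKey, String nodeId, CopyState state) {
            this.shardKey = shardKey;
            this.nodeId = nodeId;
            this.state = state;
        }
    }

    /**
     * Assumption: a node is safe to restart when every copy of every shard it
     * hosts is STARTED. Shards with no copy on the target node are ignored,
     * so the cluster may still be yellow because of unrelated recoveries.
     */
    static boolean isSafeToRestart(String targetNode, List<ShardCopy> copies) {
        Set<String> shardsOnTarget = copies.stream()
            .filter(c -> targetNode.equals(c.nodeId))
            .map(c -> c.shardKey)
            .collect(Collectors.toSet());
        return copies.stream()
            .filter(c -> shardsOnTarget.contains(c.shardKey))
            .allMatch(c -> c.state == CopyState.STARTED);
    }

    /** Maps the safety check onto a status, as the proposal suggests for RESTART. */
    static MigrationStatus restartStatus(String targetNode, List<ShardCopy> copies) {
        return isSafeToRestart(targetNode, copies)
            ? MigrationStatus.COMPLETE
            : MigrationStatus.IN_PROGRESS;
    }

    public static void main(String[] args) {
        // node-1 has just restarted, so its copy of logs[1] is still recovering.
        List<ShardCopy> copies = List.of(
            new ShardCopy("logs[0]", "node-1", CopyState.STARTED),
            new ShardCopy("logs[0]", "node-2", CopyState.STARTED),
            new ShardCopy("logs[1]", "node-1", CopyState.INITIALIZING),
            new ShardCopy("logs[1]", "node-3", CopyState.STARTED)
        );
        System.out.println("node-2: " + restartStatus("node-2", copies)); // node-2: COMPLETE
        System.out.println("node-3: " + restartStatus("node-3", copies)); // node-3: IN_PROGRESS
    }
}
```

In this example the cluster would be yellow because `logs[1]` is still initializing on `node-1`, yet `node-2` only holds `logs[0]` and could be restarted straight away, whereas `node-3` holds the only started copy of `logs[1]` and has to wait.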

DaveCTurner avatar Apr 25 '24 15:04 DaveCTurner

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine avatar Apr 25 '24 15:04 elasticsearchmachine

Hey @DaveCTurner, as I understand it, we want shardMigrationStatus to change according to different conditions (STALLED, IN_PROGRESS, COMPLETE, NOT_STARTED), and we also do not want shardMigrationStatus to be set unconditionally when a RESTART shutdown (shutdownType) is triggered. Looking at the code:

if (SingleNodeShutdownMetadata.Type.RESTART.equals(shutdownType)) {
    return new ShutdownShardMigrationStatus(
        SingleNodeShutdownMetadata.Status.COMPLETE,
        0,
        "no shard relocation is necessary for a node restart",
        null
    );
}

Here we mark the status as COMPLETE whenever shutdownType is RESTART. If this condition were removed, I think the code would behave the way we want, updating the status based on the other conditions (correct me if I am wrong). Am I missing something? Any pointers? TIA.

prathm3 avatar May 03 '24 18:05 prathm3

Hi @prathm3, thanks for your interest here. I'm not sure I understand your question, but are you asking because you're interested in contributing a solution? This is quite a subtle issue and needs some discussion by the team before we decide on a path forwards. I wouldn't recommend working on this area for now.

DaveCTurner avatar May 03 '24 18:05 DaveCTurner