
rolling restart policy

Open djschny opened this issue 7 years ago • 3 comments

There are times when a rolling restart of the entire cluster needs to be done. These include scenarios such as:

  • upgrading to a new version of ES
  • restarting nodes to pick up node-level configuration changes
  • other maintenance activity

In these situations, a rolling restart of all nodes is needed. However, just letting the default Kubernetes Replication Controller follow its standard rollout policy unfortunately won't work for Elasticsearch. Instead, the operator needs to make calls to the Elasticsearch API to both monitor the cluster and make configuration changes as each node is restarted, before moving on to the next.

The detailed procedure can be found in the Elasticsearch reference docs, but below is the general synopsis (a sketch of the corresponding API calls follows the list):

  1. Perform a synced flush
  2. Disable shard allocation
  3. Stop the container for a pod, then bring it back up (potentially with a new ES version)
  4. Monitor the Elasticsearch API and wait for both the cluster to be yellow and the node to be listed in the cluster again.
  5. Re-enable shard allocation and wait for the cluster to go green
  6. Repeat steps two through five for each data node in the cluster.
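For concreteness, here is a minimal Go sketch of that per-node loop. It assumes a plain HTTP client talking to an Elasticsearch endpoint at http://localhost:9200; the putClusterSetting, waitForHealth, and restartOneNode helpers are illustrative names rather than operator code, and the actual pod restart in step 3 is left as a placeholder.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

const esURL = "http://localhost:9200" // assumed cluster endpoint

// putClusterSetting applies a transient cluster setting, e.g. toggling
// cluster.routing.allocation.enable between "none" and "all".
func putClusterSetting(key, value string) error {
	body, _ := json.Marshal(map[string]map[string]string{"transient": {key: value}})
	req, err := http.NewRequest(http.MethodPut, esURL+"/_cluster/settings", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("cluster settings update failed: %s", resp.Status)
	}
	return nil
}

// waitForHealth blocks until the cluster reaches the wanted status (yellow or
// green); Elasticsearch itself enforces the timeout and sets timed_out.
func waitForHealth(status, timeout string) error {
	url := fmt.Sprintf("%s/_cluster/health?wait_for_status=%s&timeout=%s", esURL, status, timeout)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	var health struct {
		Status   string `json:"status"`
		TimedOut bool   `json:"timed_out"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return err
	}
	if health.TimedOut {
		return fmt.Errorf("cluster did not reach %s (currently %s)", status, health.Status)
	}
	return nil
}

// restartOneNode runs steps 1-5 above for a single data node.
func restartOneNode() error {
	// 1. Synced flush (best effort: ES answers 409 if some shards could not sync).
	if resp, err := http.Post(esURL+"/_flush/synced", "application/json", nil); err == nil {
		resp.Body.Close()
	}

	// 2. Disable shard allocation so the stopped node's shards are not rebalanced.
	if err := putClusterSetting("cluster.routing.allocation.enable", "none"); err != nil {
		return err
	}

	// 3. Restart the pod here (e.g. delete it and let its controller recreate it,
	//    possibly with a new Elasticsearch image).

	// 4. Wait for the cluster to reach at least yellow (checking that the node is
	//    listed again, e.g. via _cat/nodes, is omitted in this sketch).
	if err := waitForHealth("yellow", "5m"); err != nil {
		return err
	}

	// 5. Re-enable allocation and wait for green before moving to the next node.
	if err := putClusterSetting("cluster.routing.allocation.enable", "all"); err != nil {
		return err
	}
	return waitForHealth("green", "15m")
}

func main() {
	// 6. In a real loop this would iterate over every data node in the cluster.
	if err := restartOneNode(); err != nil {
		log.Fatal(err)
	}
}
```

The two cluster-settings calls plus the health poll are the only Elasticsearch-side coordination each restart needs; everything else is ordinary pod lifecycle handling.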

Semi-related to https://github.com/upmc-enterprises/elasticsearch-operator/issues/17 as well

djschny avatar Jun 07 '17 21:06 djschny

I did this manually on a 10-node cluster (2 master, 3 client, 5 data) in Kubernetes today in order to pick up new certs from a secret... it worked. It seemed like I had to get the masters updated before the other nodes would be happy, since everyone is talking to each other when they come up. Eventually we went green and all was good.

@djschny do you recommend doing the masters first in that scenario (when updating certs)?

ethanwinograd avatar Jul 12 '17 19:07 ethanwinograd

@djschny do you recommend doing the masters first in that scenario (when updating certs)?

Yes @ethanwinograd. Sorry, I should have stated that explicitly in the steps.

djschny avatar Jul 12 '17 20:07 djschny

Hi guys, I'm new to the whole operators world so bear with me. I know that this thread is old, but I'll give it a try: I understand that Kubernetes can decide to shut down a pod and start it on a different node. When it does that to an ES data node, I expect that this operator will stop shard allocation in order to avoid heavy I/O across the cluster, which is the best practice @djschny wrote above for an ES node restart. I noticed that this operator does not do that. Isn't it essential for a big ES cluster? For example: we have a 10-node ES cluster. Whenever one node goes down without shard allocation being disabled, the I/O operations to copy shards around affect query latency badly until the shard re-allocation finishes; this can take dozens of minutes for a stop & start of a single ES node.
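For what it's worth, a small sketch (same assumed http://localhost:9200 endpoint, not operator code) of how that re-allocation traffic can be watched: _cluster/health exposes the relocating, initializing, and unassigned shard counts that stay non-zero until the shuffle finishes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Poll cluster health once; a real check would repeat this on an interval.
	resp, err := http.Get("http://localhost:9200/_cluster/health")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var health struct {
		Status             string `json:"status"`
		RelocatingShards   int    `json:"relocating_shards"`
		InitializingShards int    `json:"initializing_shards"`
		UnassignedShards   int    `json:"unassigned_shards"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		panic(err)
	}
	fmt.Printf("status=%s relocating=%d initializing=%d unassigned=%d\n",
		health.Status, health.RelocatingShards, health.InitializingShards, health.UnassignedShards)
}
```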

Thanks, Lior

lior-k avatar Apr 17 '19 12:04 lior-k