
Enforce startup sequence

Open stevesloka opened this issue 7 years ago • 11 comments

Some users have seen issues where data nodes fail if they come up before the masters are initialized. The operator should handle this for the user and verify each component is healthy before starting the next.

This could be custom logic in the operator or readiness probes.

stevesloka avatar May 18 '17 13:05 stevesloka

Correct, the desired startup sequence of a cluster would be as follows:

  • Start master node(s) and wait until the Elasticsearch API says all expected master node(s) are in the cluster
  • Start data nodes and wait for them to all join the cluster
  • Start any client nodes (if there are any)

If the entire cluster is being shut down, the order is reversed: client nodes first, then data nodes, then master nodes.

djschny avatar Jun 07 '17 21:06 djschny
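The ordered startup described above can be sketched as follows. This is a hypothetical illustration, not the operator's actual code: `start` and `count_in_cluster` are stand-ins for launching a node group and for querying the Elasticsearch API (e.g. _cluster/health) for how many nodes of a given role have joined.

```python
import time

def wait_for(count_in_cluster, expected, timeout=300, interval=5):
    """Poll until `expected` nodes have joined the cluster, or raise."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if count_in_cluster() >= expected:
            return True
        time.sleep(interval)
    raise TimeoutError("nodes did not join the cluster in time")

def start_cluster(start, count_in_cluster, masters, data, clients):
    # Masters first: the cluster cannot form without them.
    start("master", masters)
    wait_for(lambda: count_in_cluster("master"), masters)
    # Data nodes next, once the masters are up.
    start("data", data)
    wait_for(lambda: count_in_cluster("data"), data)
    # Client nodes last, if there are any.
    if clients:
        start("client", clients)
        wait_for(lambda: count_in_cluster("client"), clients)
```

Shutting down would walk the same roles in reverse, with each step gated on the previous group having left the cluster.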

cc @munnerz

mattbates avatar Oct 04 '17 15:10 mattbates

Hey @djschny, if we start the master nodes first and the client nodes last, how do we check for cluster health? Right now I query the clients for cluster health, but in your list they come up last.

Do I enable the http interface on all nodes and query passing local=true? (https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html)

stevesloka avatar Oct 11 '17 18:10 stevesloka

Something recently added to k8s in 1.8: Pod Priority and Preemption

pieterlange avatar Oct 12 '17 21:10 pieterlange

For an administrative task like this you will want to query the master node(s).

djschny avatar Oct 12 '17 22:10 djschny

HTTP should always be enabled on all nodes.

djschny avatar Oct 12 '17 22:10 djschny
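Putting the two comments above together, a minimal health check against a master node's HTTP interface might look like this. The host name used in the example is hypothetical, and `cluster_health` performs a real HTTP request, so it only works against a reachable node.

```python
import json
from urllib.request import urlopen

def health_url(host, port=9200):
    """Build the _cluster/health URL for a given node."""
    return f"http://{host}:{port}/_cluster/health"

def cluster_health(host, port=9200, timeout=10):
    """GET cluster health from the node's HTTP interface."""
    with urlopen(health_url(host, port), timeout=timeout) as resp:
        return json.load(resp)
```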

Is _cluster/health the right endpoint that you would use?

stevesloka avatar Oct 12 '17 23:10 stevesloka

It depends upon what level of error reporting and handling you want when something goes wrong. _cluster/health will get you what is required via the number_of_data_nodes field. However, if you don't reach that number in a reasonable amount of time, and you want to log a Kubernetes event stating which node/pod is not coming online correctly, or take further action on the particular one that is missing, then you'll want something more detailed like _nodes.

Hope that makes sense?

djschny avatar Oct 13 '17 01:10 djschny
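The two levels of detail described above can be sketched like this. The sample responses are shaped like the _cluster/health and _nodes APIs, but the expected node names are made up for illustration.

```python
def data_nodes_ready(health, expected):
    """Coarse check: compare number_of_data_nodes from _cluster/health."""
    return health.get("number_of_data_nodes", 0) >= expected

def missing_nodes(nodes_response, expected_names):
    """Finer check via _nodes: which expected nodes have not joined?"""
    joined = {n["name"] for n in nodes_response.get("nodes", {}).values()}
    return sorted(set(expected_names) - joined)
```

The first function is enough to gate the startup sequence; the second tells you which specific pod to report an event for.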

Yes that helps @djschny, thanks again for the feedback!

stevesloka avatar Oct 13 '17 01:10 stevesloka

This would be a great addition. I'm facing non-deterministic issues when launching a fresh cluster. It seems that when the data nodes start before the master nodes, the failing health checks cause Kubernetes to restart the nodes indefinitely. In that situation, the only way to recover is to recreate the cluster.

sebastianvoss avatar Sep 23 '18 22:09 sebastianvoss

Thanks for the feedback! This is at the top of my list for making this a fully functional operator. Going to try to get this working very soon.

stevesloka avatar Sep 23 '18 23:09 stevesloka