
Enforce startup sequence

Open stevesloka opened this issue 7 years ago • 11 comments

Some users have seen issues where data nodes fail if they come up before the masters are initialized. The operator should handle this for the user and verify each component is healthy before starting the next.

This could be custom logic in the operator or readiness probes.

stevesloka avatar May 18 '17 13:05 stevesloka

Correct, the desired startup sequence of a cluster would be as follows:

  • Start master node(s) and wait until the Elasticsearch API says all expected master node(s) are in the cluster
  • Start data nodes and wait for them to all join the cluster
  • Start any client nodes (if there are any)

If the entire cluster is being shut down, the order is reversed: client nodes first, then data nodes, then master nodes.

djschny avatar Jun 07 '17 21:06 djschny
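The ordered startup described above can be sketched as follows. This is a hypothetical illustration, not the operator's actual code: `start` and `count_in_cluster` are stand-ins for launching a node group and for querying the Elasticsearch API (e.g. _cluster/health) for how many nodes of a given role have joined.

```python
import time

def wait_for(count_in_cluster, expected, timeout=300, interval=5):
    """Poll until `expected` nodes have joined the cluster, or raise."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if count_in_cluster() >= expected:
            return True
        time.sleep(interval)
    raise TimeoutError("nodes did not join the cluster in time")

def start_cluster(start, count_in_cluster, masters, data, clients):
    # Masters first: the cluster cannot form without them.
    start("master", masters)
    wait_for(lambda: count_in_cluster("master"), masters)
    # Data nodes next, once the masters are up.
    start("data", data)
    wait_for(lambda: count_in_cluster("data"), data)
    # Client nodes last, if there are any.
    if clients:
        start("client", clients)
        wait_for(lambda: count_in_cluster("client"), clients)
```

Shutting down would walk the same roles in reverse, with each step gated on the previous group having left the cluster.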

cc @munnerz

mattbates avatar Oct 04 '17 15:10 mattbates

Hey @djschny, if we start the master nodes first and the client nodes last, how do we check for cluster health? Right now I query the clients for cluster health, but in your list they come up last.

Do I enable the http interface on all nodes and query passing local=true? (https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html)

stevesloka avatar Oct 11 '17 18:10 stevesloka

Something recently added to k8s in 1.8: Pod Priority and Preemption

pieterlange avatar Oct 12 '17 21:10 pieterlange

For an administrative task like this you will want to query the master node(s).

djschny avatar Oct 12 '17 22:10 djschny

HTTP should always be enabled on all nodes.

djschny avatar Oct 12 '17 22:10 djschny
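Putting the two comments above together, a minimal health check against a master node's HTTP interface might look like this. The host name used in the example is hypothetical, and `cluster_health` performs a real HTTP request, so it only works against a reachable node.

```python
import json
from urllib.request import urlopen

def health_url(host, port=9200):
    """Build the _cluster/health URL for a given node."""
    return f"http://{host}:{port}/_cluster/health"

def cluster_health(host, port=9200, timeout=10):
    """GET cluster health from the node's HTTP interface."""
    with urlopen(health_url(host, port), timeout=timeout) as resp:
        return json.load(resp)
```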

Is _cluster/health the right endpoint that you would use?

stevesloka avatar Oct 12 '17 23:10 stevesloka

It depends upon what level of error reporting and handling you want when something goes wrong. _cluster/health will get you what is required via the number_of_data_nodes field. However, if you don't reach that number in a reasonable amount of time, and you want to log a Kubernetes event stating which node/pod is not coming online correctly, or take further action on the particular one that is missing, then you'll want something more detailed like _nodes.

Hope that makes sense?

djschny avatar Oct 13 '17 01:10 djschny
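The two levels of detail described above can be sketched like this. The sample responses are shaped like the _cluster/health and _nodes APIs, but the expected node names are made up for illustration.

```python
def data_nodes_ready(health, expected):
    """Coarse check: compare number_of_data_nodes from _cluster/health."""
    return health.get("number_of_data_nodes", 0) >= expected

def missing_nodes(nodes_response, expected_names):
    """Finer check via _nodes: which expected nodes have not joined?"""
    joined = {n["name"] for n in nodes_response.get("nodes", {}).values()}
    return sorted(set(expected_names) - joined)
```

The first function is enough to gate the startup sequence; the second tells you which specific pod to report an event for.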

Yes that helps @djschny, thanks again for the feedback!

stevesloka avatar Oct 13 '17 01:10 stevesloka

This would be a great addition. I'm facing non-deterministic issues when launching a fresh cluster. It seems that when the data nodes start before the master nodes, the failing health checks cause Kubernetes to restart the nodes indefinitely. In that situation, the only way to recover is to recreate the cluster.

sebastianvoss avatar Sep 23 '18 22:09 sebastianvoss

Thanks for the feedback! This is at the top of my list for making this a fully functional operator. Going to try to get this working very soon.

stevesloka avatar Sep 23 '18 23:09 stevesloka