elasticsearch-operator
Enforce startup sequence
Some users have seen issues where, if the master nodes aren't initialized yet and the data nodes come up first, the data nodes fail. The operator should handle this for the user and verify each component is healthy before starting the next.
This could be custom logic in the operator or readiness probes.
Correct, the desired startup sequence of a cluster would be as follows:
- Start master node(s) and wait until the Elasticsearch API says all expected master node(s) are in the cluster
- Start data nodes and wait for them to all join the cluster
- Start any client nodes (if there are any)
If the entire cluster is being shut down, the reverse order applies: client nodes, then data nodes, then master nodes. A rough sketch of the first wait step (masters) is below.
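As a minimal sketch of that first wait step, here is what polling for the expected master-eligible nodes could look like in Go, using GET /_nodes/master:true. The Service URL, expected count, and helper name are illustrative assumptions, not the operator's actual code:

```go
// Sketch only: the Service URL, expected count, and helper name are illustrative.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// waitForMasters polls GET /_nodes/master:true until at least `expected`
// master-eligible nodes have joined the cluster, or the timeout expires.
func waitForMasters(esURL string, expected int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(esURL + "/_nodes/master:true")
		if err == nil {
			var body struct {
				Nodes struct {
					Total int `json:"total"` // master-eligible nodes that responded
				} `json:"_nodes"`
			}
			decodeErr := json.NewDecoder(resp.Body).Decode(&body)
			resp.Body.Close()
			if decodeErr == nil && body.Nodes.Total >= expected {
				return nil
			}
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("timed out waiting for %d master node(s)", expected)
}

func main() {
	// e.g. after creating the master StatefulSet, before creating the data nodes
	if err := waitForMasters("http://elasticsearch-master:9200", 3, 5*time.Minute); err != nil {
		fmt.Println(err)
	}
}
```

The same kind of loop could then gate the client nodes on the data tier joining.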
cc @munnerz
Hey @djschny, if we start the master nodes first and the client nodes last, how do we check for cluster health? Right now I can query the clients for cluster health, but in your list they come up last.
Do I enable the HTTP interface on all nodes and query them, passing local=true? (https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html)
Something recently added to k8s in 1.8: Pod Priority and Preemption
In a scenario like this, for administrative tasks, you will want to query the master node(s).
HTTP should always be enabled on all nodes.
Is _cluster/health the right endpoint to use?
It depends upon what level of error reporting and handling you want when something goes wrong. _cluster/health will get you what is required via the number_of_data_nodes field. However, if you don't reach that number in a reasonable amount of time and want to log an event in Kubernetes stating which node/pod is not coming online correctly, or want to take further action on the particular one that is missing, then you'll want something more detailed like _nodes.
Hope that makes sense?
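To make that concrete, here is a minimal sketch in Go of checking number_of_data_nodes from _cluster/health, with a note on where _nodes would come in for more detail. The helper name and URL are illustrative assumptions, not the operator's actual code:

```go
// Sketch only: checkDataNodes and the URL used in main are illustrative.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// clusterHealth captures the _cluster/health fields used here.
type clusterHealth struct {
	Status            string `json:"status"`
	NumberOfDataNodes int    `json:"number_of_data_nodes"`
}

// checkDataNodes queries GET /_cluster/health once and reports whether the
// expected number of data nodes has joined the cluster.
func checkDataNodes(esURL string, expected int) (bool, error) {
	resp, err := http.Get(esURL + "/_cluster/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var health clusterHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return false, err
	}
	if health.NumberOfDataNodes >= expected {
		return true, nil
	}
	// Not all data nodes have joined. To report *which* pod is missing
	// (e.g. in a Kubernetes event), a follow-up GET /_nodes would list the
	// nodes that did join, so the missing one can be inferred by name.
	fmt.Printf("only %d of %d data nodes joined (cluster status: %s)\n",
		health.NumberOfDataNodes, expected, health.Status)
	return false, nil
}

func main() {
	ok, err := checkDataNodes("http://elasticsearch-master:9200", 3)
	fmt.Println(ok, err)
}
```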
Yes that helps @djschny, thanks again for the feedback!
This would be a great addition. I'm facing non-deterministic issues when launching a fresh cluster. It seems that when the data nodes are started before the master nodes, the failing health checks cause Kubernetes to restart the nodes indefinitely. In that situation, the only way to recover is to recreate the cluster.
Thanks for the feedback! This is at the top of my list for making this a fully functional operator. Going to try to get this working very soon.