What happens when the Cluster Coordinator fails?
Hello,
What happens when the cluster coordinator node completely dies?
- Can you replace it with a new cluster coordinator node?
- What happens to existing nodes whose gossip seeds point to the old/dead cluster coordinator node?
- Is it possible to set up automatic leader election to replace the cluster coordinator instead of doing it "manually"?
Hi @cozos, good questions.
If the coordinator dies and cannot be recovered, you will need to do the following:
- assign another, healthy node to be coordinator (see Changing the Coordinator)
- remove the dead node from the cluster (see Removing a Node)
- (optionally) add a new node to the cluster to return the cluster to its original size
Note that these steps require that you have a replication factor of at least 2.
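To illustrate why the replication factor matters here, the following is a minimal toy sketch, not the actual placement logic. It assumes a hypothetical round-robin shard placement (`place_shards`) and made-up node names; the point is only that with a single replica, any shard that lived solely on the dead node has no surviving copy.

```python
def place_shards(nodes, num_shards, replicas):
    """Assign each shard to `replicas` distinct nodes, round-robin.

    Hypothetical toy scheme for illustration only; a real cluster
    will use its own placement algorithm.
    """
    if replicas > len(nodes):
        raise ValueError("replicas cannot exceed the number of nodes")
    return {
        shard: [nodes[(shard + i) % len(nodes)] for i in range(replicas)]
        for shard in range(num_shards)
    }


def lost_shards(placement, dead_node):
    """Return the shards whose every copy lived on the dead node."""
    return sorted(
        shard
        for shard, owners in placement.items()
        if all(owner == dead_node for owner in owners)
    )


# Made-up node names; node0 plays the role of the dead coordinator.
nodes = ["node0", "node1", "node2"]

for replicas in (1, 2):
    placement = place_shards(nodes, num_shards=6, replicas=replicas)
    print(replicas, lost_shards(placement, "node0"))
# prints:
# 1 [0, 3]
# 2 []
```

With one replica, shards 0 and 3 exist only on the dead node, so removing it loses data. With two replicas, every shard still has a copy on a healthy node, which is what makes the remove-and-replace steps safe.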
The gossip seed pointing to the old coordinator won't be a problem, as seeds are only used during startup. If you restart a node, you'll want to ensure that its seeds configuration contains at least one node that is still available. In general, it is good practice to list more than one node in seeds.
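As a rough illustration of the seeds advice, here is a small sketch of a pre-restart check. The helper names (`usable_seeds`, `check_seeds`), the addresses, and the port are all hypothetical placeholders; the sketch simply filters a node's seed list against hosts you already know to be dead and refuses to proceed if nothing is left.

```python
def usable_seeds(seeds, dead_hosts):
    """Return the "host:port" seeds whose host is not known to be dead."""
    return [s for s in seeds if s.rsplit(":", 1)[0] not in dead_hosts]


def check_seeds(seeds, dead_hosts):
    """Raise if a restart with these seeds could not reach any live node."""
    alive = usable_seeds(seeds, dead_hosts)
    if not alive:
        raise RuntimeError(
            "every configured seed points at a dead node; "
            "add a live node to the seeds before restarting"
        )
    return alive


# Placeholder addresses: 10.0.0.1 stands in for the dead coordinator.
dead = {"10.0.0.1"}

# Two seeds configured: the restart can still join via 10.0.0.2.
print(check_seeds(["10.0.0.1:14000", "10.0.0.2:14000"], dead))

# Only the dead coordinator configured: the restart would fail to join.
try:
    check_seeds(["10.0.0.1:14000"], dead)
except RuntimeError as err:
    print("refusing to restart:", err)
```

This only checks against a list of hosts you supply; it does not probe the network, so treat it as a sanity check on the configuration rather than a health check.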
We are working on implementing automatic leader election, but I don't yet have information on when it will be available.
Thanks @travisturner for the in-depth response. I realize now that a lot of these questions are answered in TFA.
Out of curiosity, is there a particular reason automatic leader election is not a thing? Or is it something you guys haven't gotten around to yet?
One last question: is automatic dead node detection (i.e. detect a node is unavailable for a while and rebalance) on the roadmap? If not, any particular reason?
Thanks and feel free to close the issue if you want.