
What happens when the Cluster Coordinator fails?

Open cozos opened this issue 5 years ago • 2 comments

Hello,

What happens when the cluster coordinator node completely dies?

  • Can you replace it with a new cluster coordinator node?
  • What happens to the existing nodes whose gossip seeds point to the old/dead cluster coordinator node?
  • Is it possible to set up automatic leader election to replace the cluster coordinator instead of doing it "manually"?

cozos • Dec 16 '20 17:12

Hi @cozos, good questions.

If the coordinator dies and cannot be recovered, you will need to do the following:

  • assign another, healthy node to be coordinator (see Changing the Coordinator)
  • remove the dead node from the cluster (see Removing a Node)
  • (optionally) add a new node to the cluster to return the cluster to its original size
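
The first two steps can be scripted. The sketch below only builds the two HTTP requests, in order, without sending them. It assumes the Pilosa-era admin endpoints /cluster/resize/set-coordinator and /cluster/resize/remove-node, each taking a JSON body of the form {"id": ...}, and it sends both to the new coordinator. Check the Changing the Coordinator and Removing a Node docs for your version before relying on these paths. The host, port, and node IDs are hypothetical, and the optional third step (adding a node) is not shown.

```python
import json
import urllib.request
from typing import List

# ASSUMPTION: endpoint paths taken from the Pilosa-era admin docs
# ("Changing the Coordinator", "Removing a Node"); verify for your version.
SET_COORDINATOR_PATH = "/cluster/resize/set-coordinator"
REMOVE_NODE_PATH = "/cluster/resize/remove-node"


def _post_json(base_url: str, path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def coordinator_recovery_plan(healthy_url: str, new_coordinator_id: str,
                              dead_node_id: str) -> List[urllib.request.Request]:
    """Return the requests in order: promote a healthy node, then remove the dead one."""
    if new_coordinator_id == dead_node_id:
        raise ValueError("the new coordinator must be a healthy node")
    return [
        _post_json(healthy_url, SET_COORDINATOR_PATH, {"id": new_coordinator_id}),
        _post_json(healthy_url, REMOVE_NODE_PATH, {"id": dead_node_id}),
    ]


if __name__ == "__main__":
    # Hypothetical host and node IDs, for illustration only.
    for req in coordinator_recovery_plan("http://node2:10101", "node2", "node1"):
        print(req.get_method(), req.full_url, req.data.decode())
        # To actually send a request: urllib.request.urlopen(req)
```

Ordering matters here: the cluster needs a live coordinator before it can process the removal of the dead node.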

Note that these steps require a replication factor of at least 2, so that every shard held by the dead node still has a copy on a surviving node.
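
To see why the replication factor matters, here is a toy model (not FeatureBase's internal data structures) that maps each shard to the nodes holding a copy of it. It reports which shards would be left with no copy if a given node were removed. With a replication factor of 1, every shard on the dead node is lost. With 2, each shard survives on its other owner.

```python
from typing import Dict, List


def shards_lost_without(node_id: str, placement: Dict[int, List[str]]) -> List[int]:
    """Return shards that would have no copy left if `node_id` were removed.

    `placement` maps shard number -> IDs of the nodes holding a replica
    (a hypothetical model). A shard with no owners at all is also reported.
    """
    return sorted(
        shard for shard, owners in placement.items()
        if all(owner == node_id for owner in owners)
    )


if __name__ == "__main__":
    # Replication factor 1: each shard lives on exactly one node.
    rf1 = {0: ["node1"], 1: ["node2"], 2: ["node3"]}
    # Replication factor 2: each shard lives on two distinct nodes.
    rf2 = {0: ["node1", "node2"], 1: ["node2", "node3"], 2: ["node3", "node1"]}
    print(shards_lost_without("node1", rf1))  # [0]
    print(shards_lost_without("node1", rf2))  # []
```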

A gossip seed pointing to the old coordinator won't be a problem, as seeds are only used during startup. If you restart a node, make sure its seeds configuration contains at least one node that is still available. In general, it is good practice to list more than one node in seeds.
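
A minimal pre-restart check along these lines: given a node's seed list, find the first seed that accepts a TCP connection. This is a hypothetical helper, not part of FeatureBase. A successful TCP connect only shows that something is listening at that address, not that it is a healthy cluster member, and the host:port values below are made up. The reachability probe is passed in as a parameter so the selection logic can be exercised without a network.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_reachable(address: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to "host:port" succeeds within `timeout` seconds."""
    host, _, port = address.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):  # unreachable host, or malformed port
        return False


def first_live_seed(seeds: Iterable[str],
                    is_reachable: Callable[[str], bool] = tcp_reachable) -> Optional[str]:
    """Return the first seed that answers, or None if none of them do."""
    for seed in seeds:
        if is_reachable(seed):
            return seed
    return None


if __name__ == "__main__":
    seeds = ["node1:14000", "node2:14000", "node3:14000"]  # hypothetical addresses
    down = {"node1:14000"}  # pretend the old coordinator is gone
    print(first_live_seed(seeds, is_reachable=lambda s: s not in down))  # node2:14000
```

Listing several seeds means that losing any single node, including the old coordinator, still leaves a restarted node a way into the cluster.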

We are working on implementing automatic leader election, but I don't yet have information on when that would be available.

travisturner • Dec 18 '20 04:12

Thanks @travisturner for the in-depth response. I realize now that a lot of these questions are answered in TFA.

Out of curiosity, is there a particular reason automatic leader election is not a thing? Or is it something you guys haven't gotten around to yet?

One last question: is automatic dead-node detection (i.e., detecting that a node has been unavailable for a while and rebalancing) on the roadmap? If not, is there a particular reason?

Thanks and feel free to close the issue if you want.

cozos • Dec 20 '20 04:12