Unsafe recovery option

Open kjnilsson opened this issue 5 years ago • 1 comments

Provide an option to force a Ra server to start as a single-node cluster when a quorum can never be re-established and we just want to recover whatever data is left.

kjnilsson, Feb 14 '20 11:02

We've discussed this and taken some notes, comparing what Ra has today with other Raft-based systems, namely etcd and Consul.

etcd

  • Permanent loss of cluster quorum requires a new cluster, by design
  • Data safety is quoted as the primary reason for that
  • Node recovery is performed using a snapshot file or an existing node data directory
  • A node can be forced to boot: cluster size shrinks to one
  • Keyspace data can be preserved for a node started with new configuration
  • A node can be forced to join a cluster and forget about its previous one

Consul

  • Cluster membership uses a gossip protocol
  • Recreated failed nodes (e.g. replacement pods) must retain node identity, which is derived from the IP address
  • A node can be forced to boot (bootstrap): cluster size shrinks to one
  • A node can be forcefully removed from the cluster
  • One of the recommendations suggests recovering a single node by forcing it to boot with only one known cluster member, and making a number of brand new nodes join it

What Do We Want to Have in Ra

  • Forced boot option
  • Other ideas are out of scope for now
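
To make the "forced boot" idea concrete, here is a minimal toy sketch in Python. It is not Ra's API or implementation: `ToyServer`, `quorum_size`, `has_quorum` and `force_boot` are hypothetical names invented for illustration. It only shows why a lone survivor of a three-node cluster is stuck, and what shrinking the membership to one changes.

```python
from dataclasses import dataclass, field
from typing import List, Set


def quorum_size(n_members: int) -> int:
    """Smallest majority of a cluster with n_members voting members."""
    return n_members // 2 + 1


@dataclass
class ToyServer:
    """Hypothetical stand-in for a Raft server; not Ra's actual data model."""
    name: str
    members: Set[str]                      # membership as this server knows it
    log: List[str] = field(default_factory=list)

    def has_quorum(self, reachable: Set[str]) -> bool:
        # Count this server plus every reachable peer that is a known member.
        alive = (reachable | {self.name}) & self.members
        return len(alive) >= quorum_size(len(self.members))

    def force_boot(self) -> None:
        # Unsafe by design: membership is overwritten with this server only.
        # The local log is kept; entries that existed only on the lost
        # members cannot be recovered.
        self.members = {self.name}


# A three-node cluster where "b" and "c" are permanently lost.
server = ToyServer("a", {"a", "b", "c"}, log=["put x=1", "put y=2"])
print(server.has_quorum(reachable=set()))  # False: 1 of 3 is not a majority
server.force_boot()
print(server.has_quorum(reachable=set()))  # True: 1 of 1 is a majority
```

The sketch keeps the surviving log untouched, which matches the goal stated at the top of this issue (recover what is left). It says nothing about how a real server should persist the new membership.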

michaelklishin, Jul 20 '22 14:07

This was done in #306

kjnilsson, Jan 12 '24 09:01