Unsafe recovery option
Provide an option to force a Ra server to start as a single-node cluster in the case where quorum can never be re-established and we just want to recover what is left.
We've discussed this and compared what Ra has today to other Raft-based systems, namely etcd and Consul.
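As an illustration of why such an option is needed, here is a minimal sketch of the majority rule in a Raft-style cluster and of what shrinking the membership to one node changes. It is a toy model only, not Ra's API: the names `quorum`, `can_make_progress` and `force_shrink_to` are hypothetical and chosen for this example.

```python
def quorum(cluster_size):
    """Number of members that form a majority in a cluster of this size."""
    return cluster_size // 2 + 1


def can_make_progress(members, reachable):
    """True when the reachable members form a majority of the membership."""
    return len(set(members) & set(reachable)) >= quorum(len(members))


def force_shrink_to(members, survivor):
    """Hypothetical forced boot: drop every member except the survivor.

    This is unsafe by construction: anything the lost members had
    that the survivor does not have is gone.
    """
    if survivor not in members:
        raise ValueError(f"{survivor} is not a member of the cluster")
    return {survivor}


members = {"a", "b", "c"}
reachable = {"a"}  # "b" and "c" are permanently lost

before = can_make_progress(members, reachable)
members = force_shrink_to(members, "a")
after = can_make_progress(members, reachable)

print(before, after)  # False True
```

With three members and only one reachable, a majority of two can never be formed. Once the membership is forcibly reduced to the surviving node, the majority is one and that node can make progress on its own.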
Links
etcd
- Permanent loss of cluster quorum requires a new cluster, by design
- Data safety is quoted as the primary reason for that
- Node recovery is performed using a snapshot file or an existing node data directory
- A node can be forced to boot: cluster size shrinks to one
- Keyspace data can be preserved for a node started with a new configuration
- A node can be forced to join a cluster and forget about its previous one
Consul
- Cluster membership uses a gossip protocol
- Recreated failed nodes (e.g. replacement pods) must retain node identity, which is derived from the IP address
- A node can be forced to boot (bootstrap): cluster size shrinks to one
- A node can be forcefully removed from the cluster
- One of the recommendations suggests recovering a single node by forcing it to boot with only one known cluster member, and making a number of brand new nodes join it
What Do We Want to Have in Ra
- Forced boot option
- Other ideas are out of scope for now
This was done in #306