etcd-backup-restore
Restoration of a member in a multi-node cluster
Describe the bug:
When an existing etcd member crashes and comes back up with a data directory that is no longer valid, then in a multi-node setup the data directory is removed and only a limited number of attempts are made to re-add the member as a learner. Now consider the case where more than one member goes down and both try to recover (for example, in a 5-member cluster). Quorum is still intact, so it can happen that both members attempt to add themselves as learners at the same time, and one of them will fail, since etcd by default allows only one learner in the cluster at a time.
Expected behavior: In the scale-up case, adding the current candidate as a learner is retried repeatedly (up to 6 times). The same retry behavior should also apply when the restoration of a member in a multi-node cluster requires it to be added as a learner.
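A minimal sketch of such a retry loop, assuming the etcd clientv3 API; the function name, retry count, and back-off below are illustrative assumptions, not taken from the etcd-backup-restore code base:

```go
package member

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// addAsLearnerWithRetry (hypothetical helper) repeatedly tries to add this
// member to the cluster as a learner. A concurrent recovery by another member
// can make MemberAddAsLearner fail transiently (etcd allows only one learner
// at a time by default), so each failure is retried after a back-off instead
// of giving up on the first attempt.
func addAsLearnerWithRetry(ctx context.Context, cli *clientv3.Client, peerURL string, maxRetries int, backoff time.Duration) error {
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if _, err := cli.MemberAddAsLearner(ctx, []string{peerURL}); err == nil {
			return nil
		} else {
			lastErr = err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			// back off, then retry
		}
	}
	return fmt.Errorf("could not add member as learner after %d attempts: %w", maxRetries, lastErr)
}
```

Calling this with the same limit as the scale-up path (e.g. 6 attempts) would make the restoration flow tolerate the transient "another learner already exists" failure instead of giving up.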