etcd-backup-restore
Restoration of a member in a multi-node cluster
Describe the bug:
When an existing etcd member crashes and comes back up with a data directory that is no longer valid, then in a multi-node setup the data directory is removed and only a limited number of attempts are made to re-add the member as a learner. Now consider the case where more than one member goes down and both try to recover (for example, in a 5-member cluster). Quorum is still intact, so it can happen that both members attempt to add themselves as learners at the same time, and one of them will fail, since etcd by default allows only one learner in the cluster at a time.
Expected behavior: In the scale-up case, adding the current candidate as a learner is retried repeatedly (up to 6 times). The same retry behavior should also apply when the restoration of a member in a multi-node cluster requires it to be added as a learner.
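A minimal sketch of such a retry loop, assuming the etcd clientv3 API; the function name, retry count, and back-off below are illustrative assumptions, not taken from the etcd-backup-restore code base:

```go
package member

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// addAsLearnerWithRetry (hypothetical helper) repeatedly tries to add this
// member to the cluster as a learner. A concurrent recovery by another member
// can make MemberAddAsLearner fail transiently (etcd allows only one learner
// at a time by default), so each failure is retried after a back-off instead
// of giving up on the first attempt.
func addAsLearnerWithRetry(ctx context.Context, cli *clientv3.Client, peerURL string, maxRetries int, backoff time.Duration) error {
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if _, err := cli.MemberAddAsLearner(ctx, []string{peerURL}); err == nil {
			return nil
		} else {
			lastErr = err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			// back off, then retry
		}
	}
	return fmt.Errorf("could not add member as learner after %d attempts: %w", maxRetries, lastErr)
}
```

Calling this with the same limit as the scale-up path (e.g. 6 attempts) would make the restoration flow tolerate the transient "another learner already exists" failure instead of giving up.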