etcd-backup-restore

Updated the logic to accommodate the transient quorum loss scenario.

Open · abdasgupta opened this pull request on Sep 01 '22 · 8 comments

What this PR does / why we need it:

Which issue(s) this PR fixes: Fixes https://github.com/gardener/etcd-druid/issues/436

Special notes for your reviewer: Recovery from a transient quorum loss is already supported by a multi-node ETCD cluster. But so far, we were taking extra action to restore a single node and to scale up the ETCD cluster, and that action was blocking the normal path of recovery from a transient quorum loss. This PR updates the logic so that recovery from transient quorum loss, single-node restoration, and scale-up can all proceed without blocking each other.

The following are the details of how I tested transient quorum loss manually (see the command sketch after the list):

  1. Bootstrap a three-node cluster using the ETCD CR and druid. Before bootstrapping, make sure that the backup bucket is empty.

  2. After the cluster is properly bootstrapped, put some data into ETCD using the etcdctl put command.

  3. Wait a few minutes so that an incremental snapshot is taken.

  4. Apply a webhook that prevents new pods from being created.

  5. Delete two pods to trigger a quorum loss.

  6. After waiting for some time, remove the webhook.

  7. The two pods rejoin the cluster sequentially.
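For reference, a minimal sketch of steps 4 to 7 as shell commands (the webhook manifest name deny-pod-creation.yaml is a placeholder and not part of this PR; the pod names follow the transcript in the first comment):

```sh
# Step 4: apply a (hypothetical) webhook that rejects new pod creation
kubectl apply -f deny-pod-creation.yaml

# Step 5: delete two of the three members to break quorum
kubectl delete pod etcd-main-1 etcd-main-2

# Step 6: after observing the quorum loss, remove the webhook again
kubectl delete -f deny-pod-creation.yaml

# Step 7: watch the deleted members come back and rejoin sequentially
kubectl get pods -w
```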

More details are in the first comment.

While recovering from a quorum loss, the ETCD pods join the cluster sequentially.
Pods join the cluster sequentially in the scale-up case as well.

abdasgupta · Sep 01 '22 20:09

  1. Bootstrap of the three-node ETCD cluster:
NAME                          READY   STATUS    RESTARTS   AGE
etcd-druid-649789fc4f-q9kdv   1/1     Running   0          29h
etcd-main-0                   2/2     Running   0          73s
etcd-main-1                   2/2     Running   0          73s
etcd-main-2                   2/2     Running   0          73s
  2. Put some data in one of the members:
$ k exec -it etcd-main-0 -c etcd /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/ # etcdctl --cacert=/var/etcd/ssl/client/ca/ca.crt --cert=/var/etcd/ssl/client/client/tls.crt --key=/var/etcd/ssl/client/client/tls.key --endpoints=https://etcd-main-local:2379 put foo bar
  3. After two pods are deleted from the cluster:
NAME                          READY   STATUS    RESTARTS   AGE
etcd-druid-649789fc4f-q9kdv   1/1     Running   0          29h
etcd-main-0                   2/2     Running   0          3m34s
  4. The two pods join the cluster and it recovers from the transient quorum loss:
NAME                          READY   STATUS    RESTARTS   AGE
etcd-druid-649789fc4f-q9kdv   1/1     Running   0          29h
etcd-main-0                   2/2     Running   0          6m28s
etcd-main-1                   2/2     Running   0          74s
etcd-main-2                   2/2     Running   0          74s
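Not part of the original comment, but a quick way to verify the recovery (a sketch only; the certificate paths and the etcd-main-local endpoint are copied from the transcript above):

```sh
# List the members and confirm all three are started again
etcdctl --cacert=/var/etcd/ssl/client/ca/ca.crt \
  --cert=/var/etcd/ssl/client/client/tls.crt \
  --key=/var/etcd/ssl/client/client/tls.key \
  --endpoints=https://etcd-main-local:2379 \
  member list -w table

# Confirm the key written before the quorum loss is still readable
etcdctl --cacert=/var/etcd/ssl/client/ca/ca.crt \
  --cert=/var/etcd/ssl/client/client/tls.crt \
  --key=/var/etcd/ssl/client/client/tls.key \
  --endpoints=https://etcd-main-local:2379 \
  get foo
```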

abdasgupta · Sep 01 '22 20:09

@abdasgupta You need to rebase this pull request with the latest master branch. Please check.

gardener-robot · Sep 02 '22 13:09

With this PR, is #456 still needed?

aaronfern · Sep 06 '22 06:09

While recovering from the quorum loss, all the ETCD pods join the cluster sequentially.

From what I can determine from your logic, it is also true for the scale-up feature, so I guess you can update the release note.

ishan16696 · Sep 09 '22 08:09

@abdasgupta Integration tests failed. Can you please take a look?

ishan16696 · Sep 13 '22 12:09

@abdasgupta Thanks for creating the issue to link this PR. A few comments on the overall structure which can help us align this with other PRs.

What this PR does / why we need it:

The section is empty ?? Can you please elaborate on how the PR fixes https://github.com/gardener/etcd-druid/issues/436#issue-1378486604? The special note for the reviewer is the same as the motivation part in https://github.com/gardener/etcd-druid/issues/436#issue-1378486604, so it will immensely help the readers of this PR to elaborate on the approach used to fix the issue highlighted in the motivation section. Also, though it is fresh in everyone's memory how this solves the problem, if someone were to look at it, let's say, 6 months down the line, they will actually read this section to ascertain whether they should check the code for what they are looking for.

Some reference PRs in the multi-node scope where the PR description is well captured: https://github.com/gardener/etcd-backup-restore/pull/482 https://github.com/gardener/etcd-backup-restore/pull/532 https://github.com/gardener/etcd-backup-restore/pull/411

Also, while reading the release notes, I couldn't get the gist of what's being introduced with this PR to allow recovery from transient quorum loss.

While recovering from a quorum loss, the ETCD pods join the cluster sequentially. Pods join the cluster sequentially in the scale-up case as well.

Which of the above 2 statements highlights what has been newly introduced with this PR? I believe only the second one, as it says "as well". However, to me they looked like 2 isolated statements which might be understood by someone involved in the implementation/discussion, but maybe not by me. @ishan16696, @aaronfern, do you read/see it differently?

While reading it, I also checked the above-listed PRs for their release notes, and they looked more apt (but maybe that's just me 🤔): https://github.com/gardener/etcd-backup-restore/pull/482 https://github.com/gardener/etcd-backup-restore/pull/532 https://github.com/gardener/etcd-backup-restore/pull/509 https://github.com/gardener/etcd-backup-restore/pull/411

ashwani2k · Sep 20 '22 05:09

Which of the above 2 statements highlights what has been newly introduced with this PR? I believe only the second one, as it says "as well". However, to me they looked like 2 isolated statements which might be understood by someone involved in the implementation/discussion, but maybe not by me.

Also, if the answer is that both are valid, then I'll propose we combine them.
One more thing which came to me on reading it again: as we are fixing transient quorum loss, we should highlight this as well in the release notes to avoid confusion with the other story of permanent quorum loss.

Also, shouldn't the categorization for the release note be bug instead of noteworthy and operator, as I believe the operator will have nothing to act on here since this will recover automatically? Or am I missing something?

```noteworthy operator
While recovering from a quorum loss, the ETCD pods join the cluster sequentially.
Pods join the cluster sequentially in the scale-up case as well.
```

ashwani2k · Sep 20 '22 05:09

@abdasgupta I guess we can close this PR?

ishan16696 · Oct 13 '22 07:10

We have found that there is no need to add extra logic to recover from the transient quorum loss scenario, as the PVC/data-dir of each cluster member in the etcd cluster stays intact; hence the cluster recovers from a transient quorum loss without any manual intervention. To recover from the permanent quorum loss scenario, we have added a playbook doc here. Please follow the steps mentioned in the playbook to recover the cluster from a permanent quorum loss.
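As an illustration only (not from the original comment; the PVC name pattern and the data directory path /var/etcd/data are assumptions about a typical gardener etcd setup), the intact state can be spot-checked like this:

```sh
# The members' PersistentVolumeClaims should still be Bound after the outage
kubectl get pvc | grep etcd-main

# The data directory of a member should still be populated
# (/var/etcd/data is an assumed mount path and may differ per setup)
kubectl exec etcd-main-0 -c etcd -- ls /var/etcd/data
```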

Hence, I'm closing this PR. /close

ishan16696 · Nov 04 '22 17:11