etcd-backup-restore icon indicating copy to clipboard operation
etcd-backup-restore copied to clipboard

Do not check SSR State when Snapshotter should be stopped

Open avestuk opened this issue 1 year ago • 5 comments

What this PR does / why we need it:

As reported in #676 we sometimes see that multiple snapshotters can be running at one time in a cluster.

I believe that the reason for this is that the snapshotter is not set to active state until an initial delta snapshot or a fullsnapshot has been taken. If a leadership election takes place while the snapshot is being taken and the current leader loses that election, then handleSsrRequest does not send on the ssrStopCh. This in turn causes ssr.snapshotEventHandler to never return and snapshots would continue to be taken.

Secondly I'm propsing a change to close the ssrStopCh rather than sending on it. This is to make it clear to the reader that the channel is no longer being used at this point.

Lastly the function signature changes are simply adding further type annotations to the channels passed to functions to help readers understand where channels are being read from/sent to.

Which issue(s) this PR fixes: Fixes #676

Special notes for your reviewer:

The reason why the ssrState was being checked is not apparent to me so I'd be delighted to be told why this might be an awful idea.

Release note:


Do not rely on the snapshotter state when stopping the snapshotter. The snapshotter will now always be closed when a member goes from being the leader to any other state. 

avestuk avatar Nov 14 '23 16:11 avestuk

@avestuk Thank you for your contribution.

gardener-robot avatar Nov 14 '23 16:11 gardener-robot

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Nov 14 '23 16:11 CLAassistant

Thank you @avestuk for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.

gardener-robot-ci-1 avatar Nov 14 '23 16:11 gardener-robot-ci-1

@ishan16696 Anything I can do to help move this along?

avestuk avatar Jan 02 '24 11:01 avestuk

Hi @avestuk , Sorry for delay. I recognise the importance of the issue, I will try to increase the priority of the issue mentioned by you and analyse it as this PR involves modifications to our legacy code, it warrants a thorough analysis.

ishan16696 avatar Jan 02 '24 13:01 ishan16696

Thanks @avestuk for PR, Overall looks good. I'm just testing out some edge cases. Kindly requesting you to rebase this PR on latest master.

ishan16696 avatar Mar 20 '24 03:03 ishan16696

As requested in this comment: https://github.com/gardener/etcd-backup-restore/pull/680#issuecomment-2008622838, please rebase this PR on latest master.

ishan16696 avatar Mar 23 '24 04:03 ishan16696

As requested in this comment: #680 (comment), please rebase this PR on latest master.

Thanks very much @ishan16696 I've rebased on latest master.

avestuk avatar Mar 25 '24 08:03 avestuk

/hold until #716 is merged.

renormalize avatar Mar 26 '24 05:03 renormalize

/hold until https://github.com/gardener/etcd-backup-restore/pull/716 is merged.

PR #716 will take some time as it also need to resolve merge conflicts, let's not block this PR. Talked to @shreyas-s-rao and he is fine with it.

ishan16696 avatar Mar 26 '24 09:03 ishan16696