
PRS and ERS don't promote replicas taking backups

Open ejortegau opened this pull request 1 year ago

Description

This PR prevents ERS and PRS from promoting hosts that are currently taking backups.

The implementation follows what was suggested in the RFC for the issue (linked below). Namely, the RPCs used to get information about candidates now include an extra field indicating whether they are running backups or not, and that field is used to filter the list of valid candidates.
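
As a rough sketch of the idea only - the type and function names below (candidateStatus, BackupRunning, filterBackupCandidates) are hypothetical stand-ins, not the actual Vitess RPC messages or reparentutil code:

```go
// Minimal illustration of filtering reparent candidates on a backup flag.
package main

import "fmt"

// candidateStatus stands in for the per-tablet status collected over RPC;
// BackupRunning is the extra field carrying the backup state.
type candidateStatus struct {
	Alias         string
	BackupRunning bool
}

// filterBackupCandidates drops tablets that report an in-progress backup.
func filterBackupCandidates(candidates []candidateStatus) []candidateStatus {
	var valid []candidateStatus
	for _, c := range candidates {
		if c.BackupRunning {
			continue // skip tablets currently taking a backup
		}
		valid = append(valid, c)
	}
	return valid
}

func main() {
	candidates := []candidateStatus{
		{Alias: "zone1-0000000101", BackupRunning: false},
		{Alias: "zone1-0000000102", BackupRunning: true},
	}
	// Only zone1-0000000101 remains eligible for promotion.
	fmt.Println(filterBackupCandidates(candidates))
}
```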

Related Issue(s)

https://github.com/vitessio/vitess/issues/16558

Checklist

  • [x] "Backport to:" labels have been added if this change should be back-ported to release branches
  • [x] If this change is to be back-ported to previous releases, a justification is included in the PR description
  • [x] Tests were added or are not required
  • [x] Did the new or modified tests pass consistently locally and on CI?
  • [x] Documentation was added or is not required

Deployment Notes

N/A

ejortegau avatar Oct 18 '24 10:10 ejortegau

Review Checklist

Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.

General

  • [ ] Ensure that the Pull Request has a descriptive title.
  • [ ] Ensure there is a link to an issue (except for internal cleanup and flaky test fixes); new features should have an RFC that documents use cases and test cases.

Tests

  • [ ] Bug fixes should have at least one unit or end-to-end test; enhancements and new features should have a sufficient number of tests.

Documentation

  • [ ] Apply the release notes (needs details) label if users need to know about this change.
  • [ ] New features should be documented.
  • [ ] There should be some code comments as to why things are implemented the way they are.
  • [ ] There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • [ ] Is this flag really necessary?
  • [ ] Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • [ ] Each item in Jobs should be named in order to mark it as required.
  • [ ] If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • [ ] Protobuf changes should be wire-compatible.
  • [ ] Changes to _vt tables and RPCs need to be backward compatible.
  • [ ] RPC changes should be compatible with vitess-operator
  • [ ] If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • [ ] vtctl command output order should be stable and awk-able.

vitess-bot[bot] avatar Oct 18 '24 10:10 vitess-bot[bot]

Codecov Report

Attention: Patch coverage is 93.02326% with 3 lines in your changes missing coverage. Please review.

Project coverage is 67.32%. Comparing base (469bdcc) to head (97c0271). Report is 31 commits behind head on main.

Files with missing lines                      Patch %   Lines
go/vt/vttablet/tabletmanager/rpc_backup.go    0.00%     2 Missing :warning:
go/vt/vtctl/reparentutil/util.go              92.30%    1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16997      +/-   ##
==========================================
+ Coverage   67.31%   67.32%   +0.01%     
==========================================
  Files        1569     1570       +1     
  Lines      252502   252762     +260     
==========================================
+ Hits       169964   170182     +218     
- Misses      82538    82580      +42     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.


codecov[bot] avatar Oct 18 '24 12:10 codecov[bot]

@ejortegau I think it would be safer to "prefer" not to promote a REPLICA performing a backup vs. wholly excluding it. The reason is that promoting a tablet that is backing up to PRIMARY (if no other candidates are available) feels better than doing nothing - which is a 100% outage to writes

If we are down to 1 candidate, still being able to write feels better to me, even if the node is busy with backups as well

timvaillancourt avatar Oct 21 '24 17:10 timvaillancourt

Another approach could be to cancel the backup in order to enable promotion, if having backups delayed is preferable to being unavailable for writes for a long time.

GrahamCampbell avatar Oct 21 '24 17:10 GrahamCampbell

@ejortegau I think it would be safer to "prefer" not to promote a REPLICA performing a backup vs. wholly excluding it. The reason is that promoting a tablet that is backing up to PRIMARY (if no other candidates are available) feels better than doing nothing - which is a 100% outage to writes

If we are down to 1 candidate, still being able to write feels better to me, even if the node is busy with backups as well

I think we should consider PRS and ERS separately for this discussion. Assume you have 1 primary and 1 replica, which is taking a backup:

PRS scenario

Operator or automation calls PRS. Our choices are to either fail it, or promote the replica that is running a backup. IMHO, it's better that the operator/automation that called PRS is made aware of the fact that the host that would be promoted is running a backup (by failing it), instead of blindly promoting it without the operator/automation knowing. In that case, the operator can decide to either cancel the backup and retry the PRS, or wait until the backup is completed and then retry the PRS.

ERS scenario

Operator or automation (including vtorc) decides that the primary has failed and calls ERS. There's only one replica. IMHO, the ERS should fail - there will be no healthy replicas afterwards, which will play poorly with semi-sync, unless you don't care about durability. So again, I think the case where the only host left is taking a backup should fail.

Thoughts, @timvaillancourt ?

ejortegau avatar Oct 22 '24 13:10 ejortegau

IMHO, it's better that the operator/automation that called PRS is made aware of the fact that the host that would be promoted is running a backup (by failing it)

@ejortegau I agree here, I think that would be best 👍. The "planned" aspect should make this safe to fail

Operator or automation (including vtorc) decides that the primary has failed and calls ERS. There's only one replica. IMHO, the ERS should fail - there will be no healthy replicas afterwards, which will play poorly with semi-sync, unless you don't care about durability. So again, I think the case where the only host left is taking a backup should fail.

I disagree here, I think the goal of VTOrc's ERS should always be availability

In the ERS scenario, I consider no promotion to be much worse than promoting a node that may or may not be impacted by backups running - in many cases users performing online backups will be able to serve primary traffic and I don't see a reason to leave the shard broken. Is there a benefit to leaving the shard in this state that I'm missing @ejortegau?

cc @GuptaManan100 for input on this area

timvaillancourt avatar Oct 22 '24 18:10 timvaillancourt

I think we should still allow REPLICAs running backups to be promoted, they should just be less preferred. We can do this by changing the sorting logic we use for the tablets in both promotion flows (ERS and PRS). If a user wants a tablet that is running a backup to never be promoted, they should just change its type to BACKUP when they start the backup; then it will never be considered for promotion.
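
A rough sketch of that sorting tweak, using simplified, hypothetical types (candidate, Position) rather than the real reparentutil sorting code - backup status only breaks ties between equally advanced tablets:

```go
// Minimal illustration: prefer tablets not taking backups when positions tie.
package main

import (
	"fmt"
	"sort"
)

type candidate struct {
	Alias         string
	Position      int64 // stand-in for a real replication position comparison
	BackupRunning bool
}

// sortCandidates orders candidates by replication position first; among
// equally advanced tablets, the ones not taking a backup sort ahead.
func sortCandidates(cands []candidate) {
	sort.SliceStable(cands, func(i, j int) bool {
		if cands[i].Position != cands[j].Position {
			return cands[i].Position > cands[j].Position // most advanced first
		}
		// Tie-break: prefer the tablet that is not running a backup.
		return !cands[i].BackupRunning && cands[j].BackupRunning
	})
}

func main() {
	cands := []candidate{
		{Alias: "zone1-0000000102", Position: 50, BackupRunning: true},
		{Alias: "zone1-0000000101", Position: 50, BackupRunning: false},
	}
	sortCandidates(cands)
	fmt.Println(cands[0].Alias) // zone1-0000000101: equally advanced, not backing up
}
```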

GuptaManan100 avatar Oct 23 '24 08:10 GuptaManan100

@timvaillancourt & @GuptaManan100, I have updated the PR so that it works in a "prefer not" way, and added the relevant files to the changelog. Please have another look.

ejortegau avatar Oct 31 '24 10:10 ejortegau

@timvaillancourt & @GuptaManan100, I have updated the PR so that it works in a "prefer not" way, and added the relevant files to the changelog. Please have another look.

LGTM! Thanks

timvaillancourt avatar Oct 31 '24 14:10 timvaillancourt

@GuptaManan100 's suggestion has been applied, please have another look

ejortegau avatar Nov 04 '24 14:11 ejortegau

@GuptaManan100 / @ejortegau: I can't see why this would change this behaviour, but I wanted to double check that this ERS behaviour remains true after this change:

Wait for the primary-elect to catch up to the most advanced replica, if it isn't already the most advanced.

To be specific, if the backing-up REPLICA is the most up-to-date in replication (we've had this happen with xtrabackup somehow), will the primary-elect catch up to the backing-up REPLICA that we likely ignored as a candidate?

And could any of the backup engines prevent this from happening? For example, some engines take out some heavy mysql-level locks 🤔

timvaillancourt avatar Nov 05 '24 22:11 timvaillancourt

Replicas taking backups are still viable candidates; we don't remove them from the list even now. The backup status is just an extra piece of information used to sort the tablets. If we have 2 equally advanced tablets (in terms of position), then the one taking the backup will be lower down the list of candidates. That being said, if a backing up replica is the most advanced tablet there is, we will still promote it.

GuptaManan100 avatar Nov 06 '24 05:11 GuptaManan100

That being said, if a backing up replica is the most advanced tablet there is, we will still promote it.

@GuptaManan100 at least in our production this would be undesirable. We had this happen on one shard (the backup node was most advanced, surprisingly) and the overhead of the backup was enough to impact primary operations

I think it would be better for the 2nd-most-current replica to be promoted in this scenario, and whatever magic we have to "catch up to the most advanced replica" would be used to catch up to the backing-up node

Without this behaviour I think we are open to the same scenario/risks that prompted this support to be added. cc @ejortegau for thoughts here

timvaillancourt avatar Nov 06 '24 20:11 timvaillancourt

If backup operations impact the tablet's ability to serve as a primary at all, then we should decide to never promote it. We can do that by filtering the candidates out in filterValidCandidates. We can remove the sorting changes as well; just removing these tablets would be enough.

If we make this change, then the tablet will be used to get another tablet caught up, but it won't be elected itself. However, with these changes, if we find ourselves in a situation such that the only viable tablet is taking a backup, we will not promote it.

GuptaManan100 avatar Nov 07 '24 04:11 GuptaManan100

I think this comes back to what behavior we want for the two different cases. Using filterValidCandidates would work fine for ERS, and is in fact what the original version of the PR used in that case.

This leaves open what should be done for PRS, since PRS logic does not have ideal source selection after picking the most up-to-date candidate, as far as I can see. We could do what I proposed here for that, i.e. fail the PRS with an error that allows the operator to decide whether they will cancel the backup or pick a different source.

Thoughts?

ejortegau avatar Nov 07 '24 15:11 ejortegau

@ejortegau Yes, we can go back to the original version, I think that would do what we want it to.

As far as PRS is concerned, the primary is online, so there is no need for a multi-step promotion. Any candidate that we select can be promoted; we wait for it to catch up to the primary. So, in PRS, it would be enough to remove these tablets from consideration in the ElectNewPrimary function.
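
A rough sketch of that PRS-side idea, with hypothetical, simplified types (tabletInfo, electNewPrimary) rather than the real ElectNewPrimary code:

```go
// Minimal illustration: exclude tablets taking backups when electing a new
// primary in PRS; fail if no other candidate remains.
package main

import (
	"errors"
	"fmt"
)

type tabletInfo struct {
	Alias         string
	BackupRunning bool
}

// electNewPrimary picks the first candidate not running a backup. Because the
// old primary is still up in PRS, the chosen tablet only has to catch up to
// it, so no multi-step promotion is needed.
func electNewPrimary(candidates []tabletInfo) (string, error) {
	for _, c := range candidates {
		if !c.BackupRunning {
			return c.Alias, nil
		}
	}
	return "", errors.New("no candidate available: all remaining replicas are taking backups")
}

func main() {
	// With only a backing-up replica left, PRS fails and the operator can
	// decide to cancel the backup or pick a different source.
	alias, err := electNewPrimary([]tabletInfo{{Alias: "zone1-0000000102", BackupRunning: true}})
	fmt.Println(alias, err)
}
```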

GuptaManan100 avatar Nov 08 '24 07:11 GuptaManan100

I have updated the PR to do the following:

  1. In the ERS case, filterValidCandidates() does not return hosts taking backups, unless there are no better candidates that are not taking backups. This means ERS can promote hosts taking backups, but only if no other candidates are available (a rough sketch of this fallback follows after this list).
  2. In the PRS case, ElectNewPrimary() excludes hosts taking backups. This means PRS can fail if there's only one good candidate and it's taking a backup.
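
For point 1, a rough sketch of the fallback behavior - the types and the function name filterValidCandidatesSketch are hypothetical, not the actual filterValidCandidates() implementation:

```go
// Minimal illustration: prefer candidates not taking backups, but fall back
// to backing-up tablets when nothing else is available so ERS can still run.
package main

import "fmt"

type ersCandidate struct {
	Alias         string
	BackupRunning bool
}

// filterValidCandidatesSketch keeps only non-backup tablets when at least one
// exists; otherwise it returns the original list so a promotion is still possible.
func filterValidCandidatesSketch(cands []ersCandidate) []ersCandidate {
	var notBackingUp []ersCandidate
	for _, c := range cands {
		if !c.BackupRunning {
			notBackingUp = append(notBackingUp, c)
		}
	}
	if len(notBackingUp) > 0 {
		return notBackingUp
	}
	return cands // every candidate is taking a backup: keep them rather than fail
}

func main() {
	onlyBackups := []ersCandidate{{Alias: "zone1-0000000102", BackupRunning: true}}
	fmt.Println(filterValidCandidatesSketch(onlyBackups)) // still promotable
}
```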

Please have another look.

ejortegau avatar Nov 08 '24 16:11 ejortegau

Thanks for the contribution and for your patience with the review process.

deepthi avatar Nov 21 '24 04:11 deepthi