redpanda
raft: amend error code when leadership transfer can't proceed due to recovery
Cover letter
In do_transfer_leadership(), if a follower is still not caught up after prepare_transfer_leadership() completes, a timeout error was returned. However, it's not really a timeout, it's a flap (we thought recovery was done, but it wasn't). This commit changes the error to exponential_backoff so that the admin API returns a 503 (please retry) in that case rather than a 504 (we couldn't do it in time).
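For illustration, the intended change in interpretation can be sketched roughly as follows. This is a minimal standalone sketch, not the actual Redpanda code: the `errc` enum, `transfer_result()`, and `http_status_for()` are hypothetical stand-ins, and the mapping only mirrors the 503/504 behavior described above.

```cpp
// Hypothetical stand-ins for the Raft error codes discussed in this PR.
enum class errc { success, timeout, exponential_backoff };

// Hypothetical decision point after the prepare step has finished:
// a follower that is still behind is reported as "back off and retry"
// instead of as a timeout.
constexpr errc transfer_result(bool follower_caught_up) {
    return follower_caught_up ? errc::success : errc::exponential_backoff;
}

// Assumed mapping from error code to the HTTP status the admin API returns.
constexpr int http_status_for(errc e) {
    switch (e) {
    case errc::success:
        return 200;
    case errc::exponential_backoff:
        return 503; // Service Unavailable: the caller should retry
    case errc::timeout:
        return 504; // Gateway Timeout: the operation did not finish in time
    }
    return 500; // any unlisted value is treated as an internal error
}

// Compile-time check of the behavior described above.
static_assert(http_status_for(transfer_result(false)) == 503,
              "a lagging follower should produce a retryable 503");
```

With this shape, a client that sees a 503 can simply retry the transfer after a delay, whereas a 504 would still indicate that the operation genuinely ran out of time.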
Fixes #6902
This is not a fix for the root cause of the issue, only a change in how the error is interpreted.
Backport Required
- [ ] not a bug fix
- [ ] issue does not exist in previous branches
- [ ] papercut/not impactful enough to backport
- [x] v22.3.x
- [x] v22.2.x
- [x] v22.1.x
UX changes
- none
Release notes
- none
I amended the title to be a bit more succinct.
Did you mean to mark this as Fixes https://github.com/redpanda-data/redpanda/issues/6902, or are you still looking at the root cause of how we exited recovery with follower stats not up to date?
I was looking into the failure of the k8s-operator CI tests (with only limited support from devprod), but I have been unable to identify the root cause so far. However, I strongly think that the failure is unrelated, so retrying...
/backport v22.3.x
/backport v22.2.x
FYI, I changed the release notes section of the PR body from none, since it looks like this PR changes how an API behaves.
/backport v22.1.x