raft: amend error code when leadership transfer can't proceed due to recovery

Open dlex opened this issue 3 years ago • 1 comments

Cover letter

In do_transfer_leadership(), if a follower is still not caught up after prepare_transfer_leadership() is done, a timeout was returned. However, it's not really a timeout, it's a flap (we thought recovery was done, but it wasn't). This commit changes the error to exponential_backoff so that the admin API returns a 503 (please retry) for that case rather than a 504 (we couldn't do it in time).
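For a caller of the admin API, the practical difference is that a 503 is worth retrying after a pause while a 504 is not. Below is a minimal client-side sketch in Python of how that distinction might be handled. It is not code from this PR: `transfer_with_retry`, its parameters, and the `attempt` callable (which stands in for whatever issues the leadership transfer request and returns the HTTP status) are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable

RETRYABLE = 503  # Service Unavailable: transient condition, try again later
TIMED_OUT = 504  # Gateway Timeout: the operation did not finish in time


def transfer_with_retry(
    attempt: Callable[[], int],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call attempt() until it returns something other than 503.

    `attempt` is a hypothetical callable that performs one leadership
    transfer request and returns the HTTP status code. The delay doubles
    after each 503. Returns the last status seen.
    """
    status = RETRYABLE
    for i in range(max_attempts):
        status = attempt()
        if status != RETRYABLE:
            # Success, 504, or any other status: retrying is not implied.
            return status
        if i + 1 < max_attempts:
            sleep(base_delay_s * 2 ** i)
    return status
```

With this shape, a flap like the one described above costs the caller a short wait and another attempt, while a genuine 504 is surfaced immediately.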

Fixes #6902

This is not a fix for the root cause of the issue, only a change in how the error is interpreted.

Backport Required

  • [ ] not a bug fix
  • [ ] issue does not exist in previous branches
  • [ ] papercut/not impactful enough to backport
  • [x] v22.3.x
  • [x] v22.2.x
  • [x] v22.1.x

UX changes

  • none

Release notes

  • none

dlex avatar Nov 15 '22 23:11 dlex

I amended the title to be a bit more succinct.

Did you mean to mark this as Fixes https://github.com/redpanda-data/redpanda/issues/6902, or are you still looking at the root cause of how we exited recovery with follower stats not up to date?

jcsp avatar Nov 16 '22 20:11 jcsp

I was looking into the failure of the k8s-operator CI tests (I only got limited support from devprod on that), but I have been unable to identify the root cause so far. However, I strongly think that the failure is unrelated, so retrying...

dlex avatar Nov 21 '22 22:11 dlex

/backport v22.3.x

dlex avatar Nov 22 '22 16:11 dlex

/backport v22.2.x

dlex avatar Nov 22 '22 21:11 dlex

fyi i changed the release notes section of the pr body from none since it looks like this PR changes how an api behaves.

andrewhsu avatar Nov 23 '22 14:11 andrewhsu

/backport v22.1.x

dlex avatar Nov 24 '22 23:11 dlex