rpk: topic describe/group describe should retry on partition NOT_LEADER errors
Background
On a system in a steady state, the current describe commands work fine. But if a partition is going through changes, rpk will tend to get NOT_LEADER errors when it e.g. queries offsets based on stale metadata, and include those errors in the list of partition data.
Problem
On systems with large numbers of partitions:
- Recovery from e.g. node outages can take many minutes, during which some partitions will be undergoing leadership transfers at any given time
- If the user sees errors in their describe output and runs it again, they're likely to get errors on the retry too (but for different partitions).
Suggested change
Keep a list of which partitions couldn't be queried during the describe operation. At the end, refresh metadata and retry querying those particular partitions, after some grace period (let's say two election periods). If they still have an error, then okay, output the error. But they probably won't: the leadership transfer of an individual partition is prompt even at scale.
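A rough sketch of that flow in Go (the helpers and the error sentinel are made-up stand-ins for whatever rpk does internally; only the shape of the retry pass is the point):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel for a NOT_LEADER response; rpk's real error handling differs.
var errNotLeader = errors.New("NOT_LEADER_FOR_PARTITION")

// queryOffsets stands in for the per-partition query rpk performs during describe.
func queryOffsets(partition int) error { return nil }

// refreshMetadata stands in for forcing a metadata reload.
func refreshMetadata() {}

// describeWithRetry queries every partition once, remembers which ones hit
// NOT_LEADER, then waits a grace period, refreshes metadata, and retries only
// those. Whatever error remains after the second pass is what gets reported.
func describeWithRetry(partitions []int, gracePeriod time.Duration) map[int]error {
	results := make(map[int]error, len(partitions))
	var failed []int

	for _, p := range partitions {
		if err := queryOffsets(p); errors.Is(err, errNotLeader) {
			failed = append(failed, p)
		} else {
			results[p] = err
		}
	}

	if len(failed) == 0 {
		return results
	}

	// Grace period of roughly two election periods, then one more pass.
	time.Sleep(gracePeriod)
	refreshMetadata()
	for _, p := range failed {
		results[p] = queryOffsets(p)
	}
	return results
}

func main() {
	fmt.Println(describeWithRetry([]int{0, 1, 2}, 2*time.Second))
}
```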
(confession: I'm partly motivated by wanting to just do fewer retries from my test code that uses rpk on large systems, but I think this is a genuine UX issue too for anyone who's operating systems with lots of partitions.)
When this happens, can you capture the logs with -v?
The client already retries, but only if the most recently loaded metadata is older than 1s. For groups, not_coordinator errors are always retried.
The original behavior of franz-go was to always retry, but this was removed because @weeco noticed in kowl that one down partition could cause the entire request to hang for 30s. The retries were removed with the understanding of "franz-go loads the metadata immediately before listing offsets (or whatever), so if things go stale within milliseconds, we just punt to the user". This was then changed again once I introduced some metadata caching internally: now the client caches metadata, and only retries if the metadata is older than 1s.
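Roughly, the gate looks like this (illustrative names only, not franz-go's internals):

```go
package main

import (
	"fmt"
	"time"
)

// shouldRefreshAndRetry illustrates the staleness gate described above: a
// NOT_LEADER error triggers a metadata refresh and retry only when the cached
// metadata is older than the threshold (1s in the current client).
func shouldRefreshAndRetry(lastMetadataLoad time.Time, threshold time.Duration) bool {
	return time.Since(lastMetadataLoad) > threshold
}

func main() {
	fmt.Println(shouldRefreshAndRetry(time.Now().Add(-2*time.Second), time.Second)) // true: retry
	fmt.Println(shouldRefreshAndRetry(time.Now(), time.Second))                     // false: punt to the user
}
```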
I'll try... currently this is happening in the middle of big automated runs, so it'll be a case of pausing something on a large system that's in a wonky state and grabbing the logs at just the right time.
Another alternative is to add a MetadataMinAge(10*time.Millisecond) and see if that makes this rarer (just checked, the internal lower limit is 10ms).
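For reference, a minimal sketch of what that would look like at the kgo client level (rpk wires the client up differently internally; the broker address here is a placeholder):

```go
package main

import (
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder broker address
		// Allow metadata refreshes as often as every 10ms (the internal lower
		// limit), so a stale-leader error can retry against fresh metadata sooner.
		kgo.MetadataMinAge(10*time.Millisecond),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()
}
```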
When addressing this, it would be good to consult the ways in which the test code currently wraps rpk to retry (https://github.com/redpanda-data/redpanda/pull/6175) and make those wrappers unnecessary.
This will be closed in v23.2, per the franz-go dep bump to https://github.com/twmb/franz-go/blob/master/CHANGELOG.md#v1133, which introduced "always retry on NotLeader for sharded requests"