rpk: topic describe/group describe should retry on partition NOT_LEADER errors
Background
On a system in a steady state, the current describe commands work fine. But if a partition is going through changes, rpk will tend to get NOT_LEADER errors when it e.g. queries offsets based on stale metadata, and include those errors in the list of partition data.
Problem
On systems with large numbers of partitions:
- Recovery from e.g. node outages can take many minutes, during which some partitions will be undergoing leadership transfers at any given time
- If the user sees errors in their describe output and runs it again, they're likely to get errors on the retry too (but for different partitions).
Suggested change
Keep a list of which partitions couldn't be queried during the describe operation. At the end, refresh metadata and retry querying those particular partitions, after some grace period (let's say two election periods). If they still have an error, then okay, output the error. But they probably won't: the leadership transfer of an individual partition is prompt even at scale.
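A rough sketch of that flow in Go (the helpers and the error sentinel are made-up stand-ins for whatever rpk does internally; only the shape of the retry pass is the point):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel for a NOT_LEADER response; rpk's real error handling differs.
var errNotLeader = errors.New("NOT_LEADER_FOR_PARTITION")

// queryOffsets stands in for the per-partition query rpk performs during describe.
func queryOffsets(partition int) error { return nil }

// refreshMetadata stands in for forcing a metadata reload.
func refreshMetadata() {}

// describeWithRetry queries every partition once, remembers which ones hit
// NOT_LEADER, then waits a grace period, refreshes metadata, and retries only
// those. Whatever error remains after the second pass is what gets reported.
func describeWithRetry(partitions []int, gracePeriod time.Duration) map[int]error {
	results := make(map[int]error, len(partitions))
	var failed []int

	for _, p := range partitions {
		if err := queryOffsets(p); errors.Is(err, errNotLeader) {
			failed = append(failed, p)
		} else {
			results[p] = err
		}
	}

	if len(failed) == 0 {
		return results
	}

	// Grace period of roughly two election periods, then one more pass.
	time.Sleep(gracePeriod)
	refreshMetadata()
	for _, p := range failed {
		results[p] = queryOffsets(p)
	}
	return results
}

func main() {
	fmt.Println(describeWithRetry([]int{0, 1, 2}, 2*time.Second))
}
```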
(confession: I'm partly motivated by wanting to just do fewer retries from my test code that uses rpk on large systems, but I think this is a genuine UX issue too for anyone who's operating systems with lots of partitions.)
When this happens, can you capture the logs with -v?
The client already retries, but only if the most recently loaded metadata is older than 1s. For groups, not_coordinator errors are always retried.
The original behavior of franz-go was to always retry, but this was removed because @weeco noticed in kowl that one down partition could cause the entire request to hang for 30s. The retries were removed with the understanding of "franz-go loads the metadata immediately before listing offsets (or whatever), so if things go stale within milliseconds, we just punt to the user". This was then changed again once I introduced some metadata caching internally: now the client caches metadata, and only retries if the metadata is older than 1s.
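Roughly, the gate looks like this (illustrative names only, not franz-go's internals):

```go
package main

import (
	"fmt"
	"time"
)

// shouldRefreshAndRetry illustrates the staleness gate described above: a
// NOT_LEADER error triggers a metadata refresh and retry only when the cached
// metadata is older than the threshold (1s in the current client).
func shouldRefreshAndRetry(lastMetadataLoad time.Time, threshold time.Duration) bool {
	return time.Since(lastMetadataLoad) > threshold
}

func main() {
	fmt.Println(shouldRefreshAndRetry(time.Now().Add(-2*time.Second), time.Second)) // true: retry
	fmt.Println(shouldRefreshAndRetry(time.Now(), time.Second))                     // false: punt to the user
}
```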
I'll try... currently this is happening in the middle of big automated runs, so it'll be a case of pausing something on a large system that's in a wonky state and grabbing the logs at just the right time.
Another alternative is to add a MetadataMinAge(10*time.Millisecond) and see if that makes this rarer (just checked, the internal lower limit is 10ms).
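For reference, a minimal sketch of what that would look like at the kgo client level (rpk wires the client up differently internally; the broker address here is a placeholder):

```go
package main

import (
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder broker address
		// Allow metadata refreshes as often as every 10ms (the internal lower
		// limit), so a stale-leader error can retry against fresh metadata sooner.
		kgo.MetadataMinAge(10*time.Millisecond),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()
}
```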
When addressing this, it would be good to consult the ways in which the test code currently wraps rpk to retry (https://github.com/redpanda-data/redpanda/pull/6175) and make those wrappers unnecessary.
This will be closed in v23.2, per the franz-go dep bump to https://github.com/twmb/franz-go/blob/master/CHANGELOG.md#v1133, which introduced "always retry on NotLeader for sharded requests"