
ListOffsets loop of failed requests on leader epoch change until timeout happens

Open emasab opened this issue 1 year ago • 0 comments

Description

ListOffsets requests for partitions with no committed offsets can be retried repeatedly, until the request times out, if the partition's leader epoch has changed. This happens because the request buffer is retried as-is instead of being recreated with the new CurrentLeaderEpoch received from the Metadata refresh call.

How to reproduce

Start consuming partitions that have no committed offset, or seek to the latest offset. A partition leader change should then happen that raises the current leader epoch above the cached one. The ListOffsets request fails with FENCED_LEADER_EPOCH and triggers a Metadata refresh, but the client then retries the buffer with the same CurrentLeaderEpoch, leading to a loop of failed requests.
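
The retry behaviour described above can be modelled with a small standalone sketch. This is not librdkafka code: every name here (`model_broker`, `model_request`, `model_client`, `model_query`, and the `rebuild_on_retry` flag) is hypothetical and exists only for illustration. It assumes the broker answers FENCED_LEADER_EPOCH whenever the request carries an epoch older than its own, and that a Metadata refresh updates the client's cached epoch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical error codes for this model; not librdkafka's own enum. */
typedef enum { MODEL_OK, MODEL_FENCED_LEADER_EPOCH } model_err;

/* Broker-side view: the partition's actual current leader epoch. */
typedef struct { int32_t leader_epoch; } model_broker;

/* Stands in for the serialized request buffer. */
typedef struct { int32_t current_leader_epoch; } model_request;

/* Client-side view: the leader epoch cached from the last Metadata response. */
typedef struct { int32_t cached_leader_epoch; } model_client;

/* Assumption: the broker fences any request carrying an older epoch. */
model_err model_list_offsets(const model_broker *b, const model_request *req) {
    return req->current_leader_epoch < b->leader_epoch
               ? MODEL_FENCED_LEADER_EPOCH
               : MODEL_OK;
}

/* Assumption: a Metadata refresh brings the cached epoch up to date. */
void model_metadata_refresh(model_client *c, const model_broker *b) {
    c->cached_leader_epoch = b->leader_epoch;
}

/*
 * Sends the request up to max_attempts times.
 * rebuild_on_retry == 0: retry the same buffer (the behaviour reported here).
 * rebuild_on_retry != 0: recreate the buffer from the refreshed cache.
 * Returns the attempt number that succeeded, or -1 if all attempts failed
 * (max_attempts stands in for the request timeout).
 */
int model_query(model_client *c, const model_broker *b,
                int rebuild_on_retry, int max_attempts) {
    assert(max_attempts > 0);
    model_request req = { c->cached_leader_epoch };

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (model_list_offsets(b, &req) == MODEL_OK)
            return attempt;

        /* The refresh succeeds and the cache is corrected... */
        model_metadata_refresh(c, b);

        /* ...but only a rebuilt request picks up the new epoch. */
        if (rebuild_on_retry)
            req.current_leader_epoch = c->cached_leader_epoch;
    }
    return -1;
}
```

In this model, a client with cached epoch 4 against a broker at epoch 5 exhausts every attempt when `rebuild_on_retry` is 0, even though its cache is corrected after the first failure. With `rebuild_on_retry` set, the same client succeeds on the second attempt.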

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • [x] librdkafka version (release number or git tag): 2.1.0+
  • [x] Apache Kafka version: any
  • [x] librdkafka client configuration: any
  • [x] Operating system: any
  • [ ] Provide logs (with debug=.. as necessary) from librdkafka
  • [ ] Provide broker log excerpts
  • [ ] Critical issue

emasab, Feb 20 '24 13:02