Fetching watermark when some partition leader is not available
Hello. I am reporting a minor bug I encountered while using Kminion.
Bug Scenario
- A topic with replication.factor set to 1, or a partition that only has one ISR left
- A broker hosting the leader of the above topic-partition goes down, causing a LEADER_NOT_AVAILABLE error for the relevant topic-partition
- This issue prevents the collection of metrics for not just the affected partition, but for all topic-partitions across the cluster
Suspected Cause
It seems that the ListOffsets function in the minion/list_offsets.go file is the culprit. There appears to be a slight issue in the code that sends requests and handles errors.
From what I've observed, the RequestsWith function from franz-go used here returns the first error it encounters when processing bulk requests. This means that an error returned by RequestsWith does not necessarily imply that the entire request has failed.
As a result, because ListOffsets returns as soon as RequestsWith reports an error, the error handling code for individual topic-partitions never runs, and no metrics are collected for any topic-partition.
In my case, commenting out the part where the error is returned resolved the issue and allowed for normal operations.
Please review this issue. Thank you.
Thanks for filing this issue. I think your analysis is correct, and I have filed a PR for this.