All partitions of a topic retry messages and trigger refreshMetadata when their broker goes down
Description
Scenario with a single topic: when a broker is abruptly terminated (for example, by a transient network issue), every partition hosted on that broker detects that its associated brokerProducer is unavailable and triggers a reconstruction process. During this recovery phase:

- **Redundant Metadata Requests:** Each partitionProducer independently issues a refreshMetadata call, producing duplicate metadata requests across the cluster. This overwhelms the controller and worsens latency, especially in large-scale deployments with high partition counts.
- **Retry Backoff Amplification:** Messages destined for these partitions enter a retry loop governed by retry.backoff.ms (default: 100ms); the sequential backoff delays compound into latency spikes, degrading throughput and causing erratic response times (see the configuration sketch after this list).
- **Partition-Scale Sensitivity:** The overhead is negligible for small partition counts (e.g., <50) but becomes severe in high-partition environments (e.g., >1000 partitions), where metadata storms and retry queues amplify resource contention.
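For context, a minimal sketch of where these knobs live in Sarama's `Config` is shown below. The values mirror the documented defaults (producer retry backoff 100ms, metadata retry backoff 250ms), and the broker address is a placeholder assumption, not the configuration from this report.

```go
package main

import (
	"log"
	"time"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V3_6_0_0

	// Producer retries: a failed message waits Producer.Retry.Backoff
	// (100ms by default, Sarama's counterpart of retry.backoff.ms) before
	// the next attempt, so many partitions retrying at once compound latency.
	cfg.Producer.Retry.Max = 3
	cfg.Producer.Retry.Backoff = 100 * time.Millisecond

	// Metadata refreshes: these settings control how refreshes triggered by a
	// lost broker are retried and how often background refreshes run.
	cfg.Metadata.Retry.Max = 3
	cfg.Metadata.Retry.Backoff = 250 * time.Millisecond
	cfg.Metadata.RefreshFrequency = 10 * time.Minute

	// Placeholder broker address; replace with your own bootstrap brokers.
	producer, err := sarama.NewAsyncProducer([]string{"b-1.example.com:9098"}, cfg)
	if err != nil {
		log.Fatalf("failed to start producer: %v", err)
	}
	defer producer.Close()
}
```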
Versions
| Sarama | Kafka | Go |
|---|---|---|
| v1.45.2 | 3.6.0 | 1.24.0 |
Configuration
Logs
2025-08-08T01:51:34.274Z INFO remote/client.go:52 producer/leader/logflow.cgi.click.nocharge/53 state change to [retrying-1]
2025-08-08T01:51:34.274Z INFO remote/client.go:52 producer/leader/logflow.cgi.click.nocharge/53 abandoning broker 2
2025-08-08T01:51:34.375Z INFO remote/client.go:52 client/metadata fetching metadata for [logflow.cgi.click.nocharge] from broker b-3.secretlogjoinmsk.5uf3bt.c9.kafka.us-east-1.amazonaws.com:9098
2025-08-08T01:51:34.376Z INFO remote/client.go:52 producer/leader/logflow.cgi.click.nocharge/53 selected broker 2
2025-08-08T01:51:34.376Z INFO remote/client.go:52 producer/broker/2 state change to [open] on logflow.cgi.click.nocharge/53
2025-08-08T01:51:34.376Z INFO remote/client.go:52 producer/leader/logflow.cgi.click.nocharge/53 state change to [flushing-1]
2025-08-08T01:51:34.376Z INFO remote/client.go:52 producer/leader/logflow.cgi.click.nocharge/53 state change to [normal]
2025-08-08T01:51:34.377Z INFO operator/collect_operator.go:70 -----------duration=102.495618ms-----
Additional Context
Recently, a patch was adopted (#3225) that collects metadata refresh requests, and this was also made the default (#3231).
I would advise not using `latest` as the Sarama version, since that will naturally advance over time and we won't know which version you were using at the time. It is also somewhat ambiguous: are you saying you're using the most recently released tag, or a pinned commit from main after the most recently released tag?
PS: The ambiguity matters because the most recently released tag predates the PRs mentioned above, while a pinned commit from main might already contain them.
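As an aside, "collecting metadata refresh requests" here means coalescing concurrent refreshes into a single outbound request. The sketch below illustrates that general pattern with golang.org/x/sync/singleflight; it is only an illustration of the idea, not the actual change from #3225/#3231, and the topic name is borrowed from the logs above purely for flavor.

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

// refreshMetadata stands in for an expensive cluster metadata fetch.
// It is a placeholder, not Sarama's actual implementation.
func refreshMetadata(topic string) (string, error) {
	time.Sleep(50 * time.Millisecond)
	return "metadata for " + topic, nil
}

func main() {
	var group singleflight.Group
	var wg sync.WaitGroup

	// Many partitionProducers asking to refresh the same topic at once:
	// singleflight collapses the concurrent calls into one fetch and shares
	// the result, which is the essence of coalescing refresh requests.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(partition int) {
			defer wg.Done()
			v, _, shared := group.Do("logflow.cgi.click.nocharge", func() (interface{}, error) {
				return refreshMetadata("logflow.cgi.click.nocharge")
			})
			fmt.Printf("partition %d got %q (shared=%v)\n", partition, v, shared)
		}(i)
	}
	wg.Wait()
}
```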
Oh, sorry, I used v1.45.2.
@Tangxinqi if you're able to do some testing after running `go get github.com/IBM/sarama@main && go mod tidy` to confirm that it resolves these issues for you, that'd be very welcome!
Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.