All partitions of a topic retry messages and trigger refreshMetadata when their broker goes down
Description
Scenario with a single topic: when a broker is abruptly terminated (for example, by a transient network issue), every partition hosted on that broker detects that its associated brokerProducer is unavailable and triggers a reconstruction process. During this recovery phase:

- **Redundant Metadata Requests:** Each partitionProducer independently issues a refreshMetadata call, producing duplicate metadata requests across the cluster. This overwhelms the controller and worsens latency, especially in large-scale deployments with high partition counts.
- **Retry Backoff Amplification:** Messages destined for these partitions enter a retry loop governed by retry.backoff.ms (default: 100ms); the sequential backoff delays compound into latency spikes, degrading throughput and causing erratic response times (see the configuration sketch after this list).
- **Partition-Scale Sensitivity:** The overhead is negligible for small partition counts (e.g., <50) but becomes severe in high-partition environments (e.g., >1000 partitions), where metadata storms and retry queues amplify resource contention.
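For context, a minimal sketch of where these knobs live in Sarama's `Config` is shown below. The values mirror the documented defaults (producer retry backoff 100ms, metadata retry backoff 250ms), and the broker address is a placeholder assumption, not the configuration from this report.

```go
package main

import (
	"log"
	"time"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V3_6_0_0

	// Producer retries: a failed message waits Producer.Retry.Backoff
	// (100ms by default, Sarama's counterpart of retry.backoff.ms) before
	// the next attempt, so many partitions retrying at once compound latency.
	cfg.Producer.Retry.Max = 3
	cfg.Producer.Retry.Backoff = 100 * time.Millisecond

	// Metadata refreshes: these settings control how refreshes triggered by a
	// lost broker are retried and how often background refreshes run.
	cfg.Metadata.Retry.Max = 3
	cfg.Metadata.Retry.Backoff = 250 * time.Millisecond
	cfg.Metadata.RefreshFrequency = 10 * time.Minute

	// Placeholder broker address; replace with your own bootstrap brokers.
	producer, err := sarama.NewAsyncProducer([]string{"b-1.example.com:9098"}, cfg)
	if err != nil {
		log.Fatalf("failed to start producer: %v", err)
	}
	defer producer.Close()
}
```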
Versions
| Sarama | Kafka | Go |
|---|---|---|
| v1.45.2 | 3.6.0 | 1.24.0 |
Configuration
Logs
2025-08-08T01:51:34.274Z INFO remote/client.go:52 producer/leader/logflow.cgi.click.nocharge/53 state change to [retrying-1]
2025-08-08T01:51:34.274Z INFO remote/client.go:52 producer/leader/logflow.cgi.click.nocharge/53 abandoning broker 2
2025-08-08T01:51:34.375Z INFO remote/client.go:52 client/metadata fetching metadata for [logflow.cgi.click.nocharge] from broker b-3.secretlogjoinmsk.5uf3bt.c9.kafka.us-east-1.amazonaws.com:9098
2025-08-08T01:51:34.376Z INFO remote/client.go:52 producer/leader/logflow.cgi.click.nocharge/53 selected broker 2
2025-08-08T01:51:34.376Z INFO remote/client.go:52 producer/broker/2 state change to [open] on logflow.cgi.click.nocharge/53
2025-08-08T01:51:34.376Z INFO remote/client.go:52 producer/leader/logflow.cgi.click.nocharge/53 state change to [flushing-1]
2025-08-08T01:51:34.376Z INFO remote/client.go:52 producer/leader/logflow.cgi.click.nocharge/53 state change to [normal]
2025-08-08T01:51:34.377Z INFO operator/collect_operator.go:70 -----------duration=102.495618ms-----
Additional Context
Recently, a patch was adopted (#3225) that collects metadata refresh requests, and this was also made the default (#3231).
I would advise not using `latest` as the Sarama version, since that will naturally advance over time and we won't know which version you were using at the time. It is also somewhat ambiguous: are you saying you're using the most recently released tag, or a pinned commit from main after the most recently released tag?
PS: The ambiguity matters because the most recently released tag predates the PRs mentioned above, while a pinned commit from main might already contain them.
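As an aside, "collecting metadata refresh requests" here means coalescing concurrent refreshes into a single outbound request. The sketch below illustrates that general pattern with golang.org/x/sync/singleflight; it is only an illustration of the idea, not the actual change from #3225/#3231, and the topic name is borrowed from the logs above purely for flavor.

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

// refreshMetadata stands in for an expensive cluster metadata fetch.
// It is a placeholder, not Sarama's actual implementation.
func refreshMetadata(topic string) (string, error) {
	time.Sleep(50 * time.Millisecond)
	return "metadata for " + topic, nil
}

func main() {
	var group singleflight.Group
	var wg sync.WaitGroup

	// Many partitionProducers asking to refresh the same topic at once:
	// singleflight collapses the concurrent calls into one fetch and shares
	// the result, which is the essence of coalescing refresh requests.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(partition int) {
			defer wg.Done()
			v, _, shared := group.Do("logflow.cgi.click.nocharge", func() (interface{}, error) {
				return refreshMetadata("logflow.cgi.click.nocharge")
			})
			fmt.Printf("partition %d got %q (shared=%v)\n", partition, v, shared)
		}(i)
	}
	wg.Wait()
}
```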
Oh, sorry, I used v1.45.2.
@Tangxinqi if you're able to do some testing after running `go get github.com/IBM/sarama@main && go mod tidy` to confirm that it resolves these issues for you, that'd be very welcome!
Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.