sarama
During network partition drill, producer keeps trying to contact an unreachable broker
Versions
Sarama | Kafka | Go |
---|---|---|
v1.27.2 | 2.6.3 | 1.16 |
Configuration
What configuration values are you using for Sarama and Kafka?
c.Net.DialTimeout = 10 * time.Second
c.Net.ReadTimeout = 10 * time.Second
c.Net.WriteTimeout = 10 * time.Second
c.ChannelBufferSize = 128
c.Producer.Retry.Max = 5
c.Producer.Retry.Backoff = 200 * time.Millisecond
c.Net.KeepAlive = 60 * time.Second
c.Producer.Return.Successes = true
c.Version = sarama.V2_1_0_0
c.Metadata.Timeout = 20 * time.Second
Problem Description
This happened in the context of a network drill: traffic to/from a set of brokers was blackholed (via `iptables`). At the Kafka cluster level everything went as expected (new leaders were elected to replace the isolated brokers).
In one of the programs using Sarama to produce to the cluster, we saw constant failures to produce. During the post-mortem, I saw a stream of these messages in the logs (roughly one every 2 seconds):
[Sarama] 2022/02/10 14:19:22 client/metadata fetching metadata for [<some topic>] from broker isolated-broker:9092
Where `<some topic>` varied, and `isolated-broker` was always the same host, one of the brokers isolated for the drill. This went on for ~30 minutes, always trying to contact the same isolated broker, until the drill was interrupted.
What I would have expected is to see:
- metadata request failing after a timeout
- client code cycling through all seed brokers when trying to fetch metadata
As far as I can tell, the issue is that metadata requests never timed out, and as a result `deregisterBroker()` was not called, and `any()` kept returning the first broker in the list, which happened to be the isolated broker.
I think the underlying issue is that `Broker.write()` calls `SetWriteDeadline()` on every write, and if writes are issued often enough, the repeated calls to `SetWriteDeadline()` will prevent any of the writes from timing out. This makes `config.Net.WriteTimeout` a no-op for this case. Now, since writes never complete (either with success or failure), the code trying to fetch the metadata is stuck in the send, which means the code that should evaluate the metadata fetch timeout is never executed.
@mbarbon thanks for this bug report. Is your network partition drill easy for you to re-run? It would be great if you could test the latest release of Sarama (v1.33.0), as we have made a few fixes in this area since v1.27.2.
Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.