Why does the Kafka service become unresponsive and cause `read tcp <host>-><host>: i/o timeout`?
Sarama Version: v1.43.3
Kafka Version: v2.7.2
Go Version: v1.21
Hello everyone. I'm running into an issue where, during pod rolling restarts (i.e., while the group rebalances), calls to `sarama.ConsumerGroup.Consume` sometimes fail with `read tcp xxx->xxx:9092: i/o timeout`.
I'm aware this looks similar to #1192, but neither that issue nor anything I found online gives a concrete explanation of the cause, and I'd like to understand it, because there are a few peculiarities about my case:
- It only occurs in the production environment and cannot be reproduced in the development environment (and since the cause is unknown, it's hard to reproduce in a targeted way).
- Other services that use the same initialization and consumption logic do not hit this error. Moreover, this service binds about 30 topics to the same consumer group and runs 8 pods.
- This service previously had no issues, but it suddenly started hitting this error after a certain release, and now the error occurs with very high probability.
Information that can be provided:
- There are 3 pods, and these 3 pods have been running stably for at least 5 minutes. Then, a new pod is started, and the error begins to occur.
- Each pod has a group named `aaa` that consumes 6 topics, and a separate client is created for each topic; that is, `sarama.NewConsumerGroup()` is called once per topic (see the sketch after this list).
- There are also about 10 other clients consuming 10 different topics under 10 different group names.
- During startup, the multiple consumer groups (`sarama.NewConsumerGroup()`) within a single pod are initialized asynchronously and all start at essentially the same point in time.
- In the development environment, increasing the number of pods to 10 and restarting the service multiple times did not reproduce the error.
- Each topic has 12 partitions.
- During the error period, the Kafka server's CPU and memory usage were relatively low, and the network was stable.
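To make the setup concrete, here is a minimal sketch of how the consumers are created and run. The broker address, topic names, and handler are placeholders rather than the real service code; the point is only the shape: one `sarama.ConsumerGroup` per topic, all bound to group `aaa`, started concurrently, each running the usual `Consume` loop.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/IBM/sarama"
)

// handler is a placeholder sarama.ConsumerGroupHandler.
type handler struct{}

func (handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }
func (handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		sess.MarkMessage(msg, "")
	}
	return nil
}

// newConfig mirrors the config shown later in this issue.
func newConfig() *sarama.Config {
	config := sarama.NewConfig()
	config.Consumer.Return.Errors = true
	config.Version = sarama.V2_5_0_0
	config.Consumer.Group.Rebalance.Strategy = sarama.BalanceStrategyRange
	config.Consumer.Offsets.Initial = sarama.OffsetOldest
	config.Metadata.Timeout = time.Second * 3
	return config
}

func main() {
	brokers := []string{"xxx:9092"}                                               // placeholder
	topics := []string{"topic-1", "topic-2", "topic-3", "topic-4", "topic-5", "topic-6"} // placeholders

	// One ConsumerGroup client per topic, all bound to group "aaa",
	// started at roughly the same time.
	for _, topic := range topics {
		topic := topic
		go func() {
			cg, err := sarama.NewConsumerGroup(brokers, "aaa", newConfig())
			if err != nil {
				log.Fatal(err)
			}
			defer cg.Close()

			// Drain the Errors() channel (Return.Errors is enabled); the
			// "read tcp ...:9092: i/o timeout" errors show up here too.
			go func() {
				for err := range cg.Errors() {
					log.Println("ERROR", err)
				}
			}()

			ctx := context.Background()
			for {
				// Consume returns after each rebalance, so it is called in a loop.
				if err := cg.Consume(ctx, []string{topic}, handler{}); err != nil {
					log.Println("ERROR", err)
				}
			}
		}()
	}

	select {} // block forever (placeholder for the real service lifecycle)
}
```

The timeout errors surface either as the return value of `Consume` or on the `Errors()` channel, since `Consumer.Return.Errors` is set to true.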
Relevant client-side logs, organized:
Apr 14, 2025 @ 14:48:08.361 {"saramaLog":"Connected to broker at xxx:9092 (unregistered)\n"}
Apr 14, 2025 @ 14:48:08.369 {"saramaLog":"client/brokers registered new broker #1 at xxx:9092"}
Apr 14, 2025 @ 14:48:09.210 {"saramaLog":"client/metadata fetching metadata for all topics from broker xxx:9092\n"}
Apr 14, 2025 @ 14:48:09.229 {"saramaLog":"client/metadata fetching metadata for [topic] from broker xxx2:9092\n"}
Apr 14, 2025 @ 14:48:09.239 {"saramaLog":"Connected to broker at xxx:9092 (registered as #1)\n"}
Apr 14, 2025 @ 14:48:39.244 {"saramaLog":"Closed connection to broker xxx:9092\n"}
Apr 14, 2025 @ 14:48:39.345 ERROR read tcp xxx:46394->xxx:9092: i/o timeout
This is the server-side log:
[2025-04-14 14:48:09,243] INFO [GroupCoordinator 1]: Preparing to rebalance group aaa in state PreparingRebalance with old generation 816 (__consumer_offsets-36) (reason: Adding new member aaa-1ed78c04-d326-450d-906f-4d1c0c57a567 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
The groups reporting errors are all `aaa`; no other consumer group reports errors. The Sarama config is as follows, with all other settings left at their defaults:
config := sarama.NewConfig()
config.Consumer.Return.Errors = true                                   // deliver consumer errors on the Errors() channel
config.Version = sarama.V2_5_0_0                                       // protocol version used against the 2.7.2 brokers
config.Consumer.Group.Rebalance.Strategy = sarama.BalanceStrategyRange // range partition assignment
config.Consumer.Offsets.Initial = sarama.OffsetOldest                  // start from the oldest offset when no commit exists
config.Metadata.Timeout = time.Second * 3                              // metadata requests time out after 3s
From the logs, the connection is closed 30 seconds after it was established. This is most likely still during the JoinGroup phase: the Kafka broker did not respond in time, and Sarama actively closed the connection once `config.Net.ReadTimeout` (30 seconds by default) expired.
But why does the rebalance take more than 30 seconds without a response?
What I know is that during the JoinGroup phase, the Kafka broker waits for a certain period; once that period expires or all members have joined, it immediately responds to the join requests and the group proceeds to the SyncGroup phase.
This waiting time is configured on the broker side as group.initial.rebalance.delay.ms, which defaults to 3 seconds; my production Kafka also uses 3 seconds.
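If I understand correctly, though, group.initial.rebalance.delay.ms only applies when the group is empty; when a new member joins an already-running group (as in the broker log above, old generation 816), the coordinator instead waits for the existing members to rejoin, bounded by the rebalance timeout that each client sends in its JoinGroup request. For reference, a minimal sketch printing the Sarama client-side defaults involved; the values come from `sarama.NewConfig()` and are not specific to my service:

```go
package main

import (
	"fmt"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()

	// Read deadline on every broker connection; if a response (e.g. JoinGroup)
	// takes longer than this, the client sees "i/o timeout".
	fmt.Println(cfg.Net.ReadTimeout) // 30s

	// Rebalance timeout sent to the coordinator in JoinGroup; the broker may
	// hold the response up to this long while waiting for members to rejoin.
	fmt.Println(cfg.Consumer.Group.Rebalance.Timeout) // 1m0s

	// Session timeout and heartbeat interval for group membership.
	fmt.Println(cfg.Consumer.Group.Session.Timeout)    // 10s
	fmt.Println(cfg.Consumer.Group.Heartbeat.Interval) // 3s
}
```

Since the default rebalance timeout (60s) is larger than the default read timeout (30s), a JoinGroup response that is held while slower members rejoin could outlive the read deadline, which would match the 30-second gap in the logs above.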
Thank you for your help!
Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.
@juwell are you still seeing this problem today or did you resolve it? It certainly does seem like your client was exceeding the read timeout (c.Net.ReadTimeout = 30 * time.Second) talking to that particular broker.
If you have access to the Kafka cluster JMX metrics then you can look at (e.g.,) kafka.network:type=RequestMetrics,name=TotalTimeMs,request=JoinGroup to see if your brokers are congested and struggling to roundtrip JoinGroup requests in time.
Did you also experiment with bumping the ReadTimeout up to (e.g.,) 90 seconds to confirm if waiting longer would have helped and a response was eventually received from the broker, or whether the request seemed to be getting dropped/lost?
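For reference, that experiment would just be a one-line change on top of the config you posted (90 seconds is an arbitrary value, chosen only to exceed Sarama's default 60-second rebalance timeout):

```go
// Allow broker responses, such as a JoinGroup held open during a slow
// rebalance, to take longer than the 30s default before the connection
// is torn down with an i/o timeout.
config.Net.ReadTimeout = 90 * time.Second
```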
Have you solved this problem? I'm encountering the same issue, and it occurs very frequently. There are many topics in this Kafka cluster, but only one topic keeps triggering the i/o timeout and causing a rebalance.