Add a test that the consumer picks up the new coordinator when the group coordinator dies
I just had a failure case reported at work where a service endlessly spun:
2017-06-29 08:50:11,407 WARNING base __call__:661 10627 139637316585216 Coordinator unknown during heartbeat -- will retry
2017-06-29 08:50:11,407 WARNING base _handle_heartbeat_failure:692 10627 139637316585216 Heartbeat failed ([Error 15] GroupCoordinatorNotAvailableError); retrying
2017-06-29 08:50:11,508 WARNING base __call__:661 10627 139637316585216 Coordinator unknown during heartbeat -- will retry
2017-06-29 08:50:11,508 WARNING base _handle_heartbeat_failure:692 10627 139637316585216 Heartbeat failed ([Error 15] GroupCoordinatorNotAvailableError); retrying
Normally this indicates a cluster failure. However, according to the ticket description, the cluster became healthy again but the consumer never recovered and kept logging these warnings for half an hour. Restarting the process fixed the issue immediately.
I wasn't directly involved; I was only called in as the Kafka expert after the fact, so it will likely be impossible to verify that the cluster was fully healthy.
Either way, we should have an end-to-end test of this scenario: bring up a cluster and a consumer group with two processes, kill the broker that is the group coordinator, and verify that the consumers rejoin successfully once the cluster moves the coordinator to a new broker.
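To make the shape of that test concrete, here is a rough sketch of the steps as a toy, in-memory model. `FakeCluster`, `FakeConsumer`, and `run_scenario` are hypothetical stand-ins invented for illustration, not kafka-python classes. A real end-to-end test would presumably replace them with an actual broker fixture and real consumers.

```python
# Toy model of the coordinator-failover scenario. Everything here is a
# hypothetical stand-in; nothing talks to a real Kafka cluster.

class CoordinatorNotAvailable(Exception):
    """Stands in for Kafka error 15 (GroupCoordinatorNotAvailableError)."""


class FakeCluster:
    def __init__(self, broker_ids):
        self.alive = set(broker_ids)
        self.coordinator = min(self.alive)
        self.members = set()

    def kill(self, broker_id):
        self.alive.discard(broker_id)

    def elect_coordinator(self):
        # Models the cluster moving the group to a surviving broker;
        # members have to rejoin through the new coordinator.
        self.coordinator = min(self.alive)
        self.members.clear()

    def find_coordinator(self):
        if self.coordinator not in self.alive:
            raise CoordinatorNotAvailable()
        return self.coordinator

    def join(self, broker_id, member):
        if broker_id != self.find_coordinator():
            raise CoordinatorNotAvailable()
        self.members.add(member)

    def heartbeat(self, broker_id, member):
        stale = broker_id not in self.alive or broker_id != self.coordinator
        if stale or member not in self.members:
            raise CoordinatorNotAvailable()


class FakeConsumer:
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.coordinator_id = None

    def poll_once(self):
        """One heartbeat attempt; True means the consumer is in the group."""
        try:
            if self.coordinator_id is None:
                self.coordinator_id = self.cluster.find_coordinator()
                self.cluster.join(self.coordinator_id, self.name)
            self.cluster.heartbeat(self.coordinator_id, self.name)
            return True
        except CoordinatorNotAvailable:
            # Forget the coordinator so the next attempt rediscovers it.
            self.coordinator_id = None
            return False


def run_scenario():
    cluster = FakeCluster([0, 1, 2])
    consumers = [FakeConsumer("c1", cluster), FakeConsumer("c2", cluster)]
    assert all([c.poll_once() for c in consumers])

    old = cluster.coordinator
    cluster.kill(old)
    # With the coordinator dead, heartbeats are expected to fail.
    assert not any([c.poll_once() for c in consumers])

    cluster.elect_coordinator()
    # Bounded retries: the reported bug would show up here as a timeout.
    for _ in range(10):
        if all([c.poll_once() for c in consumers]):
            break
    else:
        raise AssertionError("consumers never rejoined the group")
    return old, cluster.coordinator, sorted(cluster.members)
```

The important part is the bounded retry loop at the end: the failure described above would show up as the consumers never rejoining, so the test should fail after a fixed number of attempts rather than spin forever.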
I'm investigating a similar issue. At some point our Kafka consumer started logging the error messages below every 100ms, and it never successfully reconnected (we verified that the Kafka cluster was healthy). After being restarted, the Kafka consumer connected successfully.
Jul 19 18:09:17 2017-07-19 18:09:17,739 kafka.coordinator WARNING /var/www/mb/env/lib/python3.5/site-packages/kafka/coordinator/base.py:638 Coordinator unknown during heartbeat -- will retry
Jul 19 18:09:17 2017-07-19 18:09:17,739 kafka.coordinator WARNING /var/www/mb/env/lib/python3.5/site-packages/kafka/coordinator/base.py:669 Heartbeat failed ([Error 15] GroupCoordinatorNotAvailableError); retrying
Jul 19 18:09:17 2017-07-19 18:09:17,780 kafka.conn ERROR /var/www/mb/env/lib/python3.5/site-packages/kafka/conn.py:603 <BrokerConnection host=xxxxx.xxxx.xx/xx.xx.xx.xxx port=9092>: socket disconnected
There were a few bugfixes to group coordinator reconnects in 1.3.3. The first thing to do if you see an error like this is to verify that you are on the latest release (@markerdmann, the line numbers in your logs suggest you're on 1.3.2). That doesn't mean this isn't still an issue on master, though!
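For anyone who wants to check this programmatically, a minimal sketch follows. The helper names `parse_version` and `has_coordinator_fixes` are made up for illustration, and the parsing is deliberately simple: it only compares the leading numeric components of the version string.

```python
def parse_version(version):
    """Turn '1.3.2' into (1, 3, 2); stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_coordinator_fixes(installed, minimum="1.3.3"):
    """True if the installed version is at least the release with the fixes."""
    return parse_version(installed) >= parse_version(minimum)
```

Assuming the installed package exposes its version string as `kafka.__version__`, calling `has_coordinator_fixes(kafka.__version__)` would tell you whether you are on 1.3.3 or later.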
Thanks @dpkp. We'll upgrade to 1.3.3 and keep an eye on the logs. I'll let you know if we see the same issue again.