kafka-python: Add a test that the consumer picks up the new coordinator when the group coordinator dies

I just had a failure case reported at work where a service endlessly spun:

2017-06-29 08:50:11,407 WARNING         base                    __call__:661    10627   139637316585216 Coordinator unknown during heartbeat -- will retry
2017-06-29 08:50:11,407 WARNING         base                    _handle_heartbeat_failure:692   10627   139637316585216 Heartbeat failed ([Error 15] GroupCoordinatorNotAvailableError); retrying
2017-06-29 08:50:11,508 WARNING         base                    __call__:661    10627   139637316585216 Coordinator unknown during heartbeat -- will retry
2017-06-29 08:50:11,508 WARNING         base                    _handle_heartbeat_failure:692   10627   139637316585216 Heartbeat failed ([Error 15] GroupCoordinatorNotAvailableError); retrying

Normally this indicates a cluster failure. However, from the ticket description it appears that the cluster became healthy again, but the consumer never recovered and kept logging these warnings for half an hour. Restarting the process immediately fixed the issue.

I wasn't directly involved; I was only called in as the Kafka expert after the fact, so it will likely be impossible to verify that the cluster was fully healthy.

However, we should have an end-to-end test of this scenario that brings up a cluster and consumer group with two processes, kills the broker that is the group coordinator, and verifies that the consumers rejoin successfully once the cluster moves the coordinator to a new broker.
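To make the scenario concrete, here is a minimal in-memory sketch of the sequence such a test would check. It does not use kafka-python or a real broker: `FakeCluster`, `FakeConsumer`, and `run_scenario` are hypothetical names invented for illustration, and the coordinator election is modeled as simply picking the lowest surviving broker id. A real end-to-end test would replace these with actual brokers and consumers.

```python
class FakeCluster:
    """Toy stand-in for a Kafka cluster (hypothetical, not a kafka-python class)."""

    def __init__(self, broker_ids):
        self.live = set(broker_ids)
        self.coordinator = min(self.live)

    def kill(self, broker_id):
        # Simulate a broker going down.
        self.live.discard(broker_id)

    def elect_coordinator(self):
        # Assumption: model the coordinator move as "lowest surviving broker id".
        self.coordinator = min(self.live)

    def find_coordinator(self):
        # Returns None while the coordinator still points at a dead broker.
        return self.coordinator if self.coordinator in self.live else None


class FakeConsumer:
    """Toy group member that heartbeats to the coordinator it last discovered."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.coordinator_id = cluster.find_coordinator()

    def heartbeat(self):
        """Return True if the heartbeat reached a live coordinator."""
        if self.coordinator_id not in self.cluster.live:
            # Drop the stale coordinator and look it up again.
            self.coordinator_id = self.cluster.find_coordinator()
        return self.coordinator_id is not None


def run_scenario():
    cluster = FakeCluster([0, 1, 2])
    consumers = [FakeConsumer(cluster), FakeConsumer(cluster)]

    old_coordinator = cluster.coordinator
    cluster.kill(old_coordinator)
    # Heartbeats fail while no live coordinator exists.
    during_outage = [c.heartbeat() for c in consumers]

    cluster.elect_coordinator()
    # Both consumers should recover without a restart.
    after_recovery = [c.heartbeat() for c in consumers]
    new_coordinators = [c.coordinator_id for c in consumers]
    return old_coordinator, during_outage, after_recovery, new_coordinators
```

The property the real test needs to assert is the last step: after the coordinator moves, every consumer's heartbeat succeeds again and points at a broker other than the one that was killed.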

Jun 30 '17 19:06 jeffwidman

I'm investigating a similar issue. At some point our Kafka consumer started logging the error messages below every 100ms, and it never successfully reconnected (we verified that the Kafka cluster was healthy). After being restarted, the Kafka consumer connected successfully.

Jul 19 18:09:17 2017-07-19 18:09:17,739 kafka.coordinator    WARNING  /var/www/mb/env/lib/python3.5/site-packages/kafka/coordinator/base.py:638 Coordinator unknown during heartbeat -- will retry
Jul 19 18:09:17 2017-07-19 18:09:17,739 kafka.coordinator    WARNING  /var/www/mb/env/lib/python3.5/site-packages/kafka/coordinator/base.py:669 Heartbeat failed ([Error 15] GroupCoordinatorNotAvailableError); retrying
Jul 19 18:09:17 2017-07-19 18:09:17,780 kafka.conn           ERROR    /var/www/mb/env/lib/python3.5/site-packages/kafka/conn.py:603 <BrokerConnection host=xxxxx.xxxx.xx/xx.xx.xx.xxx port=9092>: socket disconnected

Jul 20 '17 16:07 markerdmann

There were a few bugfixes to group coordinator reconnects in 1.3.3. The first thing to do if you see an error like this is to verify that you are on the latest release (@markerdmann, the line numbers in your logs suggest you're on 1.3.2). That doesn't mean this isn't still an issue on master though!
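A quick way to check the installed release from Python is sketched below. The helpers `parse_version` and `has_coordinator_fixes` are hypothetical names written for this example, and the comparison only looks at the leading numeric components, so a pre-release such as `1.3.3.dev` is treated the same as `1.3.3`.

```python
def parse_version(version):
    """Turn '1.3.2' or '1.3.4.dev' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component, e.g. 'dev'
        parts.append(int(piece))
    return tuple(parts)


def has_coordinator_fixes(version, minimum=(1, 3, 3)):
    """Return True if `version` is at least the release named in this thread."""
    return parse_version(version) >= minimum


if __name__ == "__main__":
    try:
        import kafka
        installed = getattr(kafka, "__version__", None)
    except ImportError:
        installed = None

    if installed is None:
        print("kafka-python does not appear to be installed")
    else:
        print(installed, "has the 1.3.3 fixes:", has_coordinator_fixes(installed))
```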

Jul 20 '17 17:07 dpkp

Thanks @dpkp. We'll upgrade to 1.3.3 and keep an eye on the logs. I'll let you know if we see the same issue again.

Jul 20 '17 17:07 markerdmann