
Consumer pods go down

Open · senthil-orange opened this issue 4 years ago · 2 comments

Description

We run 6 Kafka consumers in a cluster with 6 pods in GCP. Is there a way to detect when any of the pods goes down, which makes the throughput to BigQuery very slow? Once in a while we notice that two pods go down and the message lag for those two pods is high; rescaling the pods fixes the issue. We need to quit the process when a pod goes down so that we can restart it. Also, is there a Kafka connection return code that we can catch in a try/except?

Config:

```python
self.consumer = Consumer({
    'bootstrap.servers': KAFKA_BOOTSTRAP_SERVER,
    'broker.version.fallback': '0.10.0.0',
    'api.version.fallback.ms': 0,
    'sasl.mechanisms': 'PLAIN',
    'security.protocol': 'SASL_SSL',
    'sasl.username': KAFKA_USERNAME,
    'sasl.password': KAFKA_PASSWORD,
    'group.id': KAFKA_GROUP_ID,
    'auto.offset.reset': 'earliest',
    'enable.auto.offset.store': False,
    'ssl.ca.location': KAFKA_SSL_CA_LOCATION,
})
```

— senthil-orange, Mar 16 '21

In the event the consumer is unable to contact any of the brokers, an `_ALL_BROKERS_DOWN` error will be reported, but the client will try to recover automatically. After the retries have been exhausted, you'll need to decide what you want to do at that point. Here is an example of how to catch this error with the Python client: https://github.com/confluentinc/confluent-kafka-python/blob/master/tests/test_Consumer.py#L84
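For instance, here is a minimal sketch (assuming the `KAFKA_*` placeholders from the config above) that surfaces this error through the `error_cb` configuration property; exiting the process so the pod's restart policy can recycle it is just one possible reaction:

```python
import sys
from confluent_kafka import Consumer, KafkaError

def on_error(err):
    # error_cb receives client-level (non-message) errors. The client
    # keeps retrying on its own, but _ALL_BROKERS_DOWN means every
    # known broker is currently unreachable.
    print(f"kafka error: {err}", file=sys.stderr)
    if err.code() == KafkaError._ALL_BROKERS_DOWN:
        # Exit non-zero so Kubernetes restarts the pod.
        sys.exit(1)

consumer = Consumer({
    'bootstrap.servers': KAFKA_BOOTSTRAP_SERVER,  # placeholder from the config above
    'group.id': KAFKA_GROUP_ID,                   # placeholder from the config above
    'error_cb': on_error,
})
```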

— jliunyu, Mar 15 '22

Just to clarify: the client will try to reconnect indefinitely; there is no maximum retry count. As soon as the cluster becomes available again, the client will be able to connect.

Most problems like this are due to networking issues / containers not being properly routed.
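Since the client recovers on its own, a watchdog around the poll loop is one option for making a stalled pod quit so it can be restarted. This is an illustrative sketch, not something from this thread; `consumer` is the Consumer configured above and `MAX_IDLE_SECONDS` is an assumed threshold:

```python
import sys
import time

MAX_IDLE_SECONDS = 300  # assumed threshold; tune to your traffic
last_progress = time.monotonic()

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        # No message within the poll timeout; check how long we've stalled.
        if time.monotonic() - last_progress > MAX_IDLE_SECONDS:
            print("no progress; exiting so the pod can be restarted", file=sys.stderr)
            sys.exit(1)
        continue
    if msg.error():
        # Per-message errors (e.g. authorization failures) arrive here.
        print(f"consume error: {msg.error()}", file=sys.stderr)
        continue
    # ... process msg, then record progress. With
    # 'enable.auto.offset.store': False (as in the config above), you
    # would also call consumer.store_offsets(msg) after processing.
    last_progress = time.monotonic()
```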

— edenhill, Mar 30 '22

Closing, as the question has already been answered.

— pranavrth, Mar 12 '24