confluent-kafka-python
ALL_BROKERS_DOWN on Producer
Description
We have spotted the following behavior after moving to version 2.5.0.
After regular AWS MSK maintenance, where all brokers are restarted one by one, we see the following errors in the logs:

```
cimpl.KafkaException: KafkaError{code=_ALL_BROKERS_DOWN,val=-187,str="3/3 brokers are down"}
```

and

```
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="ssl://b-3.XXX.kafka.XXX.amazonaws.com:9094/3: Disconnected (after 1091448ms in state UP)"}
```
These errors occur on the producer side, occasionally even 2-3 days after the broker restart. The behavior also appears during a Kubernetes deployment restart, when flush is called on the producer.
The issue only appears after the broker restarts that happen during regular MSK maintenance.
The issue is also present in version 2.4.0; before that version, we did not encounter this behavior.
Once the application is restarted on K8s, the issue goes away.
How to reproduce
1. Restart one broker on the AWS MSK cluster (3 brokers).
2. Restart the K8s deployment (application with the producer side).
Checklist
Please provide the following information:
- [ ] confluent-kafka-python and librdkafka version (`confluent_kafka.version()` and `confluent_kafka.libversion()`): 2.5.0 / 2.5.0
- [ ] Apache Kafka broker version: 3.5.1
- [ ] Client configuration:

```python
{
    "queue.buffering.max.messages": settings.KAFKA_PRODUCER_QUEUE_COUNT,
    "queue.buffering.max.kbytes": settings.KAFKA_PRODUCER_QUEUE_BUFF_KBYTES,
    "linger.ms": settings.KAFKA_PRODUCER_LINGER,
    "bootstrap.servers": settings.KAFKA_BROKERS,
    "enable.idempotence": True,
    "acks": "all",
    "delivery.timeout.ms": settings.KAFKA_PRODUCER_DELIVERY_TIMEOUT_MS,
    "security.protocol": "SSL",
    "error_cb": error_cb,
}
```

- [ ] Operating system:
- [ ] Provide client logs (with `'debug': '..'` as necessary)
- [ ] Provide broker log excerpts
- [x] Critical issue
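For context, a minimal sketch of the configuration and an `error_cb` along the lines of the one referenced above. The `settings.*` values are replaced with illustrative placeholders, and the callback's behavior is an assumption rather than the reporter's actual code; note that librdkafka surfaces `_ALL_BROKERS_DOWN` and `_TRANSPORT` through `error_cb` as non-fatal, retriable errors, so only `err.fatal()` should escalate.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("producer")


def error_cb(err):
    # Invoked by librdkafka for global errors such as _ALL_BROKERS_DOWN
    # and _TRANSPORT. These are mostly informational: the client keeps
    # retrying connections internally, so only fatal errors should stop
    # the producer. (Escalation policy here is an assumption.)
    if err.fatal():
        log.critical("Fatal Kafka error: %s", err)
        raise SystemExit(1)
    log.warning("Kafka error (client will retry): %s", err)


# Placeholder values standing in for the settings.* constants in the report.
producer_conf = {
    "bootstrap.servers": "b-1.XXX.kafka.XXX.amazonaws.com:9094",
    "queue.buffering.max.messages": 100_000,
    "queue.buffering.max.kbytes": 1_048_576,
    "linger.ms": 100,
    "enable.idempotence": True,
    "acks": "all",
    "delivery.timeout.ms": 120_000,
    "security.protocol": "SSL",
    "error_cb": error_cb,
}
```

With `enable.idempotence` set, the client retries delivery internally, so a callback that only escalates on `err.fatal()` avoids tearing the application down on transient broker restarts like the MSK maintenance described above.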