confluent-kafka-python
ALL_BROKERS_DOWN on Producer
Description
We have spotted the following behavior after moving to version 2.5.0.
After regular AWS MSK maintenance, where all brokers are restarted one by one, we see the following errors in the logs:

```
cimpl.KafkaException: KafkaError{code=_ALL_BROKERS_DOWN,val=-187,str="3/3 brokers are down"}
```

and

```
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="ssl://b-3.XXX.kafka.XXX.amazonaws.com:9094/3: Disconnected (after 1091448ms in state UP)"}
```
These errors occur on the producer side, occasionally even 2-3 days after the broker restart. The behavior also appears during a Kubernetes deployment restart, when flush is called on the producer.
The issue only appears after the broker restarts that happen during regular MSK maintenance.
The issue is also present in version 2.4.0; before that version, we did not encounter this behavior.
Once the application is restarted on K8s, the issue goes away.
How to reproduce
1. Restart one broker on the AWS MSK cluster (3 brokers).
2. Restart the K8s deployment (application with the producer side).
Checklist
Please provide the following information:
- [ ] confluent-kafka-python and librdkafka version (`confluent_kafka.version()` and `confluent_kafka.libversion()`): 2.5.0 / 2.5.0
- [ ] Apache Kafka broker version: 3.5.1
- [ ] Client configuration:

```python
{
    "queue.buffering.max.messages": settings.KAFKA_PRODUCER_QUEUE_COUNT,
    "queue.buffering.max.kbytes": settings.KAFKA_PRODUCER_QUEUE_BUFF_KBYTES,
    "linger.ms": settings.KAFKA_PRODUCER_LINGER,
    "bootstrap.servers": settings.KAFKA_BROKERS,
    "enable.idempotence": True,
    "acks": "all",
    "delivery.timeout.ms": settings.KAFKA_PRODUCER_DELIVERY_TIMEOUT_MS,
    "security.protocol": "SSL",
    "error_cb": error_cb,
}
```

- [ ] Operating system:
- [ ] Provide client logs (with `'debug': '..'` as necessary)
- [ ] Provide broker log excerpts
- [x] Critical issue
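For context, a minimal sketch of the configuration and an `error_cb` along the lines of the one referenced above. The `settings.*` values are replaced with illustrative placeholders, and the callback's behavior is an assumption rather than the reporter's actual code; note that librdkafka surfaces `_ALL_BROKERS_DOWN` and `_TRANSPORT` through `error_cb` as non-fatal, retriable errors, so only `err.fatal()` should escalate.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("producer")


def error_cb(err):
    # Invoked by librdkafka for global errors such as _ALL_BROKERS_DOWN
    # and _TRANSPORT. These are mostly informational: the client keeps
    # retrying connections internally, so only fatal errors should stop
    # the producer. (Escalation policy here is an assumption.)
    if err.fatal():
        log.critical("Fatal Kafka error: %s", err)
        raise SystemExit(1)
    log.warning("Kafka error (client will retry): %s", err)


# Placeholder values standing in for the settings.* constants in the report.
producer_conf = {
    "bootstrap.servers": "b-1.XXX.kafka.XXX.amazonaws.com:9094",
    "queue.buffering.max.messages": 100_000,
    "queue.buffering.max.kbytes": 1_048_576,
    "linger.ms": 100,
    "enable.idempotence": True,
    "acks": "all",
    "delivery.timeout.ms": 120_000,
    "security.protocol": "SSL",
    "error_cb": error_cb,
}
```

With `enable.idempotence` set, the client retries delivery internally, so a callback that only escalates on `err.fatal()` avoids tearing the application down on transient broker restarts like the MSK maintenance described above.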