confluent-kafka-python icon indicating copy to clipboard operation
confluent-kafka-python copied to clipboard

Producer gets stuck in init_producer_id loop

Open rystsov opened this issue 3 years ago • 2 comments

Description

Observed a pathological behavior which is missing in other clients (java's kafka-clients & franz-go).

With disabled transactions when any client (including confluent_kafka) is about to send init_producer_id it doesn't use txn coordinator, picks a broker randomly and sends a request there. If the request fails resulting in broker returning a retry-able error kafka-clients & franz-go restart the procedure from scratch, pick a random broker again and use it to retry the request while confluent_kafka is reusing the original broker. In case the problem with the broker is permanent kafka-clients & franz-go has a chance to choose a healthy node while confluent_kafka gets stuck.

How to reproduce

I've used Redpanda but based on the source code the same problem affects Kafka too (not sure if Kafka returns retry-able error for init_producer_id but it returns unknown_server_error which confluent_kafka treats the same way and retries). The repro step:

  • have a 3 node cluster
  • isolate a node
  • start producing record
  • if the producer sends init_producer_id to the isolated node it gets stuck

Client logs (docker-rp-4 was isolated):

%4|1651850857.547|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850860.049|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850862.550|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850863.050|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850873.155|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850883.259|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying
%4|1651850886.100|GETPID|rdkafka#producer-7| [thrd:main]: Failed to acquire idempotence PID from broker docker-rp-4:9092/1: Broker: Not coordinator: retrying

Checklist

Please provide the following information:

  • [ ] confluent-kafka-python version: 1.8.2
  • [ ] Apache Kafka broker version: Redpanda v22.1.2
  • [ ] Client configuration: {"bootstrap.servers": ..., "request.required.acks": -1, "retries": 5, "enable.idempotence": True }
  • [ ] Operating system: ubuntu 20.04
  • [ ] Provide client logs
  • [ ] Provide broker log excerpts
  • [ ] Critical issue

rystsov avatar May 07 '22 06:05 rystsov

@edenhill - seems like something we could improve

mhowlett avatar May 13 '22 17:05 mhowlett

Good find! Created upstream librdkafka issue: https://github.com/edenhill/librdkafka/issues/3848

edenhill avatar May 16 '22 10:05 edenhill

tracking this in the above mentioned librdkafka issue.

mhowlett avatar Oct 25 '22 17:10 mhowlett