
Kafka producer does not report _ISR_INSUFF as error code when ISR count is less; instead reports _MSG_TIMED_OUT

Open prashantochatterjee opened this issue 1 year ago • 3 comments

Description

The issue is seen when a node is brought down in a 3-node cluster, leaving one of the topic partitions with an ISR count of 1 against a min-ISR setting of 2. The Kafka producer client is configured to report errors via the error_cb callback. The error code reported is _MSG_TIMED_OUT rather than the more informative _ISR_INSUFF, which led me to believe a message was being sent to the node that was down.

How to reproduce

  • Bring down a node in a minimal 3-node cluster so that the ISR count for one or more topic partitions drops below the minimum ISR configured for the topic/broker.
  • Create a producer with the acks=all setting.
  • Try to produce messages to a topic partition affected by the min-ISR insufficiency.

Checklist

Please provide the following information:

  • [x] confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()):

confluent_kafka.version()    ('2.3.0', 33751040)
confluent_kafka.libversion() ('2.3.0', 33751295)

  • [x] Apache Kafka broker version: 3.6.0

  • [x] Client configuration:

```python
{
    # Error reporting
    'error_cb': self._error_notification,
    #'debug': 'broker,topic,msg',
    'debug': 'broker,topic,msg',
    # Use compression
    'compression.codec': 'lz4',
    # Set acks for reliability
    'acks': 'all',
    'max.in.flight.requests.per.connection': 1,
    # Configuration settings for HA
    'topic.metadata.refresh.interval.ms': 120000
}
```

  • [x] Operating system: Debian GNU/Linux 11 (bullseye)

  • [x] Provide client logs (with 'debug': '..' as necessary)

2024-02-18 09:14:18,997 [PID=1:TID=Thread-6482:cfx.rda_messaging.kafka_client:_error_notification:788] ERROR - KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://rda-kafka-controller-2.rda-kafka-controller-headless.rda-fabric.svc.cluster.local:9093/2: Connection setup timed out in state CONNECT (after 30025ms in state CONNECT, 1 identical error(s) suppressed)"}
2024-02-18 09:14:18,998 [PID=1:TID=Thread-6482:cfx.rda_messaging.kafka_client:_delivery_notification:759] ERROR - Message delivery failed: KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}! Message=b'{"type": "SOURCE-EVENT", "sourceEventId": "f2b4d3de-46a6-4c7c-9243-f57c2489e18b", "id": "f2b4d3de-46a6-4c7c-9243-f57c2489e18b", "sourceSystemId": "56795ec7-f19b-429b-913b-b69bdc4a6baf", "sourceSystemName": "Kafka", "projectId": "337ddf08-cc96-11ee-8d32-0a26cc0b6dec", "eventCategory": "alerts", "customerId": "ac0c16424518460c8c0ef0f632515731", "createdat": 1708244619015.706, "sourceReceivedAt": 1708244619015.6702, "parentEventId": null, "payload": "{\"customerId\": \"58c5fe78-ca58-46a8-8db9-a172fec7d2bc\", \"assetId\": \"021456a7-b56a-4e0b-93a0-017f60a5d75f\", \"componentId\": \"f21056f7-cd02-43f6-b44c-c2fe2bcd12d1\", \"alertCategory\": \"Disk Partition\", \"alertType\": \"Disk Partition\", \"sourceSystemId\": \"CFX_PULSE\", \"projectId\": \"814ab092-83c3-4d0c-8805-063936ba3666\", \"environmentId\": \"a74b7e03-5d48-45fc-95fc-8df9ec47f3b2\", \"severity\": \"MAJOR\", \"assetType\": \"infra-account\", \"assetIpAddress\": \"10.199.0.118\", \"assetName\": \"SIM-m0.0-1-250\", \"componentName\": \"1.1\", \"sourceMechanism\": \"SNMP\", \"message\": \"Disk Partition \\\"flash[Disk:flash]\\\" Utilization [80.95%] exceeded configured value 25.0%\", \"raisedAt\": 1708241018415.1145, \"minimumOccurrence\": 0, \"sourceSystemName\": \"Syslog_udp_Notifications\"}", "status": "Completed"}', Key=b'f2b4d3de-46a6-4c7c-9243-f57c2489e18b'
.................
2024-02-19 05:26:45,329 [PID=1:TID=Thread-70:cfx.rda_messaging.kafka_client:_error_notification:789] ERROR - KafkaError{code=_TRANSPORT,val=-195,str="sasl_ssl://rda-kafka-broker-2.rda-kafka-broker-headless.rda-fabric.svc.cluster.local:9093/bootstrap: Connection setup timed out in state CONNECT (after 30028ms in state CONNECT)"}
{"sourceevent": {"id": "06497124-e2fb-4cd3-a348-2cf14b3d849a", "received": 1708320405328.9392}}
%7|1708320417.127|PRODUCE|rdkafka#producer-3| [thrd:sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless]: sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless.rda-fabric.svc.cluster.local:9093/0: ac0c16424518460c8c0ef0f632515731.ingestion-tracker [12]: Produce MessageSet with 1 message(s) (1025 bytes, ApiVersion 7, MsgVersion 2, MsgId 0, BaseSeq -1, PID{Invalid}, lz4)
%7|1708320417.170|REQERR|rdkafka#producer-3| [thrd:sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless]: sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless.rda-fabric.svc.cluster.local:9093/0: ProduceRequest failed: Broker: Not enough in-sync replicas: explicit actions Retry,MsgNotPersisted
%7|1708320417.170|MSGSET|rdkafka#producer-3| [thrd:sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless]: sasl_ssl://rda-kafka-controller-0.rda-kafka-controller-headless.rda-fabric.svc.cluster.local:9093/0: ac0c16424518460c8c0ef0f632515731.ingestion-tracker [12]: MessageSet with 1 message(s) (MsgId 0, BaseSeq -1) encountered error: Broker: Not enough in-sync replicas (actions Retry,MsgNotPersisted)

  • [ ] Provide broker log excerpts

  • [ ] Critical issue

prashantochatterjee avatar Feb 19 '24 08:02 prashantochatterjee

The _ISR_INSUFF error is retried internally by the producer until message.timeout.ms is reached, after which _MSG_TIMED_OUT is raised. That is why _MSG_TIMED_OUT is reported instead of _ISR_INSUFF. This is by design.
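The retry-until-timeout behaviour described above can be sketched in plain Python (no Kafka involved; function names, the string error codes, and the backoff value are invented for the illustration):

```python
# Sketch of librdkafka's internal handling: a retriable produce error (such as
# "not enough in-sync replicas") is retried until message.timeout.ms elapses,
# and only then does the application see a timeout error.
import time

def produce_with_retries(attempt_send, message_timeout_ms, retry_backoff_ms=50):
    """attempt_send() returns None on success or an error code string."""
    deadline = time.monotonic() + message_timeout_ms / 1000.0
    while time.monotonic() < deadline:
        err = attempt_send()
        if err is None:
            return None                      # delivered successfully
        # Retriable error: back off and try again until the deadline.
        time.sleep(retry_backoff_ms / 1000.0)
    return "_MSG_TIMED_OUT"                  # only the timeout surfaces

# A broker that always answers with an ISR shortage:
result = produce_with_retries(lambda: "_ISR_INSUFF", message_timeout_ms=300)
print(result)  # _MSG_TIMED_OUT
```

The per-attempt error code never escapes the loop, which mirrors why the delivery report carries _MSG_TIMED_OUT rather than _ISR_INSUFF.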

pranavrth avatar May 22 '24 14:05 pranavrth

@pranavrth Thanks for your explanation. Is there any way the _ISR_INSUFF code can take precedence over the timeout? As it stands, the real issue escapes attention.
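Until the client itself surfaces the underlying cause, one application-side workaround is to capture the librdkafka debug/log lines that name the real failure (the REQERR lines above show "Not enough in-sync replicas" when 'debug': 'msg' is enabled) and report the last one seen alongside the final timeout. The class below is a hypothetical sketch, not a library feature; the marker strings and method names are assumptions:

```python
# Hedged workaround sketch: a logging.Handler that remembers the most recent
# substantive produce error from librdkafka's log output, so the eventual
# _MSG_TIMED_OUT report can be annotated with a likely root cause.
import logging

class LastErrorTracker(logging.Handler):
    """Collect librdkafka log lines that hint at the real failure."""
    INTERESTING = ("Not enough in-sync replicas", "Connection setup timed out")

    def __init__(self):
        super().__init__()
        self.last_cause = None

    def emit(self, record):
        msg = record.getMessage()
        for marker in self.INTERESTING:
            if marker in msg:
                self.last_cause = msg

    def describe_timeout(self):
        # Call this from the delivery callback when err is _MSG_TIMED_OUT.
        return f"Message timed out (last seen cause: {self.last_cause or 'unknown'})"
```

Usage sketch: attach the handler to a logger, then pass that logger (plus 'debug': 'msg') in the Producer configuration via the 'logger' setting, and call describe_timeout() from the delivery callback when a timeout is reported.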

prashantochatterjee avatar May 29 '24 05:05 prashantochatterjee

This would certainly be an improvement; we will take it up later.

pranavrth avatar May 31 '24 10:05 pranavrth