
Facing "Timed out ProduceRequest in flight" messages

Open vikramindian opened this issue 5 years ago • 5 comments

Description

Hi, I'm publishing messages on about 8000 topics using the confluent-kafka-python producer. The cluster is a five-node setup. After some time I observed the timeout messages below in my console.

%5|1579869223.925|REQTMOUT|test-8000-py2#producer-1| [thrd:sasl_plaintext://mwkafka-staging-01.nyc.deshaw.com:9092/1]: sasl_plaintext://mwkafka-staging-01.nyc.deshaw.com:9092/1: Timed out ProduceRequest in flight (after 1187ms, timeout #0): possibly held back by preceeding ProduceRequest with timeout in 51063ms
%4|1579869223.925|REQTMOUT|test-8000-py2#producer-1| [thrd:sasl_plaintext://mwkafka-staging-01.nyc.deshaw.com:9092/1]: sasl_plaintext://mwkafka-staging-01.nyc.deshaw.com:9092/1: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests

I have an error callback, which gives: KafkaError{code=_TIMED_OUT,val=-185,str="sasl_plaintext://mwkafka-staging-01.nyc.deshaw.com:9092/1: 3 request(s) timed out: disconnect (after 8006ms in state UP)"}

Since it says the request timed out, I increased request.timeout.ms from 5000 (default) to 100000, but this did not resolve the issue. However, when I increased linger.ms to 100, it worked. Why is that, and why did increasing request.timeout.ms not resolve it?

Also, why is the default for linger.ms so low? It is 0.5 ms.
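
For illustration (not the reporter's actual code), here is a minimal sketch of where the settings mentioned above are set on the Python producer; the broker address, topic name, and payload are placeholders:

```python
from confluent_kafka import Producer

def on_error(err):
    # Global errors such as the _TIMED_OUT error above are delivered here.
    print("error_cb:", err)

def on_delivery(err, msg):
    if err is not None:
        print("delivery failed:", err)

conf = {
    "bootstrap.servers": "broker:9092",   # placeholder
    "error_cb": on_error,
    "request.timeout.ms": 100000,         # raised from the default, as described above
    "linger.ms": 100,                     # the change that made the timeouts go away
}
p = Producer(conf)

p.produce("some-topic", value=b"payload", callback=on_delivery)  # placeholder topic
p.poll(0)   # serve delivery/error callbacks
p.flush()
```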

How to reproduce

Checklist

Please provide the following information:

  • [ ] confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()):
  • [ ] Apache Kafka broker version:
  • [ ] Client configuration: {...}
  • [ ] Operating system:
  • [ ] Provide client logs (with 'debug': '..' as necessary)
  • [ ] Provide broker log excerpts
  • [ ] Critical issue

vikramindian avatar Jan 24 '20 13:01 vikramindian

Setting linger.ms facilitates batching of your requests, meaning you send more messages per request. With the default you achieve almost no batching, so you send very few messages per produce request. Sending lots and lots of small requests induces head-of-line blocking, which accumulates over time until your requests fail to be sent and acknowledged by the broker within the required window.

The default for linger.ms comes from the Java client, where it is actually 0; it was bumped to 0.5 ms here when sub-millisecond linger.ms support was added in 1.2.0. Many of the original defaults treat latency as the highest priority rather than increased throughput. This is why acks was originally set to 1 rather than -1 (all) up until 1.0. Where possible we try to maintain parity with the Java client to keep the experience as consistent as we can. We do diverge in some places, but that is what we strive for.

With that said, you may enjoy the following post, which describes the trade-offs you must consider when configuring clients. It is Java-focused, but as mentioned we play by the same rules.

https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
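
For illustration (not part of the original reply), a hedged sketch contrasting the latency-first defaults with a more throughput-oriented configuration; the broker address and exact values are placeholders, not recommendations:

```python
from confluent_kafka import Producer

# Near-default: very little batching, so many small ProduceRequests per broker.
latency_first = Producer({
    "bootstrap.servers": "broker:9092",  # placeholder
    "linger.ms": 0.5,
})

# Throughput-oriented: let messages accumulate so each request carries a larger batch.
throughput_first = Producer({
    "bootstrap.servers": "broker:9092",  # placeholder
    "linger.ms": 100,                    # wait up to 100 ms to fill batches
    "batch.num.messages": 100000,        # allow more messages per MessageSet
    "acks": "all",                       # durability-first acknowledgement setting
})
```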

rnpridgeon avatar Jan 27 '20 11:01 rnpridgeon

@rnpridgeon Thank you for the reply.

vikramindian avatar Jan 29 '20 10:01 vikramindian

Hi, I have a few more observations here.

  • I created 5000 topics and had a producer publish to these topics. I did not receive any "request timed out" messages in the error callback.
  • I created 5000 more topics (10,000 in total) and again had my producer publish to only the 5000 topics created earlier. This time I observed timeout messages in the error callback: KafkaError{code=_TIMED_OUT,val=-185,str="sasl_plaintext://mwkafka-staging-01.tbd.deshaw.com:9092/5: 2 request(s) timed out: disconnect (after 2002ms in state UP)"}

I monitored the stats of all brokers (top command) and observed that the virtual size was higher in case 2 for all brokers (almost double). That looks abnormal to me. Can you look into this?

vikramindian avatar Feb 03 '20 08:02 vikramindian

I'm also facing the same issue. It would be good if someone could post a fix.

nverma-ocrolus avatar Oct 08 '22 10:10 nverma-ocrolus

When I produce 1000000 messages to a topic, it works flawlessly.

Produced 1000000 messages in 15.84 seconds, throughput: 63150.40 messages/second

Anything set for max_messages > 1M errors out with a timeout, even after tuning linger.ms and batch.size.

Reviewing https://github.com/confluentinc/confluent-kafka-python/blob/4f25c8cd2d7d707b3705c95dd2e0dca81dc0fd55/src/confluent_kafka/src/Consumer.c#L1066, are we capped?

isardam avatar Dec 31 '24 17:12 isardam

Closing the issue, but also giving some guidance for anyone who lands here in the future. In general, Kafka is a streaming system, so it is oriented around fast and reliable delivery within a small window of time. Sending 1M+ messages at a time is much more of a batch operation than a streaming one. There are buffers and message protocols that become less efficient or reliable the more you try to operate in batch mode, so some hard limits are in place, though I personally don't know where the tipping points are for those mechanisms. When doing migrations or DB uploads to Kafka, it's advisable to send smaller batches until you exhaust the source, rather than making one giant produce call, and then move into a more standard stream-publication mode.
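
A minimal sketch of that chunked approach (not from the thread); it assumes an already-configured Producer and an iterable of byte payloads, and the broker address, topic, and chunk size are placeholders:

```python
from confluent_kafka import Producer

p = Producer({"bootstrap.servers": "broker:9092"})  # placeholder config

def produce_in_chunks(records, topic, chunk_size=100_000):
    sent = 0
    for value in records:
        while True:
            try:
                p.produce(topic, value=value)
                break
            except BufferError:
                # Local queue is full; give librdkafka time to drain it.
                p.poll(1)
        sent += 1
        p.poll(0)                 # serve delivery callbacks as we go
        if sent % chunk_size == 0:
            p.flush()             # wait out each chunk instead of one huge flush at the end
    p.flush()
```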

MSeal avatar Jun 26 '25 23:06 MSeal