
Timed out OffsetCommitRequest in flight (after 60628ms, timeout #0)

Open chunaiarun opened this issue 1 year ago • 5 comments

Describe the bug

We started 32 consumers to process 7 million messages from 32 partitions (from the Kafka queue). We commit offsets manually and asynchronously.

To Reproduce

  • Produce 7 million messages, distributed equally across 32 partitions
  • Configure the consumer code to do "manual async commit"
  • Start 32 consumers at the same time
  • The timeout errors below appear in almost all consumer logs

Expected behavior

"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #0)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #1)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #2)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #3)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #4)")"
"(4i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out 66792 in-flight, 0 retry-queued, 95763 out-queue, 0 partially-sent requests")"
"(3i;"FAIL";"[thrd:GroupCoordinator]: GroupCoordinator: :443: 162555 request(s) timed out

It looks like the library internally retries the commit and succeeds, and I can see the lag going down fine, but how can we avoid these errors? What is the maximum time (in ms) it keeps retrying? Which retry-related librdkafka configuration (https://github.com/confluentinc/librdkafka/blob/master/CONFIGURATION.md) applies, and should I increase or adjust any setting to be on the safe side?

FYI, the consumer configuration:

(fetch.wait.max.ms;10);
(statistics.interval.ms;10000);
(enable.auto.commit;false);
(enable.auto.offset.store;false);
(message.max.bytes;1000000000)

We are using async commit here:

.kfk.CommitOffsets[x[client];x[topic];((enlist x[partition])!(enlist 1+x[offset]));1b]

Desktop (please complete the following information):

  • OS: Linux
  • 12 CPUs

chunaiarun, Aug 13 '24 12:08

Correction to my point above about commit retries ("Looks like internally the library retries to commit and does it successfully"): according to the logs, all the async commit retries failed. Since we process messages sequentially from each partition, the subsequent async commits still commit later offsets even when this error occurs, so the committed position keeps moving ahead.

chunaiarun, Aug 13 '24 14:08

This doc may be relevant: https://github.com/confluentinc/librdkafka/wiki/FAQ (section: https://github.com/confluentinc/librdkafka/wiki/FAQ#why-committing-each-message-is-slow). If you pay for Kafka support (e.g. via Confluent), they may be of additional help in debugging/advising.
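As a rough illustration of the idea behind that FAQ section (commit less often than once per message), here is a minimal sketch in Python. It is pure bookkeeping and does not use any Kafka library: `BatchedCommitter` and `commit_fn` are hypothetical names, and the thresholds are arbitrary assumptions rather than recommended values. It keeps only the highest "next offset" per partition and hands the whole map to `commit_fn` once every `max_messages` messages or `max_interval_s` seconds.

```python
import time
from typing import Callable, Dict, Tuple

TopicPartition = Tuple[str, int]


class BatchedCommitter:
    """Tracks the latest offset per partition and commits in batches.

    Hypothetical helper: commit_fn stands in for whatever commit call the
    client library provides and receives {(topic, partition): next_offset}.
    """

    def __init__(
        self,
        commit_fn: Callable[[Dict[TopicPartition, int]], None],
        max_messages: int = 1000,      # assumed threshold, tune for your load
        max_interval_s: float = 5.0,   # assumed threshold, tune for your load
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._commit_fn = commit_fn
        self._max_messages = max_messages
        self._max_interval_s = max_interval_s
        self._clock = clock
        self._pending: Dict[TopicPartition, int] = {}
        self._count = 0
        self._last_flush = clock()

    def record(self, topic: str, partition: int, offset: int) -> None:
        """Note a processed message; commit only when a threshold is hit."""
        key = (topic, partition)
        next_offset = offset + 1  # committed offset = next message to read
        if next_offset > self._pending.get(key, -1):
            self._pending[key] = next_offset
        self._count += 1
        elapsed = self._clock() - self._last_flush
        if self._count >= self._max_messages or elapsed >= self._max_interval_s:
            self.flush()

    def flush(self) -> None:
        """Commit everything pending in one request (no-op if nothing pending)."""
        if self._pending:
            self._commit_fn(dict(self._pending))
        self._pending.clear()
        self._count = 0
        self._last_flush = self._clock()
```

In the q code from this issue, `commit_fn` would presumably correspond to a single `.kfk.CommitOffsets` call whose partition-to-offset dictionary covers several partitions at once; that mapping is an assumption and has not been tested here. Calling `flush()` on shutdown keeps the last partial batch from being lost.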

sshanks-kx, Aug 15 '24 10:08

Thank you very much

chunaiarun, Aug 20 '24 04:08

The referenced doc suggests using the offset store instead of committing each message, but that works only with enable.auto.commit set to true, right?

chunaiarun, Aug 26 '24 12:08

With enable.auto.commit=true, offsets are batch-committed synchronously on a timer, so the consumer waits until the commits are done. As it's not committing asynchronously, it will (depending on your app/design) have far fewer messages in flight, while potentially consuming more slowly.

Ref:

  • https://docs.confluent.io/platform/current/clients/consumer.html#offset-management
  • https://medium.com/@rramiz.rraza/kafka-programming-different-ways-to-commit-offsets-7bcd179b225a
  • https://medium.com/apache-kafka-from-zero-to-hero/apache-kafka-guide-36-consumer-offset-commit-strategies-41ef6bf34fcd
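For the offset-store approach asked about above, the settings involved might look like the following sketch. The property names are standard librdkafka configuration keys; the helper name `offset_store_config`, the 5000 ms interval, and the example broker address are assumptions for illustration only. Whether the q interface used in this issue exposes an offset-store call is not confirmed here.

```python
def offset_store_config(
    bootstrap_servers: str,
    group_id: str,
    commit_interval_ms: int = 5000,  # assumed value, not a recommendation
) -> dict:
    """Build consumer settings for 'store offsets manually, commit on a timer'.

    Hypothetical helper: it only assembles a dict of librdkafka properties.
    """
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # The client commits stored offsets in the background on a timer.
        "enable.auto.commit": True,
        # The application decides which offsets are eligible for commit by
        # storing them explicitly after a message has been processed.
        "enable.auto.offset.store": False,
        "auto.commit.interval.ms": commit_interval_ms,
    }


# Example: settings for a consumer in a hypothetical group "my-group".
config = offset_store_config("localhost:9092", "my-group")
```

With a client that supports it, the application would then store each processed message's offset (in the confluent-kafka Python client this is `consumer.store_offsets(message=msg)`) and leave the actual commit requests to the timer, so that no commit request is sent per message.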

sshanks-kx, Aug 26 '24 15:08

Please reopen if there's an issue. Thanks.

sshanks-kx, Sep 13 '24 15:09