kafka
Timed out OffsetCommitRequest in flight (after 60628ms, timeout #0)
Describe the bug
32 consumers were started to process 7 million messages from 32 partitions (from the Kafka queue). We are doing manual async commits.
To Reproduce
- Have 7 million messages produced and segregated equally into 32 partitions
- Configure consumer code to do "manual async commit"
- Then start 32 consumers at the same time
- We saw the timeout error below in almost all consumer logs.
Expected behavior
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #0)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #1)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #2)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #3)")"
"(5i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out OffsetCommitRequest in flight (after 60628ms, timeout #4)")"
"(4i;"REQTMOUT";"[thrd:GroupCoordinator]: GroupCoordinator/2: Timed out 66792 in-flight, 0 retry-queued, 95763 out-queue, 0 partially-sent requests")"
"(3i;"FAIL";"[thrd:GroupCoordinator]: GroupCoordinator:
Looks like internally the library retries to commit and does it successfully, and I see the lag go down fine, but how can this be avoided? What is the maximum retry time, and which librdkafka configuration (https://github.com/confluentinc/librdkafka/blob/master/CONFIGURATION.md) controls the retries? To be on the safer side, should I increase or adjust any configuration?
FYI
(fetch.wait.max.ms;10);
(statistics.interval.ms;10000);
(enable.auto.commit;false);
(enable.auto.offset.store;false);
(message.max.bytes;1000000000)
);
.kfk.CommitOffsets[x[client];x[topic];((enlist x[partition])!(enlist 1+x[offset]));1b];]
We are using async commit here.
Desktop (please complete the following information):
- OS: Linux
- 12 CPUs
Correction to what I said above about the commit retries ("Looks like internally the library retries to commit and does it successfully"): according to the logs, all the async commit retries failed. Since we process messages sequentially from each partition, the subsequent async commits still commit the later offsets even when this error appears, so the committed offset keeps moving ahead.
This doc may be of relevance: https://github.com/confluentinc/librdkafka/wiki/FAQ (section: https://github.com/confluentinc/librdkafka/wiki/FAQ#why-committing-each-message-is-slow). If you pay for Kafka support (e.g. via Confluent), they may be of additional help in debugging/advising.
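One way to act on that FAQ section without switching to auto commit is to commit once per batch instead of once per message. The following is a minimal sketch only: `pending`, `seen`, `batchSize`, `track`, `flush` and `onProcessed` are hypothetical names, the batch size of 1000 is an arbitrary assumption, and `msg` is assumed to be the same message dictionary (`x`) used in the snippet from the original post. The only library call is `.kfk.CommitOffsets`, called with the same arguments as in that snippet.

```q
/ Sketch only: every name below except .kfk.CommitOffsets is hypothetical.
pending:()!()      / partition -> next offset to commit
seen:0             / messages processed since the last commit
batchSize:1000     / assumed threshold; tune for the workload

/ Remember the newest processed offset per partition; no broker request is made.
track:{[msg]
  @[`pending;msg`partition;:;1+msg`offset];
  seen::seen+1;
  }

/ Issue one commit covering every partition touched since the last flush.
/ The final argument 1b is copied from the original post, where it is described as async.
flush:{[client;topic]
  if[0<count pending;
    .kfk.CommitOffsets[client;topic;pending;1b];
    pending::()!();
    seen::0];
  }

/ Call after each message is processed, in place of the per-message commit.
onProcessed:{[msg]
  track msg;
  if[seen>=batchSize; flush[msg`client;msg`topic]];
  }
```

With this shape the number of OffsetCommitRequests drops by roughly a factor of `batchSize`, at the cost of reprocessing up to one batch per partition after a crash. A real consumer would presumably also call `flush` on shutdown and on rebalance so that the last partial batch is not lost.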
Thank you very much
The reference doc suggests using the message offset store, but it will work with auto commit set to true, right?
enable.auto.commit=true batch-commits messages synchronously on a timer, so the consumer will wait until the commits are done. As it's not async, (depending on your app/design) it'll have far fewer messages in flight while potentially consuming more slowly.
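For reference, the combination described in the librdkafka FAQ could be expressed as a config fragment like the one below. This is a sketch under assumptions: `cfg` is a hypothetical name, and `5000` is the documented librdkafka default for `auto.commit.interval.ms`, shown only to make the timer explicit. With `enable.auto.offset.store` set to `false`, the application has to store each offset itself after processing; whether the q interface exposes a call for that is something to check in its documentation, so none is shown here.

```q
/ Sketch only: cfg is a hypothetical name; merge these entries into the existing consumer config.
cfg:(!) . flip(
  (`enable.auto.commit;`true);          / let the library commit on a timer
  (`enable.auto.offset.store;`false);   / the app decides which offsets are eligible for commit
  (`auto.commit.interval.ms;`5000)      / commit timer in ms (librdkafka default)
  );
```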
Ref: https://docs.confluent.io/platform/current/clients/consumer.html#offset-management https://medium.com/@rramiz.rraza/kafka-programming-different-ways-to-commit-offsets-7bcd179b225a https://medium.com/apache-kafka-from-zero-to-hero/apache-kafka-guide-36-consumer-offset-commit-strategies-41ef6bf34fcd
Please reopen if there's an issue. Thanks