confluent-kafka-dotnet
confluent-kafka-dotnet copied to clipboard
OutOfOrderSequenceException
Description
Hi Team,
I am getting OutOfOrderSequenceException,
Please find my settings
"LingerMs": 20, "BatchNumMessages": 1000, "MessageMaxBytes": 1000000, "MaxInFlight": 4, "Acks": -1, "MessageSendMaxRetries": 1000000, "MessageTimeoutMs": 30000, "EnableIdempotence": true
How to reproduce
I am doing a load test. I am generating a load of 100kb message string in a loop and sending 1 million messages. I am getting the Out of order sequence exception. Is it because of the settin gthe idempotence but not the transactional.id?. Kindly help..
Checklist
Please provide the following information:
- [1-beta2 ] Confluent.Kafka nuget version:
- [2.0 ] Apache Kafka version:
- [ ] Client configuration:
- [ ] Operating system:
- [ ] Provide logs (with "debug" : "..." as necessary in configuration)
- [ ] Provide broker log excerpts
- [ ] Critical issue
first thing to try is reference the librdkafka.redist 1.0.0-RC4 package. 1.0-beta2 references 1.0.0-PRE1 iirc and the idempotent producer functionality in that version was buggy.
@mhowlett : Sure, I will update the library and let you know. Thanks
@mhowlett : After upgrading to the RC4, we get the same error.
@Tenseiga Can you reproduce this with debug
set to eos,msg,broker
and provide us with logs?
Thanks
Make sure to reproduce on librdkafka.redist 1.0.0-RC4
to be sure librdkafka.redist 1.0.0-RC4 is being used, it would be worth checking the value of Library.VersionString
.
@mhowlett Library.VersionString is 1.0.0-Pre2. Does it equal to 1.0.0-RC4?
@mhowlett The Library.VersionString says 1.0.0-Pre2, in the nuget repo there is bo pre2, I see only RC4, assuming this is pre2 is RC4/RC5. I have collected the logs. @edenhill Please find the statistics that you have asked. statistic.txt
I don't think that's librdkafka v1.0.0-RC4.
@edenhill : I have updated the librdkafka.redist to RC4 and rebuilt the Confluent.Kafka.dll and using in my library. I still get 1.0.0-pre2
BTW you will see the COnfluent.Kafka version as 2.0.0,its my local change. Am I doing know?
@mhowlett ?
@edenhill : internally it creates a delegate to 'rd_kafka_version_str' method.
Thing is there is no 1.0.0-pre2 librdkafka version: https://www.nuget.org/packages/librdkafka.redist/
@mhowlett any help on this :)
I hope to release -beta3 later today which will reference librdkafka RC5.
Alternatively you can use the Library.Load
method to explicitly load the librdkafka dependency.
are you on .net framework? I'm not sure if nuget will get the right package in that case (if an explicit librdkafka.redist is referenced) - this may be the problem. It works in my usual environment which is .net core on macos.
Sorry, I forgot to update librdkafka's internal version number so v1.0.0-PRE2 could mean any version since v1.0.0-PRE1 to ..RC-4. Please try to reproduce the issue on Matt's beta3, when it gets out.
@mhowlett : I am using dotnet core 2.1, win 10 pro, vs 2017
the issue may be related to memory management because we won't encounter the exception if we send the same 100kb message for 1 millions times. However, if we send 1 millions different messages (100kb per message), we will get the exception.
@edenhill : Then I am using RC4 release. Yet getting the same issue.
@Tenseiga Can you reproduce this with debug=msg,eos,broker,protocol
and provide the complete debug logs? Please also provide your producer configs. Thanks
@edenhill : Please find the attached. My producer settings
"Topic": "debug-1m-5",
"MessageSize": 102400,
"MessageCount": 1000000,
"Producer": {
"LingerMs": 20,
"BatchNumMessages": 1000,
"MessageMaxBytes": 1000000,
"MaxInFlight": 4,
"Acks": -1,
"MessageSendMaxRetries": 1000000,
"MessageTimeoutMs": 30000,
"EnableIdempotence": true
}
Thanks for the help
Those are statistics (without line breaks...), I need the debug logs (written to console/stderr).
@edenhill : Please find the Log. Trace.txt
BTW My Partition count is 6 and Replication factor 3 , with 3 broker 3 zookeeper set up in Kubernetes.
Thank you, that's very helpful!
@Tenseiga Can you verify what the Debug property was set to for the Trace.txt log run?
Its "msg,eos,broker,protocol".
After thorough analysis of the log files and idempotence code I've come to the conclusion that this is caused by the message timing out due to message.timeout.ms.
What happens, at a high level is:
- messages 1,2,3,4 are sent to the broker
- message 1 is successfully acked by the broker
- the broker hangs, causing the request timeout to kick in for message 2 that is in-flight
- messages 2,3,4 are re-inserted on the producer queue
- the partition timeout scanner runs (once per second) and sees that message 2 has timed out, removes message 2. Producer queue is now 3,4
- the broker comes up again
- producer sends messages from producer queue to the broker: 3,4
- the broker expected message 2, not 3, so returns an OutOfOrderSequence error
- producer receives the error and raises a fatal exception
While this issue was discussed briefly in the KIP-91 discussion thread, the issue exists in the Java client as well.
I'll look into what the best solution is for this.
A lower message.timeout.ms will trigger this behaviour more frequently, the workaround is thus to increase message.timeout.ms towards the max value.
@Tenseiga I don't think you were running librdkafka v1.0.0-RC4. The fatal error message was changed in RC3 to include the topic+partition, but in your log message it is missing, so I think you're using an older version.
FATAL [thrd:10.12.8.136:9092/bootstrap]: Fatal error: Broker: Broker received an out of order sequence number: ProduceRequest with 9 message(s) failed with rewound sequence number on broker 0 (PID{Id:240,Epoch:0}, base seq 25084 < next seq 26464): last error _TIMED_OUT (actions Retry,MsgPossiblyPersisted, base seq 26457, 644ms ago)
https://github.com/edenhill/librdkafka/commit/c2af9b3c49cad80a40da1847068c400f71f37b1b#diff-28f3a6d6e1e2b74d67ec19602f826162R2011
I've tried to reproduce the issue on master (RC5 equivalent) without luck, so I think this specific issue has been fixed. Can you verify by upgrading to confluent-kafka-dotnet v1.0-beta2 (which contains librdkafka v1.0.0-RC5) ?