confluent-kafka-dotnet icon indicating copy to clipboard operation
confluent-kafka-dotnet copied to clipboard

OutOfOrderSequenceException

Open srinathrangaramanujam opened this issue 6 years ago • 35 comments

Description

Hi Team,

I am getting OutOfOrderSequenceException,

Please find my settings

"LingerMs": 20, "BatchNumMessages": 1000, "MessageMaxBytes": 1000000, "MaxInFlight": 4, "Acks": -1, "MessageSendMaxRetries": 1000000, "MessageTimeoutMs": 30000, "EnableIdempotence": true

How to reproduce

I am doing a load test. I am generating a load of 100kb message string in a loop and sending 1 million messages. I am getting the Out of order sequence exception. Is it because of the settin gthe idempotence but not the transactional.id?. Kindly help..

Checklist

Please provide the following information:

  • [1-beta2 ] Confluent.Kafka nuget version:
  • [2.0 ] Apache Kafka version:
  • [ ] Client configuration:
  • [ ] Operating system:
  • [ ] Provide logs (with "debug" : "..." as necessary in configuration)
  • [ ] Provide broker log excerpts
  • [ ] Critical issue

srinathrangaramanujam avatar Dec 17 '18 03:12 srinathrangaramanujam

first thing to try is reference the librdkafka.redist 1.0.0-RC4 package. 1.0-beta2 references 1.0.0-PRE1 iirc and the idempotent producer functionality in that version was buggy.

mhowlett avatar Dec 17 '18 04:12 mhowlett

@mhowlett : Sure, I will update the library and let you know. Thanks

srinathrangaramanujam avatar Dec 17 '18 05:12 srinathrangaramanujam

@mhowlett : After upgrading to the RC4, we get the same error.

srinathrangaramanujam avatar Dec 17 '18 09:12 srinathrangaramanujam

@Tenseiga Can you reproduce this with debug set to eos,msg,broker and provide us with logs? Thanks

edenhill avatar Dec 17 '18 09:12 edenhill

Make sure to reproduce on librdkafka.redist 1.0.0-RC4

edenhill avatar Dec 17 '18 09:12 edenhill

to be sure librdkafka.redist 1.0.0-RC4 is being used, it would be worth checking the value of Library.VersionString.

mhowlett avatar Dec 17 '18 12:12 mhowlett

@mhowlett Library.VersionString is 1.0.0-Pre2. Does it equal to 1.0.0-RC4?

Bo-Yee-Woods avatar Dec 18 '18 02:12 Bo-Yee-Woods

@mhowlett The Library.VersionString says 1.0.0-Pre2, in the nuget repo there is bo pre2, I see only RC4, assuming this is pre2 is RC4/RC5. I have collected the logs. @edenhill Please find the statistics that you have asked. statistic.txt

srinathrangaramanujam avatar Dec 18 '18 04:12 srinathrangaramanujam

I don't think that's librdkafka v1.0.0-RC4.

edenhill avatar Dec 18 '18 06:12 edenhill

@edenhill : I have updated the librdkafka.redist to RC4 and rebuilt the Confluent.Kafka.dll and using in my library. I still get 1.0.0-pre2 image

BTW you will see the COnfluent.Kafka version as 2.0.0,its my local change. Am I doing know?

srinathrangaramanujam avatar Dec 18 '18 09:12 srinathrangaramanujam

@mhowlett ?

edenhill avatar Dec 18 '18 09:12 edenhill

@edenhill : internally it creates a delegate to 'rd_kafka_version_str' method.

srinathrangaramanujam avatar Dec 18 '18 09:12 srinathrangaramanujam

Thing is there is no 1.0.0-pre2 librdkafka version: https://www.nuget.org/packages/librdkafka.redist/

edenhill avatar Dec 18 '18 09:12 edenhill

@mhowlett any help on this :)

srinathrangaramanujam avatar Dec 18 '18 14:12 srinathrangaramanujam

I hope to release -beta3 later today which will reference librdkafka RC5.

Alternatively you can use the Library.Load method to explicitly load the librdkafka dependency.

are you on .net framework? I'm not sure if nuget will get the right package in that case (if an explicit librdkafka.redist is referenced) - this may be the problem. It works in my usual environment which is .net core on macos.

mhowlett avatar Dec 18 '18 16:12 mhowlett

Sorry, I forgot to update librdkafka's internal version number so v1.0.0-PRE2 could mean any version since v1.0.0-PRE1 to ..RC-4. Please try to reproduce the issue on Matt's beta3, when it gets out.

edenhill avatar Dec 18 '18 16:12 edenhill

@mhowlett : I am using dotnet core 2.1, win 10 pro, vs 2017

srinathrangaramanujam avatar Dec 19 '18 01:12 srinathrangaramanujam

the issue may be related to memory management because we won't encounter the exception if we send the same 100kb message for 1 millions times. However, if we send 1 millions different messages (100kb per message), we will get the exception.

Bo-Yee-Woods avatar Dec 19 '18 02:12 Bo-Yee-Woods

@edenhill : Then I am using RC4 release. Yet getting the same issue.

srinathrangaramanujam avatar Dec 19 '18 02:12 srinathrangaramanujam

@Tenseiga Can you reproduce this with debug=msg,eos,broker,protocol and provide the complete debug logs? Please also provide your producer configs. Thanks

edenhill avatar Dec 19 '18 09:12 edenhill

@edenhill : Please find the attached. My producer settings

"Topic": "debug-1m-5",
"MessageSize": 102400, 
"MessageCount": 1000000,
"Producer": {
  "LingerMs": 20,
  "BatchNumMessages": 1000,
  "MessageMaxBytes": 1000000,
  "MaxInFlight": 4,
  "Acks": -1,
  "MessageSendMaxRetries": 1000000,
  "MessageTimeoutMs": 30000,
  "EnableIdempotence": true
}

Statistics.txt

Thanks for the help

srinathrangaramanujam avatar Dec 19 '18 15:12 srinathrangaramanujam

Those are statistics (without line breaks...), I need the debug logs (written to console/stderr).

edenhill avatar Dec 19 '18 15:12 edenhill

@edenhill : Please find the Log. Trace.txt

srinathrangaramanujam avatar Dec 19 '18 17:12 srinathrangaramanujam

BTW My Partition count is 6 and Replication factor 3 , with 3 broker 3 zookeeper set up in Kubernetes.

srinathrangaramanujam avatar Dec 19 '18 17:12 srinathrangaramanujam

Thank you, that's very helpful!

edenhill avatar Dec 19 '18 21:12 edenhill

@Tenseiga Can you verify what the Debug property was set to for the Trace.txt log run?

edenhill avatar Dec 21 '18 14:12 edenhill

Its "msg,eos,broker,protocol".

srinathrangaramanujam avatar Dec 22 '18 00:12 srinathrangaramanujam

This a fresh set of traces, Might help you

Trace.zip

srinathrangaramanujam avatar Dec 22 '18 00:12 srinathrangaramanujam

After thorough analysis of the log files and idempotence code I've come to the conclusion that this is caused by the message timing out due to message.timeout.ms.

What happens, at a high level is:

  1. messages 1,2,3,4 are sent to the broker
  2. message 1 is successfully acked by the broker
  3. the broker hangs, causing the request timeout to kick in for message 2 that is in-flight
  4. messages 2,3,4 are re-inserted on the producer queue
  5. the partition timeout scanner runs (once per second) and sees that message 2 has timed out, removes message 2. Producer queue is now 3,4
  6. the broker comes up again
  7. producer sends messages from producer queue to the broker: 3,4
  8. the broker expected message 2, not 3, so returns an OutOfOrderSequence error
  9. producer receives the error and raises a fatal exception

While this issue was discussed briefly in the KIP-91 discussion thread, the issue exists in the Java client as well.

I'll look into what the best solution is for this.

A lower message.timeout.ms will trigger this behaviour more frequently, the workaround is thus to increase message.timeout.ms towards the max value.

edenhill avatar Jan 12 '19 08:01 edenhill

@Tenseiga I don't think you were running librdkafka v1.0.0-RC4. The fatal error message was changed in RC3 to include the topic+partition, but in your log message it is missing, so I think you're using an older version.

FATAL [thrd:10.12.8.136:9092/bootstrap]: Fatal error: Broker: Broker received an out of order sequence number: ProduceRequest with 9 message(s) failed with rewound sequence number on broker 0 (PID{Id:240,Epoch:0}, base seq 25084 < next seq 26464): last error _TIMED_OUT (actions Retry,MsgPossiblyPersisted, base seq 26457, 644ms ago)

https://github.com/edenhill/librdkafka/commit/c2af9b3c49cad80a40da1847068c400f71f37b1b#diff-28f3a6d6e1e2b74d67ec19602f826162R2011

I've tried to reproduce the issue on master (RC5 equivalent) without luck, so I think this specific issue has been fixed. Can you verify by upgrading to confluent-kafka-dotnet v1.0-beta2 (which contains librdkafka v1.0.0-RC5) ?

edenhill avatar Jan 14 '19 10:01 edenhill