Sarama Async Producer Encounters 'Out of Order' Error: what are the reasons?
Description
We hit an error roughly once every few weeks while using the async producer in our Kafka setup. The error message is:
assertion failed: message out of sequence added to a batch
This error seems to originate from the following line in the Sarama library: produce_set.go#L89
The occurrence of this error is sporadic, and we are struggling to understand the underlying cause or identify any corrective measures. It appears that, occasionally, messages are being added to the batch in an incorrect order.
We are seeking insights or suggestions on what might be triggering this error. We considered network issues as a potential cause, but found no corresponding logs or indicators at the times the error occurs.
Versions
| Sarama | Kafka | Go |
|---|---|---|
| v1.42.1 | 2.6.2 | 1.20.6 |
Configuration
config := sarama.NewConfig()
config.Version = version
config.Consumer.Group.Rebalance.Strategy = sarama.NewBalanceStrategySticky()
config.Producer.RequiredAcks = sarama.WaitForAll
config.Producer.Idempotent = true
config.Net.MaxOpenRequests = 1
config.Producer.Retry.Max = 100000
config.Producer.Retry.Backoff = 100 * time.Millisecond
config.Producer.Return.Successes = true
config.Producer.Return.Errors = true
config.Producer.Partitioner = sarama.NewHashPartitioner
Logs
We are facing the error detailed at the following location: produce_set.go#L89
Additional Context
All messages are dispatched using an asynchronous producer, configured with a high retry count to ensure message delivery even in the event of transient Kafka broker failures. Despite this, we observe that occasionally a message fails to be added to the batch, rendering it ineligible for any retry mechanism in Sarama.
With the setup described above, we have also recently seen a few instances of "The broker received an out of order sequence number" errors. These are very rare as well, but we wonder whether they indicate a problem with how messages are being pushed that leads to incorrect ordering.
So this appears to be an ordering issue / race condition between new batches being produced and batches being retried in the idempotent producer:
https://github.com/IBM/sarama/blob/f21c5125746f9d10fd731dfdff54a494098626d1/async_producer.go#L1144-L1148
This shouldn't occur with config.Net.MaxOpenRequests = 1, but we have had other reports (e.g., https://github.com/IBM/sarama/issues/2619) suggesting that the introduction of request pipelining inadvertently changed the producer's behaviour so that it lost some of its ordering guarantees.
Thank you, @dnwe. Is there currently someone addressing this issue? If not, we're willing to assist and contribute to a solution. Could you provide some guidance on where we might start or what to look into?
I was able to reproduce with a simple async producer that sets:
config.Net.MaxOpenRequests = 1
config.Producer.Idempotent = true
In my case, the trigger for the "assertion failed: message out of sequence added to a batch" error is interrupting network connectivity between the Sarama client and the brokers (connecting to or disconnecting from a VPN).
I don't see the same problem if I switch to using the sync producer in a loop (keeping the same configuration). I suspect this is because my test program will block until Kafka acks each message - effectively preventing the possibility of there being more than one request in flight at any time.
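A toy model of that difference, using only the standard library, is sketched below. None of it is Sarama code: `flakyBroker`, `produceSync`, and `produceAsyncNaive` are hypothetical names, and the "async" variant is a deliberately naive producer that keeps sending new messages while a failed one waits on a retry queue. It only illustrates how blocking on each ack rules out reordering, while interleaving retries with new sends does not.

```go
package main

import "fmt"

// flakyBroker is a hypothetical broker that rejects the first failuresLeft
// send attempts (simulating a network interruption) and records the sequence
// numbers of every message it accepts afterwards.
type flakyBroker struct {
	failuresLeft int
	received     []int
}

// send reports whether the message was acknowledged.
func (b *flakyBroker) send(seq int) bool {
	if b.failuresLeft > 0 {
		b.failuresLeft--
		return false
	}
	b.received = append(b.received, seq)
	return true
}

// produceSync sends one message at a time and retries it until it is
// acknowledged before moving on to the next one.
func produceSync(b *flakyBroker, n int) {
	for seq := 0; seq < n; seq++ {
		for !b.send(seq) {
		}
	}
}

// produceAsyncNaive queues failed messages for retry but keeps sending new
// messages in the meantime, then drains the retry queue.
func produceAsyncNaive(b *flakyBroker, n int) {
	var retries []int
	for seq := 0; seq < n; seq++ {
		if !b.send(seq) {
			retries = append(retries, seq)
		}
	}
	for len(retries) > 0 {
		seq := retries[0]
		retries = retries[1:]
		if !b.send(seq) {
			retries = append(retries, seq)
		}
	}
}

// inOrder reports whether seqs is strictly increasing.
func inOrder(seqs []int) bool {
	for i := 1; i < len(seqs); i++ {
		if seqs[i] <= seqs[i-1] {
			return false
		}
	}
	return true
}

func main() {
	syncBroker := &flakyBroker{failuresLeft: 1}
	produceSync(syncBroker, 3)
	fmt.Println("sync: ", syncBroker.received, inOrder(syncBroker.received)) // [0 1 2] true

	asyncBroker := &flakyBroker{failuresLeft: 1}
	produceAsyncNaive(asyncBroker, 3)
	fmt.Println("async:", asyncBroker.received, inOrder(asyncBroker.received)) // [1 2 0] false
}
```

With a single failed attempt, the blocking loop delivers 0, 1, 2 in order, while the naive async variant delivers 1 and 2 before the retried 0.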
Should be fixed by #2943 if someone can review