hono Kafka telemetry producer may reach state where OUT_OF_ORDER_SEQUENCE

We have observed on two occasions, that Kafka telemetry producers in a protocol adapter reached a state where such errors get logged:

WARN org.apache.kafka.clients.producer.internals.Sender [Producer clientId=hono-http-adapter-producer-telemetry-b2e48d79-089d-45a0-bcd7-4db4c4100ecf] Got error produce response with correlation id 24213 on topic-partition hono.telemetry.thealthcheckkafka1-8, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER

In the metrics, this was seen in a higher retry rate.

For most producers, the errors stopped at some point (e.g. after one hour).

According to the documentation, continued use of such producers "risks reordering of sent records".

Looking at the time the errors started, there were these log entries:

Dec 8, 2021 @ 09:39:43.371    [Producer clientId=hono-http-adapter-producer-telemetry-uid] Received invalid metadata error in produce request on partition hono.telemetry.my-tenant-8 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now

Dec 8, 2021 @ 09:39:43.371    [Producer clientId=hono-http-adapter-producer-telemetry-uid] Got error produce response with correlation id 24125 on topic-partition hono.telemetry.my-tenant-8, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION

Dec 8, 2021 @ 09:39:47.121    org.eclipse.hono.client.kafka.producer.CachingKafkaProducerFactory
producer error occurred [clientId: hono-http-adapter-producer-telemetry-uid]
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for hono.telemetry.my-tenant-8:5001 ms has passed since batch creation 

Dec 8, 2021 @ 09:39:47.737    [Producer clientId=hono-http-adapter-producer-telemetry-uid] Got error produce response with correlation id 24133 on topic-partition hono.telemetry.thealthcheckkafka0-6, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER

So, an exception caught in Hono code here before the OUT_OF_ORDER_SEQUENCE_NUMBER errors is TimeoutException: Expiring 1 record(s) for [topic]:5001 ms has passed since batch creation.

It should be evaluated, whether we might better stop using a producer after such an exception (e.g. by letting CachingKafkaProducerFactory#getOrCreateProducer return a new producer for the next record). The original producer should then be closed eventually. Note that the Expiring x records exception was repeated x times (i.e. for each expired record).

Dec 09 '21 16:12 calohmn

Log entries corresponding to the recurring errors:

Jan 1, 2022 @ 23:38:56.310  WARN   org.apache.kafka.clients.producer.internals.Sender  [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] Got error produce response with correlation id 55429 on topic-partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_1-0, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER
Jan 1, 2022 @ 23:38:56.310  INFO  org.apache.kafka.clients.producer.internals.TransactionManager  [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] ProducerId set to 3896943 with epoch 25299
Jan 1, 2022 @ 23:38:56.310  INFO  org.apache.kafka.clients.producer.internals.ProducerBatch  Resetting sequence number of batch with current sequence 1 for partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_1-0 to 0

or

Jan 1, 2022 @ 23:39:16.372  WARN  org.apache.kafka.clients.producer.internals.Sender  [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] Got error produce response with correlation id 55441 on topic-partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_0-0, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER
Jan 1, 2022 @ 23:39:16.372  WARN  org.apache.kafka.clients.producer.internals.Sender  [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] Got error produce response with correlation id 55442 on topic-partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_1-0, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER
Jan 1, 2022 @ 23:39:16.372  INFO  org.apache.kafka.clients.producer.internals.TransactionManager  [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] ProducerId set to 3896943 with epoch 25305
Jan 1, 2022 @ 23:39:16.372  INFO  org.apache.kafka.clients.producer.internals.ProducerBatch  Resetting sequence number of batch with current sequence 1 for partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_0-0 to 0
Jan 1, 2022 @ 23:39:16.372  INFO  org.apache.kafka.clients.producer.internals.TransactionManager  [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] ProducerId set to 3896943 with epoch 25306
Jan 1, 2022 @ 23:39:16.372  INFO  org.apache.kafka.clients.producer.internals.ProducerBatch  Resetting sequence number of batch with current sequence 1 for partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_1-0 to 0

It seems that each new "send" batch first gets the OutOfOrderSequenceException, then the retry succeeds. The number of send operations is roughly twice the number of OutOfOrderSequenceException errors.

The epoch in ProducerId set to 3896943 with epoch 25306 is incremented each time.

The errors seem to resolve itself, after the epoch has reached the maximum value (32767), so that it is then reset to -1 : ProducerId set to -1 with epoch -1.

The errors here were seen on a cluster with Hono 1.10, using Kafka clients with version 2.6.0. In Kafka 2.8.1, used in Hono 1.11, there is a bug fixed that might be related to the above errors: KAFKA-12152 ("Idempotent Producer does not reset the sequence number of partitions without in-flight batches") So it might be, that the issue doesn't occur anymore with Hono 1.11. This is to be tested.

Jan 04 '22 14:01 calohmn

@calohmn any news on this one? Is this still relevant?

Feb 02 '22 10:02 sophokles73

I'm still monitoring whether this issue still occurs on a system updated to Hono 1.11 / Kafka 2.8.1. Therefore re-scheduling this for after Hono 1.12.

Feb 03 '22 14:02 calohmn

This issue wasn't observed any more using a Kafka client 2.8.x version including some fixes backported from later Kafka versions. Therefore I think this issue can be closed when doing the Kafka client 3.2.1 upgrade in #3368.

Aug 19 '22 08:08 calohmn

Used Kafka client has been upgraded to 3.2.1, therefore this issue should be obsolete.

Aug 22 '22 11:08 calohmn

Kafka telemetry producer may reach state where OUT_OF_ORDER_SEQUENCE_NUMBER errors get thrown