Kafka telemetry producer may reach state where OUT_OF_ORDER_SEQUENCE_NUMBER errors get thrown
We have observed on two occasions, that Kafka telemetry producers in a protocol adapter reached a state where such errors get logged:
WARN org.apache.kafka.clients.producer.internals.Sender [Producer clientId=hono-http-adapter-producer-telemetry-b2e48d79-089d-45a0-bcd7-4db4c4100ecf] Got error produce response with correlation id 24213 on topic-partition hono.telemetry.thealthcheckkafka1-8, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER
In the metrics, this was seen in a higher retry rate.
For most producers, the errors stopped at some point (e.g. after one hour).
According to the documentation, continued use of such producers "risks reordering of sent records".
Looking at the time the errors started, there were these log entries:
Dec 8, 2021 @ 09:39:43.371 [Producer clientId=hono-http-adapter-producer-telemetry-uid] Received invalid metadata error in produce request on partition hono.telemetry.my-tenant-8 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now
Dec 8, 2021 @ 09:39:43.371 [Producer clientId=hono-http-adapter-producer-telemetry-uid] Got error produce response with correlation id 24125 on topic-partition hono.telemetry.my-tenant-8, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
Dec 8, 2021 @ 09:39:47.121 org.eclipse.hono.client.kafka.producer.CachingKafkaProducerFactory
producer error occurred [clientId: hono-http-adapter-producer-telemetry-uid]
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for hono.telemetry.my-tenant-8:5001 ms has passed since batch creation
Dec 8, 2021 @ 09:39:47.737 [Producer clientId=hono-http-adapter-producer-telemetry-uid] Got error produce response with correlation id 24133 on topic-partition hono.telemetry.thealthcheckkafka0-6, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER
So, an exception caught in Hono code here before the OUT_OF_ORDER_SEQUENCE_NUMBER errors is
TimeoutException: Expiring 1 record(s) for [topic]:5001 ms has passed since batch creation.
It should be evaluated, whether we might better stop using a producer after such an exception (e.g. by letting CachingKafkaProducerFactory#getOrCreateProducer return a new producer for the next record). The original producer should then be closed eventually.
Note that the Expiring x records exception was repeated x times (i.e. for each expired record).
Log entries corresponding to the recurring errors:
Jan 1, 2022 @ 23:38:56.310 WARN org.apache.kafka.clients.producer.internals.Sender [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] Got error produce response with correlation id 55429 on topic-partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_1-0, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER
Jan 1, 2022 @ 23:38:56.310 INFO org.apache.kafka.clients.producer.internals.TransactionManager [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] ProducerId set to 3896943 with epoch 25299
Jan 1, 2022 @ 23:38:56.310 INFO org.apache.kafka.clients.producer.internals.ProducerBatch Resetting sequence number of batch with current sequence 1 for partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_1-0 to 0
or
Jan 1, 2022 @ 23:39:16.372 WARN org.apache.kafka.clients.producer.internals.Sender [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] Got error produce response with correlation id 55441 on topic-partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_0-0, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER
Jan 1, 2022 @ 23:39:16.372 WARN org.apache.kafka.clients.producer.internals.Sender [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] Got error produce response with correlation id 55442 on topic-partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_1-0, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER
Jan 1, 2022 @ 23:39:16.372 INFO org.apache.kafka.clients.producer.internals.TransactionManager [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] ProducerId set to 3896943 with epoch 25305
Jan 1, 2022 @ 23:39:16.372 INFO org.apache.kafka.clients.producer.internals.ProducerBatch Resetting sequence number of batch with current sequence 1 for partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_0-0 to 0
Jan 1, 2022 @ 23:39:16.372 INFO org.apache.kafka.clients.producer.internals.TransactionManager [Producer clientId=hono-cmd-router-internal-cmd-sender-uuid] ProducerId set to 3896943 with epoch 25306
Jan 1, 2022 @ 23:39:16.372 INFO org.apache.kafka.clients.producer.internals.ProducerBatch Resetting sequence number of batch with current sequence 1 for partition hono.command_internal.hono-adapter-amqp-vertx-66f985899f-hsk85_74894f5112a0_1-0 to 0
It seems that each new "send" batch first gets the OutOfOrderSequenceException, then the retry succeeds. The number of send operations is roughly twice the number of OutOfOrderSequenceException errors.
The epoch in ProducerId set to 3896943 with epoch 25306 is incremented each time.
The errors seem to resolve itself, after the epoch has reached the maximum value (32767), so that it is then reset to -1 :
ProducerId set to -1 with epoch -1.
The errors here were seen on a cluster with Hono 1.10, using Kafka clients with version 2.6.0. In Kafka 2.8.1, used in Hono 1.11, there is a bug fixed that might be related to the above errors: KAFKA-12152 ("Idempotent Producer does not reset the sequence number of partitions without in-flight batches") So it might be, that the issue doesn't occur anymore with Hono 1.11. This is to be tested.
@calohmn any news on this one? Is this still relevant?
I'm still monitoring whether this issue still occurs on a system updated to Hono 1.11 / Kafka 2.8.1. Therefore re-scheduling this for after Hono 1.12.
This issue wasn't observed any more using a Kafka client 2.8.x version including some fixes backported from later Kafka versions. Therefore I think this issue can be closed when doing the Kafka client 3.2.1 upgrade in #3368.
Used Kafka client has been upgraded to 3.2.1, therefore this issue should be obsolete.