parallel-consumer
Bug: ConsumerOffsetCommitter goes into failure state after broker downtime
During testing we observed that, after a broker outage, the ConsumerOffsetCommitter apparently keeps retrying to commit offsets endlessly without ever succeeding.
Test setup:
- Application is started (4 Java processes reading from the same topic which has 24 partitions)
- Application is processing data without issues
- Kafka Broker is killed
- Application has cached messages that were already polled but not processed yet
- Kafka Broker is started again
- Application starts processing again, but half of the instances have very low throughput
- In the logs we keep seeing this error:
[ERROR] 2022-03-01 09:16:37.934 [pc-broker-poll] i.c.p.i.ConsumerOffsetCommitter - Error committing offsets
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit failed with a retriable exception. You should retry committing the latest consumed offsets.
Caused by: org.apache.kafka.common.errors.DisconnectException
It seems the application tries to process messages it still had cached, but which came from a partition that has since been assigned to another application instance, and therefore the offset commit doesn't work.
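To make the hypothesis above concrete, here is a minimal sketch of the kind of guard that would avoid it: when partitions are revoked, drop the cached records and pending commit offsets for those partitions, so nothing is committed for a partition another instance now owns. This is only an illustration under assumptions. `AssignmentAwareBuffer`, `BufferedRecord` and all method names are hypothetical, partitions are modelled as plain integers, and none of this is parallel-consumer's or the Kafka client's actual API.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a cache of polled-but-unprocessed records.
// Not parallel-consumer's implementation; partitions are plain ints here.
class AssignmentAwareBuffer {

    // Minimal stand-in for a polled record: partition number plus offset.
    static final class BufferedRecord {
        final int partition;
        final long offset;

        BufferedRecord(int partition, long offset) {
            this.partition = partition;
            this.offset = offset;
        }
    }

    private final Set<Integer> assigned = new HashSet<>();
    private final List<BufferedRecord> pending = new ArrayList<>();
    private final Map<Integer, Long> nextOffsetToCommit = new HashMap<>();

    void onPartitionsAssigned(Collection<Integer> partitions) {
        assigned.addAll(partitions);
    }

    // On revocation, forget cached records and commit state for those partitions.
    void onPartitionsRevoked(Collection<Integer> partitions) {
        assigned.removeAll(partitions);
        pending.removeIf(r -> partitions.contains(r.partition));
        nextOffsetToCommit.keySet().removeAll(partitions);
    }

    // Cache a polled record only if its partition is currently assigned.
    boolean add(BufferedRecord record) {
        if (!assigned.contains(record.partition)) {
            return false;
        }
        return pending.add(record);
    }

    // Record a processed offset; ignored if the partition is no longer ours.
    boolean markProcessed(int partition, long offset) {
        if (!assigned.contains(partition)) {
            return false;
        }
        // Kafka convention: the committed offset is the next one to consume.
        nextOffsetToCommit.merge(partition, offset + 1, Math::max);
        return true;
    }

    int pendingCount() {
        return pending.size();
    }

    // Offsets that are safe to commit: only currently assigned partitions.
    Map<Integer, Long> committableOffsets() {
        return new HashMap<>(nextOffsetToCommit);
    }
}
```

With a guard like this, records cached before the broker went down would be discarded as soon as their partition is revoked, and only offsets for partitions the instance still owns would be handed to the committer. Whether that matches what actually happens inside parallel-consumer is exactly what would need to be confirmed.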
Related:
- #185
Hmmm interesting. Can you post the full stack trace? And how long was it left to try to recover after the broker came back online, before you terminated it? What was the replication factor of the partition, and can you confirm a new leader was assigned while the broker was offline?
And how many brokers were in this cluster?
Also, while it's continuing to try to commit, do you know if it's still processing new messages or polling more from the brokers?
Closing Issue