hang in destructor of RdKafka::KafkaConsumer
Description
I got a process hang with librdkafka v1.8.2 and did some investigation into it.
- The direct reason is that the process is stuck in a thread join: thread 139923336308480 (0x7f4270c1a700 in hex) is in the destructor of RdKafka::KafkaConsumer.

- The thread 0x7f4270c1a700 is joining another thread, 0x7f426ec16700.

- The thread 0x7f426ec16700 is spinning in the loop at https://github.com/edenhill/librdkafka/blob/2d78e928d8c0d798f341b1843c97eb6dcdecefc3/src/rdkafka_broker.c#L5266. rkb->rkb_refcnt is 3, so the termination check there never passes and the loop never exits.

The complete pstack output of the process is attached.
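For context, a minimal sketch of the consumer lifecycle in which the hang appears (the broker address, group id and topic below are placeholders, not taken from the real application; the config values are the ones listed in the checklist). The process blocks at the `delete consumer` step, i.e. in the RdKafka::KafkaConsumer destructor while it joins librdkafka's internal threads:

```cpp
#include <iostream>
#include <string>
#include <vector>

#include <librdkafka/rdkafkacpp.h>

int main() {
  std::string errstr;
  RdKafka::Conf *conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);
  conf->set("bootstrap.servers", "localhost:9092", errstr);  // placeholder broker
  conf->set("group.id", "example-group", errstr);            // placeholder group
  conf->set("enable.partition.eof", "true", errstr);
  conf->set("enable.auto.offset.store", "false", errstr);
  conf->set("auto.offset.reset", "error", errstr);

  RdKafka::KafkaConsumer *consumer =
      RdKafka::KafkaConsumer::create(conf, errstr);
  delete conf;
  if (!consumer) {
    std::cerr << "failed to create consumer: " << errstr << std::endl;
    return 1;
  }

  std::vector<std::string> topics = {"example-topic"};       // placeholder topic
  consumer->subscribe(topics);

  RdKafka::Message *msg = consumer->consume(1000 /* timeout ms */);
  delete msg;            // release Message objects before shutdown

  consumer->close();     // leave the consumer group
  delete consumer;       // destructor joins internal threads; the hang occurs here
  RdKafka::wait_destroyed(5000);
  return 0;
}
```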
How to reproduce
It's an accidental hang and I have not found a way to reproduce it reliably.
Checklist
Please provide the following information:
- [x] librdkafka version (release number or git tag): v1.8.2
- [x] Apache Kafka version: 2.3.0
- [x] librdkafka client configuration: enable.partition.eof=true, enable.auto.offset.store=false, statistics.interval.ms=0, auto.offset.reset=error, api.version.request=true
- [x] Operating system: CentOS Linux release 7.9.2009
- [ ] Provide logs (with debug=.. as necessary) from librdkafka
- [ ] Provide broker log excerpts
- [ ] Critical issue

Attachment: pstack.txt
How about adding the check at https://github.com/edenhill/librdkafka/blob/2d78e928d8c0d798f341b1843c97eb6dcdecefc3/src/rdkafka_broker.c#L5393 to the loop condition to solve this problem?
@edenhill Is this a known issue?
FWIW, I work on a codebase where we don't use RdKafka::KafkaConsumer but wrap librdkafka in our own C++ wrapper, and we see a very reproducible hang in our destructor with exactly these symptoms: it happens whenever the destructor is called while we have a consumer group and the broker is unreachable (e.g. the Kafka server is stopped first, and our consumer is destroyed after that).
Our symptoms are exactly the same: rd_kafka_destroy_app waiting on rd_kafka_destroy_internal waiting on rd_kafka_broker_thread_main, which never exits because rkb->rkb_refcnt == 3. However, I have been unable to make much progress debugging this, because "leaked a refcount somewhere" is such a vague root cause; having the exact same symptom doesn't necessarily mean we have the same bug. This happens for us in 1.8.2, and I have verified that upgrading to 1.9.2 does not fix it (for us).
Make sure all outstanding objects are destroyed prior to calling close.
See https://github.com/edenhill/librdkafka/blob/master/INTRODUCTION.md#termination
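In case it helps others who land here, a minimal sketch of the termination order that INTRODUCTION.md describes (the function and variable names below are made up for illustration): every outstanding Message/Topic/queue object has to be destroyed before close() and delete, otherwise they can hold references that prevent the internal threads from terminating.

```cpp
#include <vector>

#include <librdkafka/rdkafkacpp.h>

// Illustrative shutdown helper; 'pending_msgs' stands in for whatever
// RdKafka::Message (or Topic/queue) objects the application still holds.
void shutdown_consumer(RdKafka::KafkaConsumer *consumer,
                       std::vector<RdKafka::Message *> &pending_msgs) {
  // 1. Release every outstanding message first.
  for (RdKafka::Message *m : pending_msgs)
    delete m;
  pending_msgs.clear();

  // 2. Leave the consumer group and commit final offsets.
  consumer->close();

  // 3. Destroy the handle; this is where librdkafka's threads are joined.
  delete consumer;

  // 4. Optionally wait (up to 5 s here) for all rdkafka objects to be destroyed.
  RdKafka::wait_destroyed(5000);
}
```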
Actually, what we use is the C++ wrapper RdKafka::KafkaConsumer. @edenhill
@rickif The last time we faced a similar issue on the node-rdkafka side (which also uses the C++ wrapper), the cause was that we no longer processed the last incoming rebalance. Not sure this helps, but it helped us.
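In case it is useful, a rough sketch of what "processing the last incoming rebalance" can look like in the C++ wrapper (the class name is made up): the rebalance callback keeps calling assign()/unassign() even for the final ERR__REVOKE_PARTITIONS that arrives during shutdown, rather than ignoring it.

```cpp
#include <vector>

#include <librdkafka/rdkafkacpp.h>

// Hypothetical rebalance callback; the important part is that it still
// handles the final revoke delivered while the application is shutting down.
class ExampleRebalanceCb : public RdKafka::RebalanceCb {
 public:
  void rebalance_cb(RdKafka::KafkaConsumer *consumer,
                    RdKafka::ErrorCode err,
                    std::vector<RdKafka::TopicPartition *> &partitions) override {
    if (err == RdKafka::ERR__ASSIGN_PARTITIONS) {
      consumer->assign(partitions);
    } else {
      // ERR__REVOKE_PARTITIONS, including the last one before termination.
      consumer->unassign();
    }
  }
};

// Registered before creating the consumer, e.g.:
//   ExampleRebalanceCb rebalance_cb;
//   conf->set("rebalance_cb", &rebalance_cb, errstr);
```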
@rickif , have you been able to solve this issue? I ran into it using rdkafka bindings for Rust, except for me this is triggered by deleting the topic that the consumer is subscribed to right before deleting the consumer. The stack trace looks exactly the same. There are no live Kafka objects in the code except the consumer itself.
Upon further investigation, this only happens when I either set enable.auto.offset.store to true or invoke rd_kafka_offset_store (with enable.auto.offset.store set to false).
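For reference, the rough equivalent in the C++ wrapper (with placeholder names; I'm on the Rust bindings, where the same thing happens via rd_kafka_offset_store) of the manual offset-store pattern that triggers it for me:

```cpp
#include <vector>

#include <librdkafka/rdkafkacpp.h>

// Store the offset of a processed message for the next commit.
// (enable.auto.offset.store=false, so the application stores offsets itself.)
void store_offset(RdKafka::KafkaConsumer *consumer, RdKafka::Message *msg) {
  std::vector<RdKafka::TopicPartition *> offsets;
  // offsets_store() expects the next offset to consume, i.e. last offset + 1.
  offsets.push_back(RdKafka::TopicPartition::create(
      msg->topic_name(), msg->partition(), msg->offset() + 1));

  consumer->offsets_store(offsets);

  for (RdKafka::TopicPartition *tp : offsets)
    delete tp;
}
```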