
hang in destructor of RdKafka::KafkaConsumer

Open rickif opened this issue 3 years ago • 6 comments


Description

A process hung while using librdkafka v1.8.2. I investigated and found the following:

  1. The immediate cause is that the process is stuck in the destructor of RdKafka::KafkaConsumer, joining thread 139923336308480 (0x7f4270c1a700 in hex). (screenshot)
  2. Thread 0x7f4270c1a700 is in turn joining another thread, 0x7f426ec16700. (screenshot)
  3. Thread 0x7f426ec16700 is spinning in the loop at https://github.com/edenhill/librdkafka/blob/2d78e928d8c0d798f341b1843c97eb6dcdecefc3/src/rdkafka_broker.c#L5266 (screenshot). rkb->rkb_refcnt is 3, so the check at https://github.com/edenhill/librdkafka/blob/2d78e928d8c0d798f341b1843c97eb6dcdecefc3/src/rdkafka_broker.c#L5266 never passes and the loop never exits. (screenshot)

The complete pstack output of the process is attached.

How to reproduce

The hang is intermittent, and I have not found a way to reproduce it reliably.


Checklist


Please provide the following information:

  • [x] librdkafka version (release number or git tag): v1.8.2
  • [x] Apache Kafka version: 2.3.0
  • [x] librdkafka client configuration: enable.partition.eof=true, enable.auto.offset.store=false, statistics.interval.ms=0, auto.offset.reset=error, api.version.request=true
  • [x] Operating system: CentOS Linux release 7.9.2009
  • [ ] Provide logs (with debug=.. as necessary) from librdkafka
  • [ ] Provide broker log excerpts
  • [ ] Critical issue

Attachment: pstack.txt

rickif • Aug 26 '22 12:08

How about adding the check at https://github.com/edenhill/librdkafka/blob/2d78e928d8c0d798f341b1843c97eb6dcdecefc3/src/rdkafka_broker.c#L5393 to the loop condition to solve this problem?

rickif • Aug 26 '22 12:08

@edenhill Is this a known issue?

ruicao93 • Sep 12 '22 12:09

FWIW, I work on a codebase where we don't use RdKafka::KafkaConsumer but we do wrap librdkafka in our own C++ wrapper; and we very reproducibly see a hang in our destructor with exactly these symptoms, whenever the destructor is called, if we have a consumer group and if the broker is unreachable (e.g. the Kafka server is stopped first, then after that our consumer is destroyed). Our symptoms are exactly the same: rd_kafka_destroy_app waiting on rd_kafka_destroy_internal waiting on rd_kafka_broker_thread_main, which never exits because rkb->rkb_refcnt == 3. However, I have been unable to make much progress debugging this, because "leaked a refcount somewhere" is such a vague root cause. We have the exact same symptom but that doesn't mean that we necessarily have the same bug at all. This happens for us in 1.8.2, and I have verified that upgrading to 1.9.2 doesn't fix it (for us).

Quuxplusone • Sep 23 '22 17:09

Make sure all outstanding objects are destroyed prior to calling close.

See https://github.com/edenhill/librdkafka/blob/master/INTRODUCTION.md#termination

edenhill • Oct 03 '22 12:10

> Make sure all outstanding objects are destroyed prior to calling close.
>
> See https://github.com/edenhill/librdkafka/blob/master/INTRODUCTION.md#termination

@edenhill Actually, we are using the C++ wrapper RdKafka::KafkaConsumer.

rickif • Jan 05 '23 09:01

@rickif The last time we faced a similar issue in node-rdkafka (which also uses the C++ wrapper), the cause was that we were no longer processing the last incoming rebalance. Not sure this helps, but it helped us.

robinfehr • Jan 05 '23 09:01

@rickif, have you been able to solve this issue? I ran into it using the rdkafka bindings for Rust, except that for me it is triggered by deleting the topic the consumer is subscribed to right before deleting the consumer. The stack trace looks exactly the same. There are no live Kafka objects in the code except the consumer itself.

ryzhyk • Sep 01 '23 05:09

Upon further investigation, this only happens when I either set enable.auto.offset.store to true or invoke rd_kafka_offset_store (with enable.auto.offset.store set to false).

ryzhyk • Sep 01 '23 05:09