librdkafka icon indicating copy to clipboard operation
librdkafka copied to clipboard

rd_kafka_consumer_close/rd_kafka_destroy remain blocked indefinitely if the broker is unreachable

Open andreamartini opened this issue 1 year ago • 3 comments

Description

I'm usinng librdkafka-2.0.2, and i'm facing with rd_kafka_consumer_close(myHandle) call blocking (once the broker become unreachable). The same problem appears with rd_kafka_destroy(myHandle). if i skip both api calls, a leakage appears each time a connection lost or unreachable broker happen. Is there a way to avoid this call blocking (without running into memory leakage)? thank you in advance

How to reproduce

Simulate lost connection (i,e, using clumsy) before invoke rd_kafka_consumer_close or rd_kafka_destroy. Checklist

Please provide the following information:

  • [x] librdkafka version (release number or git tag): <v2.0.2>
  • [x ] Apache Kafka version: <v2.8.2>
  • [x] librdkafka client configuration: <client.id="somevalue";bootstrap.servers= "my_Ip";group.id= "somevalue"; statistics.interval.ms=1000>
  • [x] Operating system: <Windows10>
  • [ ] Provide logs (with debug=.. as necessary) from librdkafka
  • [ ] Provide broker log excerpts
  • [x ] Critical issue

andreamartini avatar Nov 22 '23 15:11 andreamartini

Maybe try with the latest release of 2.3.0 first

jpiper avatar Jan 10 '24 09:01 jpiper

Thank you for your answer. I get the same problem with 2.3.0 version.

andreamartini avatar Jan 10 '24 13:01 andreamartini

Lately, we have also encountered this pthread_join deadlock issue since we upgraded from librdkafka version 1 to 2. It did not happen frequently, only something like once in several hundreds of API calls. However, the backtrace is always similar:

(gdb) bt 4
#0  0x00007fb495bda4a7 in __pthread_clockjoin_ex ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fb495be57aa in thrd_join ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x0000560106309f68 in rd_kafka_destroy_app ()
#3  0x0000560106309ff4 in rd_kafka_destroy ()
(More stack frames follow...)

Seems like it is waiting for some thread to join, but either the target thread refused to stop or it did not exist. (blocked for over an hour!)

Version

  • librdkafka: 2.2.0
  • Apache kafka: 3.4.0
  • OS: Debian 11
  • glibc: ldd (Debian GLIBC 2.31-13+deb11u5) 2.31

Debugging

Some more Gdb context:

(gdb) fr 3
#3  0x0000560106309ff4 in rd_kafka_destroy ()
(gdb) x/3i rd_kafka_destroy+21
   0x560106309fec <rd_kafka_destroy+21>:        mov    rdi,rax
   0x560106309fef <rd_kafka_destroy+24>:
    call   0x5601063099e7 <rd_kafka_destroy_app>
=> 0x560106309ff4 <rd_kafka_destroy+29>:        nop
(gdb) fr 2
#2  0x0000560106309f68 in rd_kafka_destroy_app ()
(gdb) x/3i rd_kafka_destroy_app+1401
   0x560106309f60 <rd_kafka_destroy_app+1401>:  mov    rdi,rax
   0x560106309f63 <rd_kafka_destroy_app+1404>:
    call   0x56010627a0c0 <thrd_join@plt>
=> 0x560106309f68 <rd_kafka_destroy_app+1409>:  test   eax,eax
(gdb) fr 1
#1  0x00007fb495be57aa in thrd_join ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) x/3i thrd_join+34
   0x7fb495be57a2 <thrd_join+34>:       mov    rsi,rsp
   0x7fb495be57a5 <thrd_join+37>:
    call   0x7fb495bda270 <__pthread_clockjoin_ex>
=> 0x7fb495be57aa <thrd_join+42>:       test   rbx,rbx
(gdb) x/3i __pthread_clockjoin_ex+563
   0x7fb495bda4a3 <__pthread_clockjoin_ex+563>: mov    eax,ebx
   0x7fb495bda4a5 <__pthread_clockjoin_ex+565>: syscall
=> 0x7fb495bda4a7 <__pthread_clockjoin_ex+567>: mov    edi,r8d
  • I wonder if there's a way to find out the target thread ID it is waiting for?

  • With some more effort, I managed to recover the Posix thread_id from frame 2. It turns out that the blocking thread is "rdk:main", and here's its backtrace:
(gdb) bt
#0  0x00007fb495bda4a7 in __pthread_clockjoin_ex ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fb495be57aa in thrd_join ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x000056010630a6b0 in rd_kafka_destroy_internal ()
#3  0x000056010630eac1 in rd_kafka_thread_main ()
#4  0x00007fb495bd8f5a in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fb4955a8a2f in clone ()
   from /lib/x86_64-linux-gnu/libc.so.6

Looks like it is waiting for another thread! Using similar methods, I discovered the pthread struct from frame 2, where $rdi = *(*($rbp-0x20)):

(gdb) x/7i rd_kafka_destroy_internal+1664
   0x56010630a69a <rd_kafka_destroy_internal+1664>:
    mov    rax,QWORD PTR [rbp-0x20]
   0x56010630a69e <rd_kafka_destroy_internal+1668>:
    mov    rax,QWORD PTR [rax]
   0x56010630a6a1 <rd_kafka_destroy_internal+1671>:
    lea    rdx,[rbp-0x68]
   0x56010630a6a5 <rd_kafka_destroy_internal+1675>:
    mov    rsi,rdx
   0x56010630a6a8 <rd_kafka_destroy_internal+1678>:
    mov    rdi,rax
   0x56010630a6ab <rd_kafka_destroy_internal+1681>:
    call   0x56010627a0c0 <thrd_join@plt>
=> 0x56010630a6b0 <rd_kafka_destroy_internal+1686>:
    mov    rax,QWORD PTR [rbp-0x20]

This time the target thread "rdk:broker237" does not seem to be waiting for someone else, but instead in conditional wait:

(gdb) bt
#0  0x00007fb495bdfad8 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fb495be5b39 in cnd_timedwait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x0000560106465256 in cnd_timedwait_abs ()
#3  0x000056010639a492 in rd_kafka_q_pop_serve ()
#4  0x000056010639a532 in rd_kafka_q_pop ()
#5  0x000056010632f777 in rd_kafka_broker_ops_serve ()
#6  0x000056010632f8bc in rd_kafka_broker_ops_io_serve ()
#7  0x00005601063313fa in rd_kafka_broker_consumer_serve ()
#8  0x0000560106331682 in rd_kafka_broker_serve ()
#9  0x00005601063318d6 in rd_kafka_broker_thread_main ()
#10 0x00007fb495bd8f5a in start_thread ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007fb4955a8a2f in clone ()
   from /lib/x86_64-linux-gnu/libc.so.6

So perhaps the main handler failed to wake this thread up from conditional wait? I'm not familiar with the code base, but I hope the above analysis can bring some insight into why it can be blocking on rd_kafka_destroy.

davidhcefx avatar Mar 14 '24 07:03 davidhcefx