
Chasing a segfault during termination in rdkafka-ruby followup

Open thijsc opened this issue 2 years ago

Following up on #3981

We have been working on various approaches to make the Ruby gem segfault-free, but so far have not succeeded. The latest and most successful approach, in https://github.com/appsignal/rdkafka-ruby/pull/227, makes segfaults on exit very rare, but they still sometimes happen. With this approach we use mutexes to guarantee that destroy never runs at the same time as polling or any other access to the native client; after destroy, the native client can no longer be accessed.
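
Roughly, the guard looks like the sketch below. This is illustrative, not the actual PR code; the wrapper class and method names are placeholders, and the Rdkafka::Bindings calls stand in for the gem's FFI bindings to rd_kafka_poll / rd_kafka_destroy:

# Minimal sketch of the mutex-guarded access pattern (illustrative names,
# not the actual code from the PR).
class NativeClientWrapper
  def initialize(native_handle)
    @native = native_handle   # FFI pointer returned by rd_kafka_new
    @mutex  = Mutex.new       # serializes every native access against destroy
  end

  # All access to the native handle goes through the mutex.
  def with_native
    @mutex.synchronize do
      return nil if @native.nil?  # already destroyed, refuse access
      yield @native
    end
  end

  def poll(timeout_ms)
    with_native { |ptr| Rdkafka::Bindings.rd_kafka_poll(ptr, timeout_ms) }
  end

  # Destroy can never overlap with a poll that holds the mutex, and it is
  # idempotent: the handle is nil'ed out so later calls see it as gone.
  def close
    @mutex.synchronize do
      return if @native.nil?
      Rdkafka::Bindings.rd_kafka_destroy(@native)
      @native = nil
    end
  end
end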

I'm not sure where to go from here. I did notice that not destroying the native client also leads to the process crashing with a segfault. I'm wondering whether the Ruby process sometimes exits before the finalizers that call the destroy logic have had a chance to run. What is the expected behavior if you don't destroy the native client at all?
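
For reference, the finalizer-based destroy I mean is essentially this pattern (a sketch with illustrative names; the binding call is an assumption about how the gem hooks into rd_kafka_destroy):

# Sketch of a GC-finalizer-based destroy (illustrative, not the gem's exact code).
require "rdkafka"

class Client
  def initialize(native_handle)
    @native = native_handle
    # The finalizer normally runs when the object is collected or at interpreter
    # shutdown, but it is skipped on Kernel#exit! and may not run if the process
    # is killed outright -- which is why I suspect destroy sometimes doesn't run
    # in time.
    ObjectSpace.define_finalizer(self, self.class.finalizer(native_handle))
  end

  # The finalizer proc must not capture `self`, or the object never gets GC'd.
  def self.finalizer(native_handle)
    proc { Rdkafka::Bindings.rd_kafka_destroy(native_handle) }
  end
end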

Do you have any further tips on stuff we could look at?

thijsc avatar Dec 15 '22 14:12 thijsc

If you don't call rd_kafka_destroy() the client will just continue its business in its background threads.

I don't know how well the Ruby runtime plays with valgrind or asan, but I would try those two approaches to get some hints.

If that doesn't work, use gdb and start printf():ing pointers in librdkafka so you can track down, from the crash in gdb, where the offending piece of memory belongs - gnarly work.

edenhill avatar Dec 15 '22 15:12 edenhill

We're running into the same issue as #3981.

Here's the relevant backtrace:

#4  0x000055eecec7824d in sigsegv (sig=11, info=0x7f76f5e5c930, ctx=0x7f76f5e5c800) at signal.c:964
#5  <signal handler called>
#6  0x00007f770525faab in __pthread_clockjoin_ex (threadid=threadid@entry=140143353931520, thread_return=thread_return@entry=0x7f75e4ad37e0, clockid=clockid@entry=0, 
    abstime=abstime@entry=0x0, block=block@entry=true) at pthread_join_common.c:89
#7  0x00007f770526bb7e in thrd_join (thr=thr@entry=140143353931520, res=res@entry=0x7f75e4ad382c) at thrd_join.c:25
#8  0x00007f76f3f8aac7 in rd_kafka_destroy_app (rk=0x7f76c091c800, flags=<optimized out>) at rdkafka.c:1107
#9  0x00007f76fa5c0ff5 in ?? () from /lib/x86_64-linux-gnu/libffi.so.7
#10 0x00007f76fa5c040a in ?? () from /lib/x86_64-linux-gnu/libffi.so.7
#11 0x00007f76f41987e9 in rbffi_CallFunction (argc=<optimized out>, argv=<optimized out>, function=0x7f76f3f8b470 <rd_kafka_destroy>, fnInfo=0x7f76f78466a0) at Call.c:400
#12 0x00007f76f419c787 in custom_trampoline (argc=<optimized out>, argv=<optimized out>, self=<optimized out>, handle=<optimized out>) at MethodHandle.c:220
#13 0x000055eeced07ae5 in vm_call_cfunc_with_frame (ec=0x7f76ecfb46d0, reg_cfp=0x7f75e4bd4150, calling=<optimized out>) at vm_insnhelper.c:3270

It's segfaulting on the call to thrd_join on rk_thread. But there doesn't seem to be a thread running rd_kafka_thread_main (they all appear to be Ruby threads):

(gdb) info thread
  Id   Target Id                         Frame 
* 1    Thread 0x7f75e4bd6700 (LWP 65954) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
  2    Thread 0x7f76a99ff700 (LWP 65861) 0x00007f770515ba96 in __ppoll (fds=fds@entry=0x7f76a98fcdd8, nfds=nfds@entry=1, timeout=<optimized out>, timeout@entry=0x0, 
    sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
  3    Thread 0x7f7704e63000 (LWP 65800) futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffd3c6cd3e0, clockid=<optimized out>, expected=0, futex_word=0x7f7704a4210c)
    at ../sysdeps/nptl/futex-internal.h:320
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f76a99ff700 (LWP 65861))]
#0  0x00007f770515ba96 in __ppoll (fds=fds@entry=0x7f76a98fcdd8, nfds=nfds@entry=1, timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:44
44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
(gdb) bt
#0  0x00007f770515ba96 in __ppoll (fds=fds@entry=0x7f76a98fcdd8, nfds=nfds@entry=1, timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x000055eececbe1c0 in rb_sigwait_sleep (th=th@entry=0x7f76f2c06500, sigwait_fd=sigwait_fd@entry=3, rel=rel@entry=0x0) at hrtime.h:158
#2  0x000055eececc1d51 in native_sleep (th=th@entry=0x7f76f2c06500, rel=0x0) at thread_pthread.c:2346
#3  0x000055eececc1ff0 in sleep_forever (fl=1, th=<optimized out>) at thread.c:1274
#4  rb_thread_sleep_deadly_allow_spurious_wakeup (blocker=<optimized out>, timeout=<optimized out>, end=<optimized out>) at thread.c:1398
#5  0x000055eececc2138 in queue_sleep (_args=_args@entry=140147627642848) at thread_sync.c:1007
#6  0x000055eeceb1cc54 in rb_ensure (b_proc=b_proc@entry=0x55eececc2120 <queue_sleep>, data1=data1@entry=140147627642848, e_proc=e_proc@entry=0x55eececb59f0 <queue_sleep_done>, 
    data2=data2@entry=140147627642880) at eval.c:1007
#7  0x000055eececbe796 in queue_do_pop (self=self@entry=140148584338040, q=q@entry=0x7f76c6217e00, should_block=should_block@entry=1, timeout=timeout@entry=4) at thread_sync.c:1082
#8  0x000055eececbe883 in szqueue_do_pop (timeout=4, should_block=1, self=140148584338040) at thread_sync.c:1302
#9  rb_szqueue_pop (ec=<optimized out>, self=140148584338040, non_block=<optimized out>, timeout=4) at thread_sync.c:1313
#10 0x000055eeced1bb65 in invoke_bf (reg_cfp=<optimized out>, bf=<optimized out>, bf=<optimized out>, argv=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:6331
#11 vm_invoke_builtin_delegate (start_index=<optimized out>, bf=<optimized out>, cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:6331
#12 vm_exec_core (ec=0x7f76f2c0b2d0, initial=1) at insns.def:1501
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f7704e63000 (LWP 65800))]
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffd3c6cd3e0, clockid=<optimized out>, expected=0, futex_word=0x7f7704a4210c) at ../sysdeps/nptl/futex-internal.h:320
320     ../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffd3c6cd3e0, clockid=<optimized out>, expected=0, futex_word=0x7f7704a4210c) at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7ffd3c6cd3e0, clockid=<optimized out>, mutex=0x7f7704a7c158, cond=0x7f7704a420e0) at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=cond@entry=0x7f7704a420e0, mutex=mutex@entry=0x7f7704a7c158, abstime=abstime@entry=0x7ffd3c6cd3e0) at pthread_cond_wait.c:665
#3  0x000055eececbc321 in native_cond_timedwait (abs=0x55eecf0bcda0 <abs>, mutex=0x7f7704a7c158, cond=0x7f7704a420e0) at thread_pthread.c:685
#4  do_gvl_timer (th=0x7f7704a6c000, sched=0x7f7704a7c150) at thread_pthread.c:350
#5  thread_sched_to_running_common (sched=sched@entry=0x7f7704a7c150, th=th@entry=0x7f7704a6c000) at thread_pthread.c:396
#6  0x000055eececbde41 in thread_sched_to_running (th=0x7f7704a6c000, sched=0x7f7704a7c150) at thread_pthread.c:430
#7  native_ppoll_sleep (th=th@entry=0x7f7704a6c000, rel=rel@entry=0x7ffd3c6cd530) at thread_pthread.c:2327
#8  0x000055eececbf93d in native_sleep (th=0x7f7704a6c000, rel=0x7ffd3c6cd530) at thread_pthread.c:2359
#9  0x000055eececc2e04 in sleep_hrtime (fl=2, rel=<optimized out>, th=0x7f7704a6c000) at thread.c:1325
#10 rb_thread_wait_for (time=...) at thread.c:1408

peterzhu2118 avatar Aug 29 '23 19:08 peterzhu2118

@peterzhu2118 what rdkafka-ruby version (maintainer here)?

mensfeld avatar Aug 29 '23 19:08 mensfeld

We're running 0.13.0 of rdkafka-ruby.

peterzhu2118 avatar Aug 29 '23 19:08 peterzhu2118

@peterzhu2118 also can you provide some more info:

  • what Ruby version?
  • what librdkafka version and rdkafka-ruby version?
  • can you try one of the higher-level libraries (karafka / waterdrop) and see if they exhibit the same behaviour? I have a different finalization flow for both the producer and the consumer there (opinionated, but it has proven to be extremely stable under stress)
  • does it happen for the producer or the consumer?
  • how does the shutdown look in your code, i.e. what is the finalization flow? (for comparison, a typical explicit flow is sketched below)
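
For comparison, a typical explicit shutdown flow with rdkafka-ruby looks roughly like this sketch (method names are the gem's public API as I know it, but double-check them against your version; broker address and group id are assumed):

require "rdkafka"

config = Rdkafka::Config.new(
  "bootstrap.servers" => "localhost:9092",
  "group.id"          => "example-group"   # required for the consumer
)

producer = config.producer
handle   = producer.produce(topic: "events", payload: "hello")
handle.wait       # wait for the delivery report before shutting down
producer.close    # releases the native producer handle deterministically

consumer = config.consumer
consumer.subscribe("events")
# ... consume ...
consumer.close    # leaves the group and destroys the native consumer handle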

mensfeld avatar Aug 29 '23 19:08 mensfeld

Btw, to avoid spamming librdkafka since this may be on our side, can you open an issue there and just transfer the questions/answers? Once we conclude where the issue lies, we can optionally continue it here. The last few issues were indeed on our side and two were not :rofl: so I don't want to point fingers at anyone (including myself).

mensfeld avatar Aug 29 '23 19:08 mensfeld

@mensfeld I opened appsignal/rdkafka-ruby#278. I debugged this further and I think appsignal/rdkafka-ruby#277 might fix the issue.

peterzhu2118 avatar Aug 29 '23 20:08 peterzhu2118

@emasab this issue can be closed:

  1. I hunted down all the issues coming from our side
  2. I applied fixes
  3. Any remaining problems related to librdkafka have their own separate issues, and based on your info they are being worked on
  4. The "new" reports come from misuse of rdkafka-ruby and I can replicate them. The operations involved also go against the shutdown procedures recommended for librdkafka, so they are not errors in the code itself.

mensfeld avatar Aug 30 '23 09:08 mensfeld

cc @thijsc you can close it as well :pray:

mensfeld avatar Aug 30 '23 09:08 mensfeld