rdkafka-ruby
Chasing the last segfault
There is one more segfault that we sometimes see happening. It can happen when the Ruby process exits, which leads me to believe it has something to do with the termination order.
I tried to reproduce on Semaphore using:
require "rdkafka"
config = {
:"bootstrap.servers" => "localhost:9092",
}
Rdkafka::Config.new(config).admin.create_topic('test', 16, 1)
That however does not crash. @mensfeld do you have any other examples I could try?
There is one more segfault that we sometimes see happening.
More than one ;)
And this one crashes if you run it a few times:
[karafka2.0 (master)]$ bundle exec ruby test.rb
[karafka2.0 (master)]$ bundle exec ruby test.rb
[karafka2.0 (master)]$ bundle exec ruby test.rb
[karafka2.0 (master)]$ bundle exec ruby test.rb
Segmentation fault (core dumped)
[karafka2.0 (master)]$ bundle exec ruby test.rb
Segmentation fault (core dumped)
[karafka2.0 (master)]$
test.rb is exactly what you have above.
I will try to give you more in the upcoming days but running this in a loop in bash crashes it.
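The bash loop described above can be sketched as a small Ruby helper as well (the helper name and structure here are illustrative, not part of this repo): run the repro command repeatedly and report the first run that exits abnormally, since a SIGSEGV shows up as a non-success process status.

```ruby
# Sketch: run a command repeatedly and return the 1-based index of the first
# failing run, or nil if every run succeeded. For this thread the command
# would be %w[bundle exec ruby test.rb].
def first_failing_run(cmd, attempts: 20)
  attempts.times do |i|
    system(*cmd)
    # $? is nil if the command could not even be spawned; treat that as failure too
    return i + 1 unless $?&.success?
  end
  nil
end
```

On Linux a segfaulting child reports `termsig == 11` on `$?`, which distinguishes a crash from an ordinary non-zero exit.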
Thanks. I am trying to add finalizers for the consumer and admin classes too. That might help, this seems to have something to do with cleanup at exit.
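For context, a minimal sketch of the finalizer pattern (the class and method names below are illustrative, not rdkafka-ruby's actual internals): the key constraint is that the finalizer proc must not capture `self`, or the object can never be collected and the finalizer never runs.

```ruby
# Hypothetical wrapper around a native handle that releases it via a GC
# finalizer. The proc is built in class scope so it closes over the handle
# only, not the instance.
class NativeClientWrapper
  def initialize(native_handle)
    @native = native_handle
    ObjectSpace.define_finalizer(self, self.class.finalizer(@native))
  end

  # Finalizer procs receive the object id of the collected object
  def self.finalizer(native)
    proc { |_object_id| native.close }
  end
end
```

Note that finalizers registered this way also run during `rb_objspace_call_finalizer` at process exit, which is exactly the code path visible in the crash backtraces below, so their ordering relative to librdkafka's own threads matters.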
@thijsc jemalloc may also cause more frequent crashes.
Oeh, got valgrind to print some useful stuff: https://appsignal.semaphoreci.com/jobs/163b1af4-7993-43c5-8f8b-4a5fa73d0936
I've improved the cleanup logic in #211, let's see if stuff still segfaults after this.
Still some stuff crashing...
I've opened https://github.com/edenhill/librdkafka/issues/3981 to see if they have an idea what we should look for next.
I saw ;) I keep track of changes in librdkafka on a daily basis ;)
When you are done with #211 feel free to ping me. I can run extensive integrations from karafka on it (ref: https://github.com/karafka/karafka/tree/master/spec/integrations). They heavily stress both those bindings and librdkafka.
I was able to extract a crash report with some C backtrace information. Something is happening in the destroy process. I checked the termination order docs again and am quite confident we're doing it right. Hopefully @edenhill can give us a hint.
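The termination-order constraint can be modeled with a small stand-in double (this is not rdkafka-ruby's real implementation, just an illustration of the rule the librdkafka docs describe): close the client before destroying the handle, and never touch the handle after destroy.

```ruby
# Stand-in client that records its shutdown steps and raises on
# use-after-destroy, the failure mode a mis-ordered finalizer would trigger.
class StubClient
  attr_reader :log

  def initialize
    @log = []
    @destroyed = false
  end

  def close
    raise "already destroyed" if @destroyed
    @log << :close
  end

  def destroy
    @log << :destroy
    @destroyed = true
  end

  # The documented order: close first, destroy last
  def shutdown
    close
    destroy
  end
end
```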
@thijsc is this specific to the 1.9.2 librdkafka branch? I'm not able to repro with either main (1.9.0) or the 1.9.2 branch, on a M1 Mac at least. I'll try in a Linux env as well.
- Your extracted backtrace follows a fork operation and a GC finalizer (probably this one). Does it happen in other contexts too? Do you have more Ruby crash dumps available to inspect?
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_cfp_consistent_p+0x0) [0x7f3519b73c74] vm_insnhelper.c:3037
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_call_cfunc_with_frame) vm_insnhelper.c:3039
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_sendish+0xae) [0x7f3519b81392] vm_insnhelper.c:4751
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_exec_core) insns.def:778
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_vm_exec+0x108) [0x7f3519b866c8] vm.c:2211
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_vm_invoke_proc+0x3c) [0x7f3519b8a1bc] vm.c:1521
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_call0_body+0x746) [0x7f3519b8a946] vm_eval.c:273
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_vm_call0+0xae) [0x7f3519b8aeae] vm_eval.c:57
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_check_funcall_default_kw+0x1f2) [0x7f3519b8c022] vm_eval.c:690
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(run_finalizer+0x171) [0x7f35199c59d1] gc.c:3958
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(run_final+0x3b) [0x7f35199cc84b] gc.c:4020
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_multi_ractor_p) gc.c:4034
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_vm_lock_enter) vm_sync.h:74
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(finalize_list) gc.c:4036
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rbimpl_atomic_size_exchange+0x0) [0x7f35199d3198] gc.c:4066
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rbimpl_atomic_value_exchange) ./include/ruby/atomic.h:745
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(finalize_deferred) gc.c:4065
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_objspace_call_finalizer) gc.c:4123
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_ec_cleanup+0x2f2) [0x7f35199b3fd2] eval.c:164
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(ruby_stop+0x9) [0x7f35199b4179] eval.c:291
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_f_fork+0xae) [0x7f3519a98b5e] process.c:4407
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_cfp_consistent_p+0x0) [0x7f3519b73c74] vm_insnhelper.c:3037
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_call_cfunc_with_frame) vm_insnhelper.c:3039
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_call_method_each_type+0xae) [0x7f3519b8f5be] vm_insnhelper.c:3639
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_call_method+0xeb) [0x7f3519b8fdfb] vm_insnhelper.c:3750
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_sendish+0xa7) [0x7f3519b82e85] vm_insnhelper.c:4751
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_exec_core) insns.def:759
- Can also enable GC.stress = true, but it hasn't been helpful in the producer integration type tests as there are timeouts everywhere :smile:
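GC.stress forces a collection at every allocation opportunity, which tends to surface finalizer-ordering bugs much sooner but at a large runtime cost, hence the timeouts mentioned above. A scoped helper (illustrative, not from this repo) keeps the cost contained to the code under suspicion:

```ruby
# Run a block with GC stress mode on, restoring the previous setting even if
# the block raises.
def with_gc_stress
  GC.stress = true
  yield
ensure
  GC.stress = false
end
```

Usage would look like `with_gc_stress { producer.produce(...) }` around only the suspect calls rather than the whole suite.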
This has been happening on other versions too. I haven't seen it crash on Mac, it might be Linux specific.
And only ever in the context of a fork?
@methodmissing
Does it happen in other contexts too?
Around finalization / closing of consumer/producer mostly and on Linux.
I haven't been able to crash the build again with the changes in #211. I've merged it, it's def better than before at least. I've pushed 0.13.0.beta.1 with all the latest changes. @mensfeld could you run this one through your integration test?
I could, but you changed so many things that now my patches don't work :sweat:
Oh which changes broke things?
Ah probably the client refactor. Sorry about the hassle.
It's ok. The fix was simple. I will show you the PR once I roll it out. So far with minor patch my specs are working well (though still running).
The core dump is picky with exact symbols (compiled 3.1.2 from source, no cigar). Could repro locally on a Linux laptop with this in the project root:
- gdb -q `which ruby`
- (gdb) run `which bundle` exec rspec
raises an exception if produce is called
[New Thread 0x7ffeccff9700 (LWP 115058)]
[New Thread 0x7ffecdffb700 (LWP 115059)]
[New Thread 0x7ffecd7fa700 (LWP 115060)]
[Thread 0x7ffecd7fa700 (LWP 115060) exited]
[Thread 0x7ffecdffb700 (LWP 115059) exited]
[Thread 0x7ffeccff9700 (LWP 115058) exited]
[New Thread 0x7ffeccff9700 (LWP 115061)]
[New Thread 0x7ffecdffb700 (LWP 115062)]
[New Thread 0x7ffecd7fa700 (LWP 115063)]
[New Thread 0x7ffec7fef700 (LWP 115064)]
[Thread 0x7ffec7fef700 (LWP 115064) exited]
[Thread 0x7ffecd7fa700 (LWP 115063) exited]
[Thread 0x7ffecdffb700 (LWP 115062) exited]
[Thread 0x7ffeccff9700 (LWP 115061) exited]
raises an exception if partition_count is called
Pending: (Failures listed here are expected and do not affect your suite's status)
1) Rdkafka::Config configuration should allow configuring zstd compression
# Zstd compression not supported on this machine
Failure/Error: raise ConfigError.new(error_buffer.read_string)
Rdkafka::Config::ConfigError:
Unsupported value "zstd" for configuration property "compression.codec": libzstd not available at build time
# ./lib/rdkafka/config.rb:228:in `block (2 levels) in native_config'
# ./lib/rdkafka/config.rb:218:in `each'
# ./lib/rdkafka/config.rb:218:in `block in native_config'
# ./lib/rdkafka/config.rb:216:in `native_config'
# ./lib/rdkafka/config.rb:179:in `producer'
# ./spec/rdkafka/config_spec.rb:168:in `block (3 levels) in <top (required)>'
# ./spec/spec_helper.rb:138:in `block (3 levels) in <top (required)>'
# ./spec/spec_helper.rb:137:in `block (2 levels) in <top (required)>'
Finished in 3 minutes 51.6 seconds (files took 0.5495 seconds to load)
246 examples, 0 failures, 1 pending
Coverage report generated for RSpec to /home/lourens/src/rdkafka-ruby/coverage. 993 / 998 LOC (99.5%) covered.
--Type <RET> for more, q to quit, c to continue without paging--
Thread 621 "rdk:broker1" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff7effd700 (LWP 114429)]
0x00007ffff39601e6 in ?? ()
(gdb)
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff7b4eb80 (LWP 106050) "ruby" _dl_close_worker (map=map@entry=0x555556b627a0, force=force@entry=false) at dl-close.c:815
2 Thread 0x7ffff3571700 (LWP 106086) "FFI Callback Di" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff3570e50, clockid=<optimized out>, expected=0, futex_word=0x7ffff3570e88) at ../sysdeps/nptl/futex-internal.h:320
3 Thread 0x7ffff3370700 (LWP 106087) "config.rb:21" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff336fe50, clockid=<optimized out>, expected=0, futex_word=0x7ffff336fe88) at ../sysdeps/nptl/futex-internal.h:320
8 Thread 0x7ffff10aa700 (LWP 106114) "" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff10a9e50, clockid=<optimized out>, expected=0, futex_word=0x7ffff10a9e88) at ../sysdeps/nptl/futex-internal.h:320
9 Thread 0x7ffff0e6d700 (LWP 106120) "native_kafka.r*" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff0e6ce50, clockid=<optimized out>, expected=0, futex_word=0x7ffff0e6ce88) at ../sysdeps/nptl/futex-internal.h:320
14 Thread 0x7ffff0c6c700 (LWP 106130) "native_kafka.r*" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff0c6be50, clockid=<optimized out>, expected=0, futex_word=0x7ffff0c6be88) at ../sysdeps/nptl/futex-internal.h:320
70 Thread 0x7ffff18ab700 (LWP 106267) "rdk:main" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff18aacb0, clockid=<optimized out>, expected=0, futex_word=0x555556e919d0) at ../sysdeps/nptl/futex-internal.h:320
71 Thread 0x7ffff30ae700 (LWP 106268) "rdk:broker-1" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff30ad890, clockid=<optimized out>, expected=0, futex_word=0x555556e5c680) at ../sysdeps/nptl/futex-internal.h:320
72 Thread 0x7ffff20ac700 (LWP 106269) "rdk:main" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff20abcb0, clockid=<optimized out>, expected=0, futex_word=0x555556f854a0) at ../sysdeps/nptl/futex-internal.h:320
73 Thread 0x7ffff28ad700 (LWP 106270) "rdk:broker-1" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff28ac890, clockid=<optimized out>, expected=0, futex_word=0x555556fc0f80) at ../sysdeps/nptl/futex-internal.h:320
487 Thread 0x7fffa77fe700 (LWP 112631) "rdk:broker-1" 0x00007ffff7c629cf in __GI___poll (fds=0x7fff88005058, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
488 Thread 0x7fffa6ffd700 (LWP 112632) "rdk:main" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffa6ffccb0, clockid=<optimized out>, expected=0, futex_word=0x555556e34670) at ../sysdeps/nptl/futex-internal.h:320
489 Thread 0x7fffa67fc700 (LWP 112633) "rdk:broker-1" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffa67fb890, clockid=<optimized out>, expected=0, futex_word=0x555556f660e0) at ../sysdeps/nptl/futex-internal.h:320
500 Thread 0x7fffa57fa700 (LWP 112655) "rdk:broker-1" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffa57f9890, clockid=<optimized out>, expected=0, futex_word=0x555556f0bfd0) at ../sysdeps/nptl/futex-internal.h:320
501 Thread 0x7fff7f7fe700 (LWP 112656) "rdk:broker1" 0x00007ffff7c629cf in __GI___poll (fds=0x7fff70005058, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
619 Thread 0x7fff7e7fc700 (LWP 114427) "rdk:main" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fff7e7fbcb0, clockid=<optimized out>, expected=0, futex_word=0x555557113690) at ../sysdeps/nptl/futex-internal.h:320
620 Thread 0x7fff7dffb700 (LWP 114428) "rdk:broker-1" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fff7dffa890, clockid=<optimized out>, expected=0, futex_word=0x5555570f7fc0) at ../sysdeps/nptl/futex-internal.h:320
* 621 Thread 0x7fff7effd700 (LWP 114429) "rdk:broker1" 0x00007ffff39601e6 in ?? ()
Compiling librdkafka with debug symbols and rerunning.
GC finalizer context on the Ruby side:
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7b4eb80 (LWP 106050))]
#0 _dl_close_worker (map=map@entry=0x555556b627a0, force=force@entry=false) at dl-close.c:815
815 dl-close.c: No such file or directory.
(gdb) bt
#0 _dl_close_worker (map=map@entry=0x555556b627a0, force=force@entry=false) at dl-close.c:815
#1 0x00007ffff7fe6ef5 in _dl_close_worker (force=false, map=0x555556b627a0) at dl-close.c:145
#2 _dl_close (_map=0x555556b627a0) at dl-close.c:859
#3 0x00007ffff7cb0838 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffd910, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:208
#4 0x00007ffff7cb0903 in __GI__dl_catch_error (objname=0x555555a663d0, errstring=0x555555a663d8, mallocedp=0x555555a663c8, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:227
#5 0x00007ffff7ecfb59 in _dlerror_run (operate=operate@entry=0x7ffff7ecf420 <dlclose_doit>, args=0x555556b627a0) at dlerror.c:170
#6 0x00007ffff7ecf468 in __dlclose (handle=<optimized out>) at dlclose.c:46
#7 0x00007ffff3bb5385 in library_free (library=0x555556b5a4d0) at DynamicLibrary.c:171 <<<<<<<<<<<<
#8 0x00005555555a4a1d in run_final (zombie=140737281867360, objspace=0x5555559a2040) at gc.c:4015 <<<<<<<<<
#9 finalize_list (objspace=0x5555559a2040, zombie=140737281867360) at gc.c:4034
#10 0x00005555556e1ac0 in ?? () at ./include/ruby/internal/arithmetic/int.h:169
#11 0x00007fffffffd9a4 in ?? ()
#12 0x0000000000000001 in ?? ()
#13 0x00007ffff3b2ce60 in ?? ()
#14 0x0000000000000000 in ?? ()
Another variation:
* 702 Thread 0x7ffecc7f8700 (LWP 190988) "rdk:broker-1" 0x00007ffff39601e6 in ?? ()
703 Thread 0x7ffecbff7700 (LWP 190989) "rdk:main" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffecbff6cb0, clockid=<optimized out>, expected=0, futex_word=0x55555731aa94) at ../sysdeps/nptl/futex-internal.h:320
704 Thread 0x7ffecb7f6700 (LWP 190990) "rdk:broker-1" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffecb7f5890, clockid=<optimized out>, expected=0, futex_word=0x555557320090) at ../sysdeps/nptl/futex-internal.h:320
705 Thread 0x7ffecaff5700 (LWP 190991) "rdk:broker1" 0x00007ffff7c629cf in __GI___poll (fds=0x7fff68005e28, nfds=2, timeout=759) at ../sysdeps/unix/sysv/linux/poll.c:29
713 Thread 0x7ffeca7f4700 (LWP 191002) "rdk:broker-1" 0x00007ffff7c629cf in __GI___poll (fds=0x7fff7000b798, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
714 Thread 0x7ffec9ff3700 (LWP 191003) "rdk:main" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffec9ff2cb0, clockid=<optimized out>, expected=0, futex_word=0x5555572ba120) at ../sysdeps/nptl/futex-internal.h:320
715 Thread 0x7ffec97f2700 (LWP 191004) "rdk:broker-1" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffec97f1890, clockid=<optimized out>, expected=0, futex_word=0x555556f39c20) at ../sysdeps/nptl/futex-internal.h:320
716 Thread 0x7ffec8ff1700 (LWP 191005) "rdk:broker1" 0x00007ffff7c629cf in __GI___poll (fds=0x7fff88008e98, nfds=2, timeout=851) at ../sysdeps/unix/sysv/linux/poll.c:29
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7b4eb80 (LWP 190230))]
#0 0x00007ffff7c689ab in munmap () at ../sysdeps/unix/syscall-template.S:78
78 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0 0x00007ffff7c689ab in munmap () at ../sysdeps/unix/syscall-template.S:78
#1 0x000055555559f7bd in rb_aligned_free (size=16384, ptr=<optimized out>) at gc.c:11207
#2 heap_page_free (objspace=0x5555559a1040, objspace=0x5555559a1040, page=0x555556a591d0) at gc.c:1912
#3 rb_objspace_free (objspace=objspace@entry=0x5555559a1040) at gc.c:1764
#4 0x0000555555785618 in ruby_vm_destruct (vm=0x55555599e9d0) at vm.c:2722
#5 0x0000555555587c55 in rb_ec_cleanup (ec=ec@entry=0x5555559a17e0, ex0=<optimized out>) at eval.c:264
#6 0x00005555555884c9 in ruby_run_node (n=0x7ffff3f98a58) at eval.c:321
#7 0x000055555558304f in main (argc=<optimized out>, argv=<optimized out>) at ./main.c:47
I'm trying some stuff here: #223
Just noticed config.rb:21 is in one of the traces. I'm wondering if the change in #97 could be related.
After reverting #97 and trying that branch in our staging I still see a segfault on exit every now and then, so it has to be something else I guess.
Some progress: with this spec it segfaults on every build, instead of only sometimes.
Once in a while I see this as well:
1851dd080809 (stopping): ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_cond_lock: Assertion `mutex->__data.__owner == 0' failed.
ref https://github.com/karafka/karafka/actions/runs/3313474174/jobs/5471508194
Useful to know @mensfeld. I tried skipping the destroy, still segfaulting then. I'm sort of at my wit's end here; it seems like we're doing everything right but it still happens.
I will look into this as well at the end of next week. We can also ask @ioquatix for help - he helped me several times with C/FFI/GC debugging.
That'd be great! We could def use help from an expert.