rdkafka-ruby icon indicating copy to clipboard operation
rdkafka-ruby copied to clipboard

Chasing the last segfault

Open thijsc opened this issue 2 years ago • 8 comments

There is one more segfault that we sometimes see happening. It can happen when the Ruby process exits, which leads me to believe it has something to do with the termination order.

I tried to reproduce on Semaphore using:

require "rdkafka"

config = {
  :"bootstrap.servers" => "localhost:9092",
}

Rdkafka::Config.new(config).admin.create_topic('test', 16, 1)

That however does not crash. @mensfeld do you have any other examples I could try?

thijsc avatar Sep 15 '22 14:09 thijsc

There is one more segfault that we sometimes see happening.

More than one ;)

and this one crashes if you run it few times:

arafka2.0 (master)]$ bundle exec ruby test.rb 
[karafka2.0 (master)]$ bundle exec ruby test.rb 
[karafka2.0 (master)]$ bundle exec ruby test.rb 
[karafka2.0 (master)]$ bundle exec ruby test.rb 
Segmentation fault (core dumped)
[karafka2.0 (master)]$ bundle exec ruby test.rb 
Segmentation fault (core dumped)
[karafka2.0 (master)]$ 

test.rb is exactly what you have above.

I will try to give you more in the upcoming days but running this in a loop in bash crashes it.

mensfeld avatar Sep 15 '22 17:09 mensfeld

Thanks. I am trying to add finalizers for the consumer and admin classes too. That might help, this seems to have something to do with cleanup at exit.

thijsc avatar Sep 16 '22 09:09 thijsc

@thijsc jemalloc may also cause more frequent crashes.

mensfeld avatar Sep 16 '22 12:09 mensfeld

Oeh, got valgrind to print some useful stuff: https://appsignal.semaphoreci.com/jobs/163b1af4-7993-43c5-8f8b-4a5fa73d0936

thijsc avatar Sep 16 '22 14:09 thijsc

I've improved the cleanup logic in #211, let's see if stuff still segfaults after this.

thijsc avatar Sep 16 '22 15:09 thijsc

Still some stuff crashing...

thijsc avatar Sep 16 '22 15:09 thijsc

I've opened https://github.com/edenhill/librdkafka/issues/3981 to see if they have an idea what we should look for next.

thijsc avatar Sep 16 '22 15:09 thijsc

I saw ;) I keep track of changed in librdkafka on a daily basis ;)

When you are done with #211 feel free to ping me. I can run extensive integrations from karafka on it (ref: https://github.com/karafka/karafka/tree/master/spec/integrations). They heavily stress both those bindings and librdkafka.

mensfeld avatar Sep 17 '22 15:09 mensfeld

I was able to extract a crash report with some C backtrace information. Something is happening in the destroy process. I checked the termination order docs again and am quite confident we're doing it right. Hopefully @edenhill can give us a hint.

thijsc avatar Sep 30 '22 13:09 thijsc

@thijsc is this specific to the 1.9.2 librdkafka branch? I'm not able to repro with either main (1.9.0) or the 1.9.2 branch, on a M1 Mac at least. I'll try in a Linux env as well.

  • Your extracted backtrace follows a fork operation and a GC finalizer (probably this one). Does it happen in other contexts too? Do you have more Ruby crash dumps available to inspect?
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_cfp_consistent_p+0x0) [0x7f3519b73c74] vm_insnhelper.c:3037
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_call_cfunc_with_frame) vm_insnhelper.c:3039
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_sendish+0xae) [0x7f3519b81392] vm_insnhelper.c:4751
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_exec_core) insns.def:778
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_vm_exec+0x108) [0x7f3519b866c8] vm.c:2211
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_vm_invoke_proc+0x3c) [0x7f3519b8a1bc] vm.c:1521
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_call0_body+0x746) [0x7f3519b8a946] vm_eval.c:273
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_vm_call0+0xae) [0x7f3519b8aeae] vm_eval.c:57
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_check_funcall_default_kw+0x1f2) [0x7f3519b8c022] vm_eval.c:690
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(run_finalizer+0x171) [0x7f35199c59d1] gc.c:3958
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(run_final+0x3b) [0x7f35199cc84b] gc.c:4020
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_multi_ractor_p) gc.c:4034
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_vm_lock_enter) vm_sync.h:74
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(finalize_list) gc.c:4036
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rbimpl_atomic_size_exchange+0x0) [0x7f35199d3198] gc.c:4066
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rbimpl_atomic_value_exchange) ./include/ruby/atomic.h:745
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(finalize_deferred) gc.c:4065
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_objspace_call_finalizer) gc.c:4123
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_ec_cleanup+0x2f2) [0x7f35199b3fd2] eval.c:164
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(ruby_stop+0x9) [0x7f35199b4179] eval.c:291
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(rb_f_fork+0xae) [0x7f3519a98b5e] process.c:4407
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_cfp_consistent_p+0x0) [0x7f3519b73c74] vm_insnhelper.c:3037
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_call_cfunc_with_frame) vm_insnhelper.c:3039
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_call_method_each_type+0xae) [0x7f3519b8f5be] vm_insnhelper.c:3639
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_call_method+0xeb) [0x7f3519b8fdfb] vm_insnhelper.c:3750
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_sendish+0xa7) [0x7f3519b82e85] vm_insnhelper.c:4751
/home/semaphore/.rbenv/versions/3.1.2/lib/libruby.so.3.1(vm_exec_core) insns.def:759
  • Can also enable GC.stress = true, but it hasn't been helpful in the producer integration type tests as timeouts everywhere :smile:

methodmissing avatar Oct 04 '22 11:10 methodmissing

This has been happening on other versions too. I haven't seen it crash on Mac, it might be Linux specific.

thijsc avatar Oct 04 '22 11:10 thijsc

And only ever in the context of a fork?

methodmissing avatar Oct 04 '22 11:10 methodmissing

@methodmissing

Does it happen in other contexts too?

Around finalization / closing of consumer/producer mostly and on Linux.

mensfeld avatar Oct 04 '22 11:10 mensfeld

I haven't been able to crash the build again with the changes in #211. I've merged it, it's def better than before at least. I've pushed 0.13.0.beta.1 with all the latest changes. @mensfeld could you run this one through your integration test?

thijsc avatar Oct 11 '22 13:10 thijsc

I could but you changes so many things that now my patches do not work :sweat:

mensfeld avatar Oct 11 '22 14:10 mensfeld

Oh which changes broke things?

thijsc avatar Oct 11 '22 14:10 thijsc

Ah probably the client refactor. Sorry about the hassle.

thijsc avatar Oct 11 '22 14:10 thijsc

It's ok. The fix was simple. I will show you the PR once I roll it out. So far with minor patch my specs are working well (though still running).

mensfeld avatar Oct 11 '22 14:10 mensfeld

After much re-triggering I've managed to get a core dump from CI:

core.zip

thijsc avatar Oct 11 '22 19:10 thijsc

The core dump is picky with exact symbols (compiled 3.1.2 from source, no cigar). Could repro local on a Linux laptop with this in the project root:

  • gdb -q which ruby``
  • (gdb) run which bundle exec rspec
    raises an exception if produce is called
[New Thread 0x7ffeccff9700 (LWP 115058)]
[New Thread 0x7ffecdffb700 (LWP 115059)]
[New Thread 0x7ffecd7fa700 (LWP 115060)]
[Thread 0x7ffecd7fa700 (LWP 115060) exited]
[Thread 0x7ffecdffb700 (LWP 115059) exited]
[Thread 0x7ffeccff9700 (LWP 115058) exited]
[New Thread 0x7ffeccff9700 (LWP 115061)]
[New Thread 0x7ffecdffb700 (LWP 115062)]
[New Thread 0x7ffecd7fa700 (LWP 115063)]
[New Thread 0x7ffec7fef700 (LWP 115064)]
[Thread 0x7ffec7fef700 (LWP 115064) exited]
[Thread 0x7ffecd7fa700 (LWP 115063) exited]
[Thread 0x7ffecdffb700 (LWP 115062) exited]
[Thread 0x7ffeccff9700 (LWP 115061) exited]
    raises an exception if partition_count is called

Pending: (Failures listed here are expected and do not affect your suite's status)

  1) Rdkafka::Config configuration should allow configuring zstd compression
     # Zstd compression not supported on this machine
     Failure/Error: raise ConfigError.new(error_buffer.read_string)
     
     Rdkafka::Config::ConfigError:
       Unsupported value "zstd" for configuration property "compression.codec": libzstd not available at build time
     # ./lib/rdkafka/config.rb:228:in `block (2 levels) in native_config'
     # ./lib/rdkafka/config.rb:218:in `each'
     # ./lib/rdkafka/config.rb:218:in `block in native_config'
     # ./lib/rdkafka/config.rb:216:in `native_config'
     # ./lib/rdkafka/config.rb:179:in `producer'
     # ./spec/rdkafka/config_spec.rb:168:in `block (3 levels) in <top (required)>'
     # ./spec/spec_helper.rb:138:in `block (3 levels) in <top (required)>'
     # ./spec/spec_helper.rb:137:in `block (2 levels) in <top (required)>'

Finished in 3 minutes 51.6 seconds (files took 0.5495 seconds to load)
246 examples, 0 failures, 1 pending

Coverage report generated for RSpec to /home/lourens/src/rdkafka-ruby/coverage. 993 / 998 LOC (99.5%) covered.
--Type <RET> for more, q to quit, c to continue without paging--

Thread 621 "rdk:broker1" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff7effd700 (LWP 114429)]
0x00007ffff39601e6 in ?? ()
(gdb) 
(gdb) info threads
  Id   Target Id                                            Frame 
  1    Thread 0x7ffff7b4eb80 (LWP 106050) "ruby"            _dl_close_worker (map=map@entry=0x555556b627a0, force=force@entry=false) at dl-close.c:815
  2    Thread 0x7ffff3571700 (LWP 106086) "FFI Callback Di" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff3570e50, clockid=<optimized out>, expected=0, futex_word=0x7ffff3570e88) at ../sysdeps/nptl/futex-internal.h:320
  3    Thread 0x7ffff3370700 (LWP 106087) "config.rb:21"    futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff336fe50, clockid=<optimized out>, expected=0, futex_word=0x7ffff336fe88) at ../sysdeps/nptl/futex-internal.h:320
  8    Thread 0x7ffff10aa700 (LWP 106114) ""                futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff10a9e50, clockid=<optimized out>, expected=0, futex_word=0x7ffff10a9e88) at ../sysdeps/nptl/futex-internal.h:320
  9    Thread 0x7ffff0e6d700 (LWP 106120) "native_kafka.r*" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff0e6ce50, clockid=<optimized out>, expected=0, futex_word=0x7ffff0e6ce88) at ../sysdeps/nptl/futex-internal.h:320
  14   Thread 0x7ffff0c6c700 (LWP 106130) "native_kafka.r*" futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff0c6be50, clockid=<optimized out>, expected=0, futex_word=0x7ffff0c6be88) at ../sysdeps/nptl/futex-internal.h:320
  70   Thread 0x7ffff18ab700 (LWP 106267) "rdk:main"        futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff18aacb0, clockid=<optimized out>, expected=0, futex_word=0x555556e919d0) at ../sysdeps/nptl/futex-internal.h:320
  71   Thread 0x7ffff30ae700 (LWP 106268) "rdk:broker-1"    futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff30ad890, clockid=<optimized out>, expected=0, futex_word=0x555556e5c680) at ../sysdeps/nptl/futex-internal.h:320
  72   Thread 0x7ffff20ac700 (LWP 106269) "rdk:main"        futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff20abcb0, clockid=<optimized out>, expected=0, futex_word=0x555556f854a0) at ../sysdeps/nptl/futex-internal.h:320
  73   Thread 0x7ffff28ad700 (LWP 106270) "rdk:broker-1"    futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffff28ac890, clockid=<optimized out>, expected=0, futex_word=0x555556fc0f80) at ../sysdeps/nptl/futex-internal.h:320
  487  Thread 0x7fffa77fe700 (LWP 112631) "rdk:broker-1"    0x00007ffff7c629cf in __GI___poll (fds=0x7fff88005058, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  488  Thread 0x7fffa6ffd700 (LWP 112632) "rdk:main"        futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffa6ffccb0, clockid=<optimized out>, expected=0, futex_word=0x555556e34670) at ../sysdeps/nptl/futex-internal.h:320
  489  Thread 0x7fffa67fc700 (LWP 112633) "rdk:broker-1"    futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffa67fb890, clockid=<optimized out>, expected=0, futex_word=0x555556f660e0) at ../sysdeps/nptl/futex-internal.h:320
  500  Thread 0x7fffa57fa700 (LWP 112655) "rdk:broker-1"    futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffa57f9890, clockid=<optimized out>, expected=0, futex_word=0x555556f0bfd0) at ../sysdeps/nptl/futex-internal.h:320
  501  Thread 0x7fff7f7fe700 (LWP 112656) "rdk:broker1"     0x00007ffff7c629cf in __GI___poll (fds=0x7fff70005058, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  619  Thread 0x7fff7e7fc700 (LWP 114427) "rdk:main"        futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fff7e7fbcb0, clockid=<optimized out>, expected=0, futex_word=0x555557113690) at ../sysdeps/nptl/futex-internal.h:320
  620  Thread 0x7fff7dffb700 (LWP 114428) "rdk:broker-1"    futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fff7dffa890, clockid=<optimized out>, expected=0, futex_word=0x5555570f7fc0) at ../sysdeps/nptl/futex-internal.h:320
* 621  Thread 0x7fff7effd700 (LWP 114429) "rdk:broker1"     0x00007ffff39601e6 in ?? ()

compiling librdkafka with debug symbols and rerunning

methodmissing avatar Oct 11 '22 21:10 methodmissing

GC finalizer context on the Ruby side:

(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7b4eb80 (LWP 106050))]
#0  _dl_close_worker (map=map@entry=0x555556b627a0, force=force@entry=false) at dl-close.c:815
815	dl-close.c: No such file or directory.
(gdb) bt
#0  _dl_close_worker (map=map@entry=0x555556b627a0, force=force@entry=false) at dl-close.c:815
#1  0x00007ffff7fe6ef5 in _dl_close_worker (force=false, map=0x555556b627a0) at dl-close.c:145
#2  _dl_close (_map=0x555556b627a0) at dl-close.c:859
#3  0x00007ffff7cb0838 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffd910, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:208
#4  0x00007ffff7cb0903 in __GI__dl_catch_error (objname=0x555555a663d0, errstring=0x555555a663d8, mallocedp=0x555555a663c8, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:227
#5  0x00007ffff7ecfb59 in _dlerror_run (operate=operate@entry=0x7ffff7ecf420 <dlclose_doit>, args=0x555556b627a0) at dlerror.c:170
#6  0x00007ffff7ecf468 in __dlclose (handle=<optimized out>) at dlclose.c:46
#7  0x00007ffff3bb5385 in library_free (library=0x555556b5a4d0) at DynamicLibrary.c:171 <<<<<<<<<<<<
#8  0x00005555555a4a1d in run_final (zombie=140737281867360, objspace=0x5555559a2040) at gc.c:4015 <<<<<<<<<
#9  finalize_list (objspace=0x5555559a2040, zombie=140737281867360) at gc.c:4034
#10 0x00005555556e1ac0 in ?? () at ./include/ruby/internal/arithmetic/int.h:169
#11 0x00007fffffffd9a4 in ?? ()
#12 0x0000000000000001 in ?? ()
#13 0x00007ffff3b2ce60 in ?? ()
#14 0x0000000000000000 in ?? ()

methodmissing avatar Oct 11 '22 21:10 methodmissing

Another variation:

* 702  Thread 0x7ffecc7f8700 (LWP 190988) "rdk:broker-1"    0x00007ffff39601e6 in ?? ()
  703  Thread 0x7ffecbff7700 (LWP 190989) "rdk:main"        futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffecbff6cb0, clockid=<optimized out>, expected=0, futex_word=0x55555731aa94) at ../sysdeps/nptl/futex-internal.h:320
  704  Thread 0x7ffecb7f6700 (LWP 190990) "rdk:broker-1"    futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffecb7f5890, clockid=<optimized out>, expected=0, futex_word=0x555557320090) at ../sysdeps/nptl/futex-internal.h:320
  705  Thread 0x7ffecaff5700 (LWP 190991) "rdk:broker1"     0x00007ffff7c629cf in __GI___poll (fds=0x7fff68005e28, nfds=2, timeout=759) at ../sysdeps/unix/sysv/linux/poll.c:29
  713  Thread 0x7ffeca7f4700 (LWP 191002) "rdk:broker-1"    0x00007ffff7c629cf in __GI___poll (fds=0x7fff7000b798, nfds=2, timeout=1000) at ../sysdeps/unix/sysv/linux/poll.c:29
  714  Thread 0x7ffec9ff3700 (LWP 191003) "rdk:main"        futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffec9ff2cb0, clockid=<optimized out>, expected=0, futex_word=0x5555572ba120) at ../sysdeps/nptl/futex-internal.h:320
  715  Thread 0x7ffec97f2700 (LWP 191004) "rdk:broker-1"    futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7ffec97f1890, clockid=<optimized out>, expected=0, futex_word=0x555556f39c20) at ../sysdeps/nptl/futex-internal.h:320
  716  Thread 0x7ffec8ff1700 (LWP 191005) "rdk:broker1"     0x00007ffff7c629cf in __GI___poll (fds=0x7fff88008e98, nfds=2, timeout=851) at ../sysdeps/unix/sysv/linux/poll.c:29
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7b4eb80 (LWP 190230))]
#0  0x00007ffff7c689ab in munmap () at ../sysdeps/unix/syscall-template.S:78
78	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ffff7c689ab in munmap () at ../sysdeps/unix/syscall-template.S:78
#1  0x000055555559f7bd in rb_aligned_free (size=16384, ptr=<optimized out>) at gc.c:11207
#2  heap_page_free (objspace=0x5555559a1040, objspace=0x5555559a1040, page=0x555556a591d0) at gc.c:1912
#3  rb_objspace_free (objspace=objspace@entry=0x5555559a1040) at gc.c:1764
#4  0x0000555555785618 in ruby_vm_destruct (vm=0x55555599e9d0) at vm.c:2722
#5  0x0000555555587c55 in rb_ec_cleanup (ec=ec@entry=0x5555559a17e0, ex0=<optimized out>) at eval.c:264
#6  0x00005555555884c9 in ruby_run_node (n=0x7ffff3f98a58) at eval.c:321
#7  0x000055555558304f in main (argc=<optimized out>, argv=<optimized out>) at ./main.c:47

methodmissing avatar Oct 12 '22 10:10 methodmissing

I'm trying some stuff here: #223

thijsc avatar Oct 23 '22 14:10 thijsc

Just noticed config.rb:21 is in one of the traces. I'm wondering if the change in #97 could be related.

thijsc avatar Oct 23 '22 19:10 thijsc

After reverting #97 and trying that branch in our staging I still see a segfault on exit every now and then, so has to be something else I guess.

thijsc avatar Oct 23 '22 20:10 thijsc

Some progress, with this spec it segfaults every build, instead of sometimes.

thijsc avatar Oct 24 '22 13:10 thijsc

Once in a while I see this as well:

 1851dd080809 (stopping): ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_cond_lock: Assertion `mutex->__data.__owner == 0' failed.

ref https://github.com/karafka/karafka/actions/runs/3313474174/jobs/5471508194

mensfeld avatar Oct 24 '22 14:10 mensfeld

Useful to know @mensfeld. I tried skipping the destroy, still segfaulting then. I'm sort of at my wits end here, it seems like we're doing everything right but it still happens.

thijsc avatar Oct 24 '22 14:10 thijsc

I will look into this as well at the end of next week. We can also ask @ioquatix for help - he helped me several times with C/FFI/GC debugging.

mensfeld avatar Oct 24 '22 14:10 mensfeld

That'd be great! We could def use help from an expert.

thijsc avatar Oct 24 '22 14:10 thijsc