go-ceph
go-ceph copied to clipboard
TestPingMonitor crashes
TestPingMonitor has a race condition that often causes a segmentation fault.
First I thought this is again because of our connection finalizer, but disabling the finalizer does not make that issue disappear.
(gdb) bt
#0 runtime.raise () at /opt/go/src/runtime/sys_linux_amd64.s:154
#1 0x00000000004585db in runtime.raisebadsignal (sig=11, c=0x7f5c0e5e5f90)
at /opt/go/src/runtime/signal_unix.go:967
#2 0x0000000000458a27 in runtime.badsignal (sig=11, c=0x7f5c0e5e5f90)
at /opt/go/src/runtime/signal_unix.go:1076
#3 0x0000000000457368 in runtime.sigtrampgo (sig=11, info=0x7f5c0e5e6130,
ctx=0x7f5c0e5e6000) at /opt/go/src/runtime/signal_unix.go:468
#4 0x0000000000478a66 in runtime.sigtramp ()
at /opt/go/src/runtime/sys_linux_amd64.s:352
#5 <signal handler called>
#6 MonConnection::get_auth_request (this=0x0, method=0x7f5c04000ed0,
preferred_modes=0x7f5c0e5e6680, bl=0x7f5c0e5e66a0, entity_name=...,
want_keys=0, keyring=0x7ffe25ab3a30)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/mon/MonClient.cc:1702
#7 0x00007f5c721375eb in non-virtual thunk to MonClientPinger::get_auth_request(Connection*, AuthConnectionMeta*, unsigned int*, std::vector<unsigned int, std::allocator<unsigned int> >*, ceph::buffer::v15_2_0::list*) ()
from /usr/lib64/ceph/libceph-common.so.2
#8 0x00007f5c720aa672 in ProtocolV2::send_auth_request (this=0x1cd2670,
allowed_methods=...)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1692
#9 0x00007f5c720aaddc in ProtocolV2::send_auth_request (this=0x1cd2670)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.h:217
#10 ProtocolV2::post_client_banner_exchange (this=0x1cd2670)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1680
#11 0x00007f5c720a462c in ProtocolV2::run_continuation (this=0x1cd2670,
continuation=...)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:47
#12 0x00007f5c7206cae9 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>,
this=0x7f5bfc031590) at /usr/include/c++/8/bits/std_function.h:682
#13 AsyncConnection::process (this=0x7f5bfc0311f0)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/AsyncConnection.cc:454
#14 0x00007f5c720c6c87 in EventCenter::process_events (
this=this@entry=0x125e370, timeout_microseconds=<optimized out>,
timeout_microseconds@entry=30000000,
working_dur=working_dur@entry=0x7f5c0e5e6ee8)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/build/boost/include/boost/container/new_allocator.hpp:165
#15 0x00007f5c720cd19c in NetworkStack::<lambda()>::operator() (
__closure=0x125b158, __closure=0x125b158)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/Stack.cc:53
#16 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
at /usr/include/c++/8/bits/std_function.h:297
#17 0x00007f5c703fab23 in execute_native_thread_routine ()
from /lib64/libstdc++.so.6
#18 0x00007f5c7afdc1ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007f5c7aa30e73 in clone () from /lib64/libc.so.6
from CI runner, seems identical:
Core was generated by `./rados.test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 runtime.raise () at /opt/go/src/runtime/sys_linux_amd64.s:154
154 RET
[Current thread is 1 (Thread 0x7ff5aeffe700 (LWP 6184))]
To enable execution of this file add
add-auto-load-safe-path /opt/go/src/runtime/runtime-gdb.py
line to your configuration file "/root/.gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
#0 runtime.raise () at /opt/go/src/runtime/sys_linux_amd64.s:154
#1 0x00000000004585db in runtime.raisebadsignal (sig=11, c=0x7ff5aeffa850)
at /opt/go/src/runtime/signal_unix.go:967
#2 0x0000000000458a27 in runtime.badsignal (sig=11, c=0x7ff5aeffa850)
at /opt/go/src/runtime/signal_unix.go:1076
#3 0x0000000000457368 in runtime.sigtrampgo (sig=11, info=0x7ff5aeffa9f0,
ctx=0x7ff5aeffa8c0) at /opt/go/src/runtime/signal_unix.go:468
#4 0x0000000000478a66 in runtime.sigtramp ()
at /opt/go/src/runtime/sys_linux_amd64.s:352
#5 <signal handler called>
#6 MonConnection::get_auth_request (this=0x0, method=0x17257a0,
preferred_modes=0x7ff5aeffb680, bl=0x7ff5aeffb6a0, entity_name=...,
want_keys=0, keyring=0x7ffd722b3130)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/mon/MonClient.cc:1702
#7 0x00007ff6147305eb in non-virtual thunk to MonClientPinger::get_auth_request(Connection*, AuthConnectionMeta*, unsigned int*, std::vector<unsigned int, std::allocator<unsigned int> >*, ceph::buffer::v15_2_0::list*) ()
from /usr/lib64/ceph/libceph-common.so.2
#8 0x00007ff6146a3672 in ProtocolV2::send_auth_request (this=0x2264020,
allowed_methods=...)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1692
#9 0x00007ff6146a3ddc in ProtocolV2::send_auth_request (this=0x2264020)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.h:217
#10 ProtocolV2::post_client_banner_exchange (this=0x2264020)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:1680
#11 0x00007ff61469d62c in ProtocolV2::run_continuation (this=0x2264020,
continuation=...)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:47
#12 0x00007ff614665ae9 in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>,
this=0x2261fd0) at /usr/include/c++/8/bits/std_function.h:682
#13 AsyncConnection::process (this=0x2261c30)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/AsyncConnection.cc:454
#14 0x00007ff6146bfc87 in EventCenter::process_events (
this=this@entry=0x17f15e0, timeout_microseconds=<optimized out>,
timeout_microseconds@entry=30000000,
working_dur=working_dur@entry=0x7ff5aeffbee8)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/build/boost/include/boost/container/new_allocator.hpp:165
#15 0x00007ff6146c619c in NetworkStack::<lambda()>::operator() (
__closure=0x171f938, __closure=0x171f938)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/Stack.cc:53
#16 std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
at /usr/include/c++/8/bits/std_function.h:297
#17 0x00007ff6129f3b23 in execute_native_thread_routine ()
from /lib64/libstdc++.so.6
#18 0x00007ff61d5d51ca in start_thread () from /lib64/libpthread.so.0
#19 0x00007ff61d029e73 in clone () from /lib64/libc.so.6
I could reproduce it with a pure C loop like this:
inline int test_ping(rados_t c) {
char* outstr;
size_t outlen;
int ret;
for (int i=0; i<10000; ++i) {
printf("X%d ", i);
ret = rados_ping_monitor(c, "a", &outstr, &outlen);
rados_buffer_free(outstr);
}
return ret;
}
which I guess is enough proof, that this is an upstream issue.
The MonConnection pointer (mc) in the MonClientPinger object, that comes in as an event, is 0x0. It's unclear if it is not yet set, or has been reset to 0x0:
(gdb) fr 11
#11 0x00007f99544cd62c in ProtocolV2::run_continuation (this=0x1e89930,
continuation=...)
at /usr/src/debug/ceph-16.2.14-0.el8.x86_64/src/msg/async/ProtocolV2.cc:47
47 CONTINUATION_RUN(continuation)
(gdb) print ((MonClientPinger*)(messenger->auth_client)).mc
$1 = std::unique_ptr<MonConnection> = {get() = 0x0}
Sounds like a good enough find to file a tracker at ceph to me! Please link it here once you've filed it.
https://tracker.ceph.com/issues/63008