Deadlock between brpc::Acceptor::StartAccept and brpc::Acceptor::BeforeRecycle

Open weingithub opened this issue 3 years ago • 4 comments

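Describe the bug: After creating a child process and calling the start_brpc_server interface in the child, brpc::Acceptor::StartAccept and brpc::Acceptor::BeforeRecycle deadlock against each other, and a curl request to the listening port hangs. See the stack trace below for details.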

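To Reproduce:
1. Create a child process, then call the start_brpc_server interface in the child.
2. Kill the child process. The parent process has a monitoring thread that, on detecting that the child has died, launches the child again (after which step 1 repeats).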
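For readers who want to picture the restart loop in the steps above, here is a minimal sketch using plain fork/waitpid. It is an assumption about the general shape of such a supervisor, not the reporter's code: `child_main` and `supervise` are hypothetical names, and the child simply exits instead of starting a brpc server.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical placeholder for the child's work. In the report this is where
// start_brpc_server would be called; here the child exits right away to
// simulate being killed.
static void child_main() {
    _exit(1);
}

// Forks a child, waits for it to die, and relaunches it until max_restarts
// deaths have been observed. Returns that count, or -1 if fork/waitpid fails.
int supervise(int max_restarts) {
    int deaths = 0;
    while (deaths < max_restarts) {
        pid_t pid = fork();
        if (pid < 0) return -1;
        if (pid == 0) {
            child_main();
            _exit(0);  // not reached in this sketch
        }
        int status = 0;
        pid_t waited;
        do {
            waited = waitpid(pid, &status, 0);  // retry if interrupted by a signal
        } while (waited < 0 && errno == EINTR);
        if (waited != pid) return -1;
        ++deaths;  // child died: loop around and launch it again
    }
    return deaths;
}
```

Calling `supervise(3)` launches the child three times in sequence, each launch following the previous child's death.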

Expected behavior: After the child process starts, the port should be listened on normally.

Versions:
OS: a custom system based on Linux kernel 3.10.0
Compiler: gcc 4.7
brpc: a version forked in 2019
protobuf:

Additional context/screenshots:

    (gdb) bt
    #0  0x00007fdcb176042d in __lll_lock_wait () from /lib64/libpthread.so.0
    #1  0x00007fdcb175bdcb in _L_lock_812 () from /lib64/libpthread.so.0
    #2  0x00007fdcb175bc98 in pthread_mutex_lock () from /lib64/libpthread.so.0
    #3  0x00007fdcafe9650f in lock (this=0x55920555bc90) at /test/src/brpc/src/butil/synchronization/lock.h:55
    #4  lock_guard (__m=..., this=) at /usr/include/c++/4.8.2/mutex:414
    #5  brpc::Acceptor::BeforeRecycle (this=0x55920555bc20, sock=0x55920ac626c0) at /test/src/brpc/src/brpc/acceptor.cpp:325
    #6  0x00007fdcafebc4ea in brpc::Socket::OnRecycle (this=0x55920ac626c0) at /test/src/brpc/src/brpc/socket.cpp:1015
    #7  0x00007fdcafebccad in Dereference (this=0x238) at /test/src/brpc/src/brpc/socket_inl.h:110
    #8  brpc::Socket::ReleaseAdditionalReference (this=this@entry=0x55920ac626c0) at /test/src/brpc/src/brpc/socket.cpp:783
    #9  0x00007fdcafebd1ee in brpc::Socket::SetFailed (this=this@entry=0x55920ac626c0, error_code=error_code@entry=9, error_fmt=error_fmt@entry=0x7fdcb00c66c0 "Fail to ResetFileDescriptor: %s") at /test/src/brpc/src/brpc/socket.cpp:848
    #10 0x00007fdcafebdcbd in brpc::Socket::Create (options=..., id=id@entry=0x55920555bc88) at /test/src/brpc/src/brpc/socket.cpp:667
    #11 0x00007fdcafe96a30 in brpc::Acceptor::StartAccept (this=0x55920555bc20, listened_fd=listened_fd@entry=3, idle_timeout_sec=-1, ssl_ctx=) at /test/src/brpc/src/brpc/acceptor.cpp:82
    #12 0x00007fdcafd99cd7 in brpc::Server::StartInternal (this=this@entry=0x5592009cf080, ip=..., port_range=..., opt=opt@entry=0x0) at /test/src/brpc/src/brpc/server.cpp:919
    #13 0x00007fdcafd9b020 in brpc::Server::Start (this=this@entry=0x5592009cf080, endpoint=..., opt=opt@entry=0x0) at /test/src/brpc/src/brpc/server.cpp:997
    #14 0x00007fdcb58061af in test::start_brpc_server (this=this@entry=0x5592009cf040) at /test//src/test/test_manager.cpp:194
    #15 0x00007fdcb580626a in test::start (this=this@entry=0x5592009cf040) at /test//src/test/test_manager.cpp:106
    #16 0x00007fdcb58073da in test::run (this=this@entry=0x5592009cf000) at /test//src/test/test.cpp:189
    #17 0x00007fdcb5807a1f in test::start_work_process (this=this@entry=0x5592009cf000) at /test//src/test/test.cpp:177
    #18 0x00007fdcb5808257 in test::daemon_thread (arg=0x5592009cf000) at /test//src/test/test.cpp:80
    #19 0x00007fdcb1759e25 in start_thread () from /lib64/libpthread.so.0
    #20 0x00007fdcae6d834d in clone () from /lib64/libc.so.6
    (gdb) f 5
    #5  brpc::Acceptor::BeforeRecycle (this=0x55920555bc20, sock=0x55920ac626c0) at /test/src/brpc/src/brpc/acceptor.cpp:325
    325     /test/src/brpc/src/brpc/acceptor.cpp: No such file or directory.
    (gdb) p _map_mutex
    $1 = {_native_handle = {__data = {__lock = 2, __count = 0, __owner = 88919, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000W[\001\000\001", '\000' <repeats 26 times>, __align = 2}}
    (gdb) info thr 1
      Id   Target Id         Frame
    * 1    Thread 0x7fdc95e0f700 (LWP 88919) "test" brpc::Acceptor::BeforeRecycle (this=0x55920555bc20, sock=0x55920ac626c0) at /test/src/brpc/src/brpc/acceptor.cpp:325
    (gdb)

---Program log---

    2022-06-01 10:54:08.367247 - info test-5d6f30c3 W0601 10:54:08.367146 186761 socket.cpp:1219] Fail to add fd=4 into epoll: Bad file descriptor
    2022-06-01 10:54:08.367722 - info test-5d6f30c3 E0601 10:54:08.367461 186746 socket.cpp:589] Fail to add SocketId=455 into EventDispatcher, fd 3 ret -1 errno 9 reason Bad file descriptor: Bad file descriptor
    2022-06-01 10:54:08.367728 - info test-5d6f30c3 E0601 10:54:08.367470 186746 socket.cpp:669] Fail to ResetFileDescriptor: Bad file descriptor

From the logs so far, I have not been able to find out why epoll_ctl fails. For now I can only see the deadlock that follows the epoll_ctl failure. @JiaoZiLang

weingithub avatar Jun 01 '22 03:06 weingithub

The _map_mutex lock in Acceptor::StartAccept is too coarse-grained. If the lock were released while Socket::Create runs, the deadlock should not occur.
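To make the failure mode concrete: in the backtrace, StartAccept (frame #11) and BeforeRecycle (frame #5) run on the same thread, and the gdb output shows `__owner = 88919`, the same LWP that is blocked, so the thread is waiting on a mutex it already holds. The sketch below reproduces that pattern without hanging by using a POSIX error-checking mutex, which reports `EDEADLK` on a relock by the owner. `ToyAcceptor` and its methods are hypothetical stand-ins, not brpc code; only the locking pattern is modeled.

```cpp
#include <pthread.h>
#include <cerrno>

// Hypothetical stand-in for the acceptor; not the brpc implementation.
class ToyAcceptor {
public:
    ToyAcceptor() {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        // ERRORCHECK makes a relock by the owning thread return EDEADLK
        // instead of blocking forever as a default mutex would.
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&map_mutex_, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    ~ToyAcceptor() { pthread_mutex_destroy(&map_mutex_); }

    // Models StartAccept: holds the mutex across socket creation.
    int StartAccept(bool create_fails) {
        pthread_mutex_lock(&map_mutex_);
        int rc = CreateSocket(create_fails);
        pthread_mutex_unlock(&map_mutex_);
        return rc;
    }

private:
    // Models socket creation: on failure, the recycle callback runs
    // synchronously on the calling thread.
    int CreateSocket(bool fails) {
        return fails ? BeforeRecycle() : 0;
    }

    // Models BeforeRecycle: needs the same mutex the caller already holds.
    int BeforeRecycle() {
        int rc = pthread_mutex_lock(&map_mutex_);
        if (rc == 0) pthread_mutex_unlock(&map_mutex_);
        return rc;  // EDEADLK when the current thread already owns the mutex
    }

    pthread_mutex_t map_mutex_;
};
```

With a default (non-error-checking) mutex, the failing path would block in `pthread_mutex_lock`, which matches frames #0 to #2 of the backtrace.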

wwbmmm avatar Jun 06 '22 10:06 wwbmmm

According to this code comment: https://github.com/apache/incubator-brpc/blob/master/src/brpc/acceptor.cpp#L77 the lock still needs to be held while Socket::Create runs, so that approach cannot be used.

I have switched to a different approach; you can try PR #1791 @weingithub

wwbmmm avatar Jun 10 '22 07:06 wwbmmm

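Thanks for your help. Looking at the code change, I see that it modifies the failure logic of the socket's Create. It will certainly resolve the current deadlock. However, I am not sure whether it could introduce new problems elsewhere; this interface seems to be called from quite a few places.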

weingithub avatar Jun 13 '22 04:06 weingithub

SocketUser is currently inherited in a few places:

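The PR's logic is to keep the Create path from calling back BeforeRecycle. In summary, the PR fixes 2 potential double frees and 1 deadlock, and beyond that it should have no other impact.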
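As a rough illustration of that idea (not the actual diff in the PR), the sketch below uses hypothetical `ToyUser`, `ToySocket`, and `FixedAcceptor` classes. A failed `Create` returns without invoking the user's `BeforeRecycle`, so the caller can keep holding its mutex across `Create`. The callback still fires when a successfully created socket is recycled later, outside the lock.

```cpp
#include <mutex>

// Hypothetical callback interface, loosely modeled on the SocketUser role.
class ToyUser {
public:
    virtual ~ToyUser() {}
    virtual void BeforeRecycle() = 0;
};

// Hypothetical socket; not the brpc Socket.
class ToySocket {
public:
    // On failure, returns nullptr WITHOUT calling user->BeforeRecycle().
    static ToySocket* Create(ToyUser* user, bool setup_fails) {
        if (setup_fails) return nullptr;
        return new ToySocket(user);
    }
    // Normal teardown of a live socket still notifies the user.
    void Recycle() {
        user_->BeforeRecycle();
        delete this;
    }

private:
    explicit ToySocket(ToyUser* user) : user_(user) {}
    ToyUser* user_;
};

class FixedAcceptor : public ToyUser {
public:
    // Holds the mutex across Create; safe because a failed Create
    // no longer re-enters BeforeRecycle on this thread.
    bool StartAccept(bool setup_fails) {
        std::lock_guard<std::mutex> guard(map_mutex_);
        sock_ = ToySocket::Create(this, setup_fails);
        return sock_ != nullptr;
    }
    void StopAccept() {
        ToySocket* s = nullptr;
        {
            std::lock_guard<std::mutex> guard(map_mutex_);
            s = sock_;
            sock_ = nullptr;
        }
        if (s) s->Recycle();  // the callback runs outside the lock
    }
    void BeforeRecycle() override {
        std::lock_guard<std::mutex> guard(map_mutex_);
        ++recycled_;
    }
    int recycled() const { return recycled_; }

private:
    std::mutex map_mutex_;
    ToySocket* sock_ = nullptr;
    int recycled_ = 0;
};
```

In this toy model a failed `StartAccept(true)` returns `false` with `recycled()` still at 0, and the owner cleans up whatever it had prepared itself, which is one way to avoid both the re-entrant lock and a second cleanup of the same object.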

wwbmmm avatar Jun 13 '22 05:06 wwbmmm