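Invalid callback function address?
We have hit a crash in the trantor library several times in a containerized runtime environment. From the stack and memory analysis, it looks like the callback function address being called is invalid, but I can't confirm anything more specific. Network requests in this environment are very frequent and the data volume is large.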
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000056107971fed0 in ?? ()
[Current thread is 1 (Thread 0x7f552c07d700 (LWP 35934))]
(gdb) bt
#0 0x000056107971fed0 in ?? ()
#1 0x00007f552e60bbcc in trantor::Channel::handleEventSafely() () from /usr/local/lib/CET/libtrantor.so.1
#2 0x00007f552e60bc7f in trantor::Channel::handleEvent() () from /usr/local/lib/CET/libtrantor.so.1
#3 0x00007f552e600080 in trantor::EventLoop::loop() () from /usr/local/lib/CET/libtrantor.so.1
#4 0x00007f552e602342 in trantor::EventLoopThread::loopFuncs() () from /usr/local/lib/CET/libtrantor.so.1
#5 0x00007f55350e4b2f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007f5534cb0fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#7 0x00007f5534dc2eff in __init_misc (argc=<optimized out>, argv=0x7f552c07d700, envp=0x561078c5c900) at init-misc.c:33
#8 0x0000000000000000 in ?? ()
Below is what I found by examining the stack pointer. From the disassembly, my guess is that offset 92 in handleEventSafely is roughly the call to readCallback_?
(gdb) x /34a $rsp
0x7f552c07cd18: 0x7f552e60bbcc <_ZN7trantor7Channel17handleEventSafelyEv+92> 0x56107bcf3748
0x7f552c07cd28: 0x7f552e60bc7f <_ZN7trantor7Channel11handleEventEv+111> 0x7f552c07ce10
0x7f552c07cd38: 0x34e84ca07af18718 0x56107af18720
0x7f552c07cd48: 0x7f552c07ce10 0x7f552c07ce20
0x7f552c07cd58: 0x7f552e600080 <_ZN7trantor9EventLoop4loopEv+144> 0x561078a5f8d0
0x7f552c07cd68: 0x561078954208 0x561078a5f7e0
0x7f552c07cd78: 0x7f552c07cdf0 0x7f552c07ce10
0x7f552c07cd88: 0x7f552e602342 <_ZN7trantor15EventLoopThread9loopFuncsEv+626> 0x0
0x7f552c07cd98: 0x100000000000000 0x7f552c07ce10
0x7f552c07cda8: 0x561078a5f880 0x7f552c07cdd0
0x7f552c07cdb8: 0x7f552c07cd9f 0x7f552e93d510 <_ZNSt13__future_base13_State_baseV29_M_do_setEPSt8functionIFSt10unique_ptrINS_12_Result_baseENS3_8_DeleterEEvEEPb>
0x7f552c07cdc8: 0x0 0x561078a5f808
0x7f552c07cdd8: 0x7f552c07cda0 0x7f552e602c90 <_ZNSt14_Function_base13_Base_managerINSt13__future_base13_State_baseV27_SetterIPN7trantor9EventLoopEOS6_EEE10_M_managerERSt9_Any_dataRKSA_St18_Manager_operation>
0x7f552c07cde8: 0x7f552e602d20 <_ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_13_State_baseV27_SetterIPN7trantor9EventLoopEOSA_EEE9_M_invokeERKSt9_Any_data> 0x0
0x7f552c07cdf8: 0x7f552c07cda8 0x7f552c07cdb0
0x7f552c07ce08: 0x7f552c07cdb8 0x1
0x7f552c07ce18: 0x7f552c07d700 0x0
Below are the register and stack frame details.
(gdb) i r
rax 0x56107971fed0 94628756979408
rbx 0x56107afbe990 94628782795152
rcx 0x56107a93d9b0 94628775975344
rdx 0x561078c5c900 94628745693440
rsi 0x56107a93d9b0 94628775975344
rdi 0x56107bcf3750 94628796643152
rbp 0x56107bcf3740 0x56107bcf3740
rsp 0x7f552c07cd18 0x7f552c07cd18
r8 0x56107ad76a40 94628780403264
r9 0x56107ad76a20 94628780403232
r10 0x7 7
r11 0x246 582
r12 0x7f552c07ce20 140003787656736
r13 0x7f552c07ce30 140003787656752
r14 0x1 1
r15 0x561078a5f8b0 94628743608496
rip 0x56107971fed0 0x56107971fed0
eflags 0x10246 [ PF ZF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
k0 0x0 0
k1 0x0 0
k2 0x0 0
k3 0x0 0
k4 0x0 0
k5 0x0 0
k6 0x0 0
k7 0x0 0
(gdb) i f
Stack level 0, frame at 0x7f552c07cd20:
rip = 0x56107971fed0; saved rip = 0x7f552e60bbcc
called by frame at 0x7f552c07cd30
Arglist at 0x7f552c07cd10, args:
Locals at 0x7f552c07cd10, Previous frame's sp is 0x7f552c07cd20
Saved registers:
rip at 0x7f552c07cd18
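Version info: drogon 1.7.5, trantor 1.5.5
Try upgrading to the latest version first; this version is too old.
Many of our production deployments of drogon run this version, so upgrading requires going through some process.
What I'd like to ask for now is whether this kind of invalid callback address has been seen before under high load. It has already happened 3 times at one site.
This problem hasn't been reported before. There may be a race condition, which high load makes more likely to trigger. Is drogon used as a client or as a server in your environment?
As a server.
Can this problem still be analyzed? The callback code is giving me a headache...
Yes. You need a debug build, then look at the call stack in the coredump to see where it crashes, and then think about a fix. But this version is from two years ago, so even if you fix it, the patch probably can't be applied to the new version. You can only report the cause of the error, and I will then review whether the new version has the same problem...
Can it be reproduced with a local stress test?
I couldn't reproduce it locally; it has only occurred on site.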
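By modifying the source I have now reproduced the problem. I changed two things: in the socket destructor I disabled releasing the socket, and I disabled the epoll_ctl call that unregisters the TCP channel. With this simulation the problem reproduces.
In addition, on site we swapped in a build of trantor (the library drogon uses) with added logging. The logs also show that the channel pointer should in theory have been freed, yet the read callback is still being invoked. So epoll presumably failed to remove this pointer, and the socket presumably was not released either and is still receiving messages.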
2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] connectDestroyed
2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] Channel: remove, chn ptr=0x5573D118F890 owner:0x5573D14C93D0 TcpConnectionImpl
2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] EventLoop: removeChannel
2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] EpollPoller::removeChannel
2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] ~TcpConnectionImpl: free ptr: 0x5573D14C93D0
2024-08-28 18:03:03.672 - INFO - 139939166525184 - [drogon Info] handleEventSafely: handle read:0x5573D118F8B0 chnP=0x5573D118F890 owner:0x5573D14C93D0
@an-tao @fantasy-peak
Does the latest trantor have this problem too?
I just took a look. You mean that ::epoll_ctl(epollfd_, operation, fd, &event) in EpollPoller::update failed, right? https://github.com/an-tao/trantor/blob/65f245539215a8c25e04cd475c13d16044209a66/trantor/net/inner/poller/EpollPoller.cc#L183 https://github.com/an-tao/trantor/blob/65f245539215a8c25e04cd475c13d16044209a66/trantor/net/inner/poller/EpollPoller.cc#L203
Yes, that's the current guess.
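One way to test this guess would be to make the unregistration path report a failing epoll_ctl instead of staying silent. The following is only a minimal sketch, not trantor's actual code: `removeFdChecked` and `demoRemoveTwice` are hypothetical names, and the logging format is made up. It shows that a failed `EPOLL_CTL_DEL` is visible through the return value and `errno`, so a patched build could log it next to the channel pointer.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper (not part of trantor): unregister fd from epfd and
// report a failure instead of ignoring it.
// Returns 0 on success, or the errno value set by epoll_ctl on failure.
int removeFdChecked(int epfd, int fd)
{
    // The event argument is ignored for EPOLL_CTL_DEL, but very old kernels
    // required a non-null pointer, so one is passed anyway.
    struct epoll_event ev;
    std::memset(&ev, 0, sizeof(ev));
    if (::epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) == 0)
        return 0;
    const int err = errno;
    std::fprintf(stderr,
                 "epoll_ctl(DEL) failed: fd=%d errno=%d (%s)\n",
                 fd,
                 err,
                 std::strerror(err));
    return err;
}

// Self-check: register the read end of a pipe, remove it twice, and return
// the result of the second removal. The second call fails because the fd is
// no longer registered with this epoll instance.
int demoRemoveTwice()
{
    const int epfd = ::epoll_create1(0);
    if (epfd < 0)
        return -1;
    int fds[2];
    if (::pipe(fds) != 0)
    {
        ::close(epfd);
        return -1;
    }
    struct epoll_event ev;
    std::memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fds[0];
    int rc = ::epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);
    assert(rc == 0);
    (void)rc;

    const int first = removeFdChecked(epfd, fds[0]);
    assert(first == 0);
    (void)first;
    const int second = removeFdChecked(epfd, fds[0]);

    ::close(fds[0]);
    ::close(fds[1]);
    ::close(epfd);
    return second;
}
```

If the on-site logs never show such a failure while the stale read callback still fires, the removal itself is probably succeeding and the ordering of teardown versus event dispatch becomes the more likely suspect.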
1.5.5
void TcpServer::handleCloseInLoop(const TcpConnectionPtr &connectionPtr)
{
    size_t n = connSet_.erase(connectionPtr);
    (void)n;
    assert(n == 1);
    auto connLoop = connectionPtr->getLoop();
    if (connLoop == loop_)
    {
        static_cast<TcpConnectionImpl *>(connectionPtr.get())
            ->connectDestroyed();
    }
    else
    {
        connLoop->queueInLoop([connectionPtr]() {
            static_cast<TcpConnectionImpl *>(connectionPtr.get())
                ->connectDestroyed();
        });
    }
}
Latest:
void TcpServer::handleCloseInLoop(const TcpConnectionPtr &connectionPtr)
{
    size_t n = connSet_.erase(connectionPtr);
    (void)n;
    assert(n == 1);
    auto connLoop = connectionPtr->getLoop();
    // NOTE: always queue this operation in connLoop, because this connection
    // may be in loop_'s current active channels, waiting to be processed.
    // If `connectDestroyed()` is called here, we will be using an wild pointer
    // later.
    connLoop->queueInLoop(
        [connectionPtr]() { connectionPtr->connectDestroyed(); });
}
https://github.com/an-tao/trantor/pull/206 may have already fixed this in the latest version.
@shong99 Have you upgraded to the latest version?
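To illustrate why the two versions differ, here is a toy model of the situation the NOTE comment describes. It is a sketch under assumptions, not trantor's implementation: `ToyChannel`, `ToyLoop` and `countStaleDispatches` are hypothetical names, and a `removed` flag stands in for freed memory so the example stays well-defined. One poll round returns two active channels, and the handler of the first one closes the connection that owns the second.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a channel. `removed` models "the owner was torn down";
// in the real crash this would be freed memory rather than a flag.
struct ToyChannel
{
    bool removed = false;
    std::function<void()> onRead;
};

// Toy stand-in for an event loop: one round dispatches the active channels
// first, then runs the functors queued during that round.
struct ToyLoop
{
    std::vector<std::shared_ptr<ToyChannel>> active;  // result of one "poll"
    std::vector<std::function<void()>> pending;       // queueInLoop analogue
    int staleDispatches = 0;

    void queueInLoop(std::function<void()> f)
    {
        pending.push_back(std::move(f));
    }

    void runOnce()
    {
        for (auto &ch : active)
        {
            if (ch->removed)
            {
                // A real loop would call through a dangling pointer here.
                ++staleDispatches;
                continue;
            }
            if (ch->onRead)
                ch->onRead();
        }
        active.clear();
        auto funcs = std::move(pending);
        pending.clear();
        for (auto &f : funcs)
            f();
    }
};

// Channel A's handler closes the connection owning channel B while B is
// still waiting in the same active list. Returns how many times the loop
// dispatched to an already-removed channel.
int countStaleDispatches(bool deferDestroy)
{
    ToyLoop loop;
    auto a = std::make_shared<ToyChannel>();
    auto b = std::make_shared<ToyChannel>();
    auto destroyB = [b] { b->removed = true; };
    a->onRead = [&loop, destroyB, deferDestroy] {
        if (deferDestroy)
            loop.queueInLoop(destroyB);  // like the latest version
        else
            destroyB();  // like the direct call when connLoop == loop_
    };
    loop.active = {a, b};
    loop.runOnce();
    return loop.staleDispatches;
}
```

In this model the immediate path dispatches once to a removed channel, while the queued path dispatches to none, because the teardown only runs after the current active list has been processed. Whether this is exactly the path behind the crash reported here is not confirmed by the thread.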