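With h2:grpc, the client (brpc) spins in an infinite loop when the server-side service (gRPC) stays unavailable
Describe the bug
When using h2:grpc, if the server-side service (gRPC) stays unavailable, the client (brpc) gets stuck in an infinite loop. While the gRPC service is unavailable, the client logs the following error: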
E0108 07:08:32.286566 111758 xxx_client.cc:166] call xxx server failed, Request to x.x.x.x:52618 failed: [E2001][11.18.42.196:52618][E112]xxx_server response :[E112]Not connected to x.x.x.x:8000 yet, server_id=xxxx [R1][E112]Not connected to x.x.x.x:8000 yet, server_id=x.x.x.x [R2][E112]Not connected to x.x.x.x:8000 yet, server_id=x.x.x.x [R3][E112]Not connected to x.x.x.x:8000 yet, server_id=x.x.x.x
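By then there was no traffic at all, yet the client's CPU usage stayed at around 98% and never recovered. The pstack output is as follows:
A large number of threads are stuck at the locations shown here even though there is no traffic whatsoever, and multiple machines have the same problem.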
Thread 494 (Thread 0x7f6fa67c6700 (LWP 491633)):
#0 0x0000000009aa5cc0 in load (__m=std::memory_order_acquire, this=0x7f8767b88080) at /opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7/../../../../include/c++/7/bits/atomic_base.h:396
#1 steal (val=0x7f6fa67c2888, this=0x7f8767b88080) at external/brpc/src/bthread/work_stealing_queue.h:116
#2 bthread::TaskControl::steal_task (this=0x7f9eef03f000, tid=tid@entry=0x7f6fa67c2888, seed=seed@entry=0x7f88a4714050, offset= ) at external/brpc/src/bthread/task_control.cpp:347
#3 0x0000000009a9db10 in steal_task (tid=0x7f6fa67c2888, this=0x7f88a4714000) at external/brpc/src/bthread/task_group.h:224
#4 bthread::TaskGroup::wait_task (this=this@entry=0x7f88a4714000, tid=tid@entry=0x7f6fa67c2888) at external/brpc/src/bthread/task_group.cpp:123
#5 0x0000000009aa3a6f in bthread::TaskGroup::run_main_task (this=this@entry=0x7f88a4714000) at external/brpc/src/bthread/task_group.cpp:150
#6 0x0000000009aa702d in bthread::TaskControl::worker_thread (arg=0x7f9eef03f000) at external/brpc/src/bthread/task_control.cpp:73
#7 0x00007f9fed8aadc5 in start_thread () from /lib64/libpthread.so.0
#8 0x00007f9febe9aced in clone () from /lib64/libc.so.6
Thread 549 (Thread 0x7f6fc1ffd700 (LWP 491578)):
#0 0x0000000009aa5ca8 in bthread::TaskControl::steal_task (this=0x7f9eef03f000, tid=tid@entry=0x7f994d7f7cc8, seed=seed@entry=0x7f9fb3c0d1d0, offset= ) at external/brpc/src/bthread/task_control.cpp:344
#1 0x0000000009aa4266 in steal_task (tid=0x7f994d7f7cc8, this=0x7f9fb3c0d180) at external/brpc/src/bthread/task_group.h:224
#2 bthread::TaskGroup::sched (pg=pg@entry=0x7f994d7f7d48) at external/brpc/src/bthread/task_group.cpp:590
#3 0x0000000009aa43b0 in bthread::TaskGroup::usleep (pg=pg@entry=0x7f994d7f7d48, timeout_us=timeout_us@entry=100000) at external/brpc/src/bthread/task_group.cpp:827
#4 0x0000000009a98b4c in bthread_usleep (microseconds=microseconds@entry=100000) at external/brpc/src/bthread/bthread.cpp:358
#5 0x0000000009839bf0 in brpc::policy::XXXNamingService::RunNamingService (this=0x7f86af3f3f50, service_name=0x7f86aad8af18 "service_xxxx", actions=0x7f86ac6251e0) at external/brpc/src/brpc/policy/xxx_naming_service.cpp:111
#6 0x00000000097cdbca in brpc::NamingServiceThread::Run (this=0x7f86ac625140) at external/brpc/src/brpc/details/naming_service_thread.cpp:365
#7 0x00000000097cdcf9 in brpc::NamingServiceThread::RunThis (arg= ) at external/brpc/src/brpc/details/naming_service_thread.cpp:268
#8 0x0000000009aa3207 in bthread::TaskGroup::task_runner (skip_remained= ) at external/brpc/src/bthread/task_group.cpp:309
#9 0x0000000009aba771 in bthread_make_fcontext ()
#10 0x0000000000000000 in ?? ()
To Reproduce
Expected behavior
Versions
OS:
Compiler:
brpc:
protobuf:
Additional context/screenshots
Could someone who knows this area take a look?
@romiguan
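Does the client recover once the server is back? Is the CPU time really all going to steal_task, and could you provide a CPU profile? How many CPU cores does the machine have, and how many brpc threads are configured? Does the client's business logic retry, so that it keeps hitting the downstream server?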
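To make the retry question concrete: if the application wraps its RPC in an unbounded retry loop, it keeps issuing calls for as long as the server is down, which could look like a busy loop even with no external traffic. A minimal sketch of a bounded retry with exponential backoff follows. `CallWithBoundedRetry` and `RetryResult` are hypothetical names used only for illustration and are not part of brpc; `attempt` stands in for whatever the client does to issue one call.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical result type for illustration; not a brpc type.
struct RetryResult {
    bool ok;       // true if some attempt succeeded
    int attempts;  // number of attempts actually made
};

// Hypothetical helper, not part of brpc. Calls `attempt` until it returns
// true or `max_attempts` calls have been made. Between failed attempts it
// sleeps, doubling the delay each time up to `max_backoff`.
// Assumption: this runs on a plain thread. Inside a bthread, a
// bthread-aware sleep such as bthread_usleep would be the natural
// replacement for std::this_thread::sleep_for.
RetryResult CallWithBoundedRetry(const std::function<bool()>& attempt,
                                 int max_attempts,
                                 std::chrono::milliseconds initial_backoff,
                                 std::chrono::milliseconds max_backoff) {
    std::chrono::milliseconds backoff = initial_backoff;
    for (int i = 1; i <= max_attempts; ++i) {
        if (attempt()) {
            return {true, i};
        }
        if (i == max_attempts) {
            break;  // no point sleeping after the final failure
        }
        std::this_thread::sleep_for(backoff);
        backoff = std::min(backoff * 2, max_backoff);
    }
    // Every attempt failed (or max_attempts was not positive).
    return {false, std::max(max_attempts, 0)};
}
```

With a cap like this the client gives up after a fixed number of calls instead of retrying for as long as the server is down. Comparing the client's actual retry logic against this pattern may help rule application-level retries in or out as the source of the CPU usage.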