Abnormal bug under high traffic; even scaling out does not help
Describe the bug (描述bug)
We recently had an incident in production: when traffic exceeded what the service could handle, it auto-scaled, but even so the CPU of the scaled-out instances still rose to 100%. Even though traffic stayed roughly flat afterwards, the container count ended up at nearly 6x the original.
We reproduced it with a load test in the test environment: with 3 containers running, one container's CPU gradually climbed to 100% from the start while the other containers stayed around 20%; the moment the load test began to stop, the CPU of the other two containers also shot up to 100%.
```
#0  0x00007fa34edd1b5d in read () from /lib64/libc.so.6
#1  0x00000000007e824e in google::ReadPersistent(int, void*, unsigned long) ()
#2  0x00000000007e892e in google::FindSymbol(unsigned long, int, char*, int, unsigned long*, unsigned long, Elf64_Shdr const*, Elf64_Shdr const*) ()
#3  0x00000000007e907f in google::SymbolizeAndDemangle(void*, char*, int, unsigned long*) ()
#4  0x00000000006e8cd1 in butil::debug::(anonymous namespace)::ProcessBacktrace(void* const*, unsigned long, butil::debug::(anonymous namespace)::BacktraceOutputHandler*) ()
#5  0x00000000006e8d77 in butil::debug::StackTrace::OutputToStream(std::ostream*) const ()
#6  0x0000000000706f13 in logging::LogStream::FlushWithoutReset() ()
#7  0x0000000000707350 in logging::LogMessage::~LogMessage() ()
#8  0x0000000000610464 in brpc::policy::HttpResponseSender::~HttpResponseSender() ()
#9  0x0000000000613bb4 in brpc::policy::HttpResponseSenderAsDone::~HttpResponseSenderAsDone() ()
#10 0x0000000000498c10 in mtad::ssp::AsyncLoadHandler::SendResponse() ()
#11 0x00000000004e7fae in mtad::ssp::PMPPreLoadSession::OnPreLoadFinish() ()
#12 0x00000000004e6bc7 in brpc::internal::MethodClosure0<mtad::ssp::PMPPreLoadSession, std::shared_ptr<mtad::ssp::PMPPreLoadSession> >::Run() ()
#13 0x00000000005a1b52 in brpc::Controller::EndRPC(brpc::Controller::CompletionInfo const&) ()
#14 0x00000000005a1eb0 in brpc::Controller::RunEndRPC(void*) ()
#15 0x000000000058f9e7 in bthread::TaskGroup::task_runner(long) ()
#16 0x000000000073f5e1 in bthread_make_fcontext ()
#17 0x0000000000000000 in ?? ()
```
To Reproduce (复现方法)
Expected behavior (期望行为)
Versions (各种版本)
OS: CentOS 7.6
Compiler: gcc 14.2
brpc: 1.11.0
protobuf: 27.3
Additional context/screenshots (更多上下文/截图)
What log is being written here (the LogMessage being flushed in the stack trace above)? Could too much logging be the cause?
Looking at the brpc code, it seems to be some trace information.
By analyzing more of our services, we found that even when traffic changes little, a pod's CPU gets completely saturated within 1 minute of the pod being created.
Try analyzing it with a CPU profiler.
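For context, a minimal sketch of what that could look like with brpc's built-in profiler, following the cpu_profiler docs; the port (8010) and the commented-out service registration are placeholders, not taken from this issue:

```cpp
// Minimal sketch: rely on brpc's builtin services so /hotspots/cpu can be sampled
// while the CPU is pegged. Assumes the binary is compiled with
// -DBRPC_ENABLE_CPU_PROFILER and linked against libtcmalloc_and_profiler
// (gperftools), as described in brpc's cpu_profiler documentation.
#include <brpc/server.h>

int main() {
    brpc::Server server;
    // server.AddService(&your_service, brpc::SERVER_DOESNT_OWN_SERVICE);  // real service here

    brpc::ServerOptions options;
    if (server.Start(8010 /* placeholder port */, &options) != 0) {
        return -1;
    }
    // While reproducing the load, open http://<host>:8010/hotspots/cpu in a browser
    // (or fetch it with brpc's tools/pprof) to see where CPU time actually goes,
    // e.g. whether symbolization (google::FindSymbol) dominates as the stack suggests.
    server.RunUntilAskedToQuit();
    return 0;
}
```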
A question: this is an entry service, i.e. a server implemented with brpc, which also acts as a client making outbound requests.
Could we add some safeguard at the service level so that, once the service can no longer keep up, it stops processing further requests? Any suggestions here?
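Not an authoritative answer, but brpc's built-in concurrency limiting appears to match this need; below is a minimal sketch based on the server documentation, where the method name example.EchoService.Echo, the port, and the limit values are placeholders:

```cpp
// Sketch of server-side overload protection in brpc: requests beyond the
// concurrency limit are rejected with ELIMIT instead of queueing up and
// saturating the CPU. Names and numbers are placeholders.
#include <brpc/server.h>

int main() {
    brpc::Server server;
    // server.AddService(&your_service, brpc::SERVER_DOESNT_OWN_SERVICE);

    brpc::ServerOptions options;
    // Server-level cap on the number of requests processed concurrently.
    options.max_concurrency = 1024;

    // Method-level cap; set this after AddService() and before Start().
    // "auto" enables the adaptive concurrency limiter, which adjusts the limit
    // from observed latency and throughput.
    server.MaxConcurrencyOf("example.EchoService.Echo") = "auto";

    if (server.Start(8010, &options) != 0) {
        return -1;
    }
    server.RunUntilAskedToQuit();
    return 0;
}
```

The idea is that the instance sheds excess load early (clients see ELIMIT and can retry or back off) instead of accepting everything and letting CPU saturation slow every in-flight request down.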