
The StatReport thread in Communicator may cause a coredump when the program restarts

Open wengang285 opened this issue 3 years ago • 5 comments

Symptom

When deploying a service, we occasionally see the service coredump. The core is at the call to async_reportMicMsg.

The code is as follows:


int StatReport::reportMicMsg(MapStatMicMsg& msg,bool bFromClient)
{
    if (msg.empty()) return 0;
    try
    {
       int iLen = 0;
       MapStatMicMsg  mTemp;
       MapStatMicMsg  mStatMsg;
       mStatMsg.clear();
       mTemp.clear();
       {
           Lock lock(*this);
           msg.swap(mStatMsg);
       }

       TLOGTARS("[StatReport::reportMicMsg get size:" << mStatMsg.size()<<"]"<< endl);
       for(MapStatMicMsg::iterator it = mStatMsg.begin(); it != mStatMsg.end(); it++)
       {
           const StatMicMsgHead &head = it->first;
           int iTemLen = STAT_PROTOCOL_LEN +head.masterName.length() + head.slaveName.length() + head.interfaceName.length()
               + head.slaveSetName.length() + head.slaveSetArea.length() + head.slaveSetID.length();
           iLen = iLen + iTemLen;
           if(iLen > _maxReportSize) // must not exceed the UDP limit of 1472
           {
               if(_statPrx)
               {
                   TLOGTARS("[StatReport::reportMicMsg send size:" << mTemp.size()<<"]"<< endl);
                   _statPrx->tars_set_timeout(_reportTimeout)->async_reportMicMsg(NULL,mTemp,bFromClient, ServerConfig::Context);
               }
               iLen = iTemLen;
               mTemp.clear();
           }

           mTemp[head] = it->second;
           if(LOG->isNeedLog(LocalRollLogger::INFO_LOG))
           {
               ostringstream os;
               os.str("");
               head.displaySimple(os);
               os << "  ";
               mTemp[head].displaySimple(os);
               TLOGTARS("[StatReport::reportMicMsg display:" << os.str() << "]" << endl);
           }
       }
       if(0 != (int)mTemp.size())
       {
           if(_statPrx)
           {
               TLOGTARS("[StatReport::reportMicMsg send size:" << mTemp.size()<<"]"<< endl);
               _statPrx->tars_set_timeout(_reportTimeout)->async_reportMicMsg(NULL,mTemp,bFromClient, ServerConfig::Context);
           }
       }
       return 0;
    }
    catch ( exception& e )
    {
        TLOGERROR("StatReport::report catch exception:" << e.what() << endl);
    }
    catch ( ... )
    {
        TLOGERROR("StatReport::report catch unkown exception" << endl);
    }
    return -1;
}

The problem is in async_reportMicMsg: by the time async_reportMicMsg is called, the communicator may already have been destroyed.

Cause:


void StatReport::run()
{
    while(!_terminate)
    {
        {
            Lock lock(*this);

            if (_terminate)
                return;

            timedWait(1000);

        }

        try
        {
            time_t tNow = TNOW;

            if(tNow - _time > _reportInterval/1000)
            {
                reportMicMsg(_statMicMsgClient, true);

                reportMicMsg(_statMicMsgServer, false);

                MapStatMicMsg mStatMsg;

                for(size_t i = 0; i < _epollNum; ++i)
                {
                    MapStatMicMsg * pStatMsg;
                    while(_statMsg[i]->pop_front(pStatMsg))
                    {
                        addMicMsg(mStatMsg,*pStatMsg);
                        delete pStatMsg;
                    }
                }
                // ... (remainder of run() omitted)

void StatReport::terminate()
{
    Lock lock(*this);

    _terminate = true;

    notifyAll();
}

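In the code above, StatReport waits for 1s in timedWait, which releases the lock. During that wait, Communicator::terminate calls _statReport->terminate(). That sets _terminate to true and wakes the StatReport thread, which then continues into reportMicMsg. By that point the communicator may already have been destroyed, which causes the coredump.

Solution

After timedWait(1000), check _terminate once more, as follows: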
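
The interleaving described above can be reproduced in a small model. This is a sketch in standard C++ (std::thread and std::condition_variable), not TarsCpp code: `ReporterModel`, `terminateDuringWait`, `lateReports` and `runAndTerminate` are hypothetical names, and `report()` only stands in for reportMicMsg by counting calls that happen after termination was requested.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for StatReport; not the TarsCpp class.
class ReporterModel {
public:
    // Same shape as StatReport::run() in the issue: check the flag,
    // wait up to 1s, then report without checking the flag again.
    void run() {
        while (!terminate_) {
            {
                std::unique_lock<std::mutex> lock(mutex_);
                if (terminate_) return;
                waiting_ = true;
                cv_.wait_for(lock, std::chrono::milliseconds(1000));
                waiting_ = false;
            }
            report();  // reached even if termination was requested during the wait
        }
    }

    // Like StatReport::terminate(), but fires only while run() is inside
    // its wait, so the interleaving from the issue is forced.
    void terminateDuringWait() {
        for (;;) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (waiting_) {
                    terminate_ = true;
                    cv_.notify_all();
                    return;
                }
            }
            std::this_thread::yield();
        }
    }

    int lateReports() const { return lateReports_; }

private:
    // Stand-in for reportMicMsg(): counts calls made after termination.
    void report() {
        if (terminate_) ++lateReports_;
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::atomic<bool> terminate_{false};
    bool waiting_ = false;  // guarded by mutex_
    std::atomic<int> lateReports_{0};
};

// Starts the worker, terminates it mid-wait, and returns how many
// reports ran after termination was requested.
int runAndTerminate() {
    ReporterModel model;
    std::thread worker(&ReporterModel::run, &model);
    model.terminateDuringWait();
    worker.join();
    return model.lateReports();
}
```

In this model `runAndTerminate()` returns 1: the worker wakes up, leaves the locked scope, and still performs one report after the terminate flag was set. In the real code that late report is the call that can touch an already destroyed communicator.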

void StatReport::run()
{
    while(!_terminate)
    {
        {
            Lock lock(*this);

            if (_terminate)
                return;

            timedWait(1000);

        }
        {
            Lock lock(*this);
            if (_terminate)
                return;
        }
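
The same re-check can also be written without taking the lock a second time. The sketch below uses standard C++ rather than the TarsCpp Lock/timedWait primitives, and it is not the change that was applied upstream. It relies on the predicate overload of std::condition_variable::wait_for, which returns the predicate's value, so the flag is tested while the lock is still held. `GuardedReporter`, `terminateDuringWait` and `runAndTerminate` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for StatReport with the re-check applied.
class GuardedReporter {
public:
    void run() {
        for (;;) {
            {
                std::unique_lock<std::mutex> lock(mutex_);
                waiting_ = true;
                // True as soon as terminate_ is set, false on timeout.
                bool stop = cv_.wait_for(lock, std::chrono::milliseconds(1000),
                                         [this] { return terminate_.load(); });
                waiting_ = false;
                if (stop) return;  // do not fall through to report()
            }
            report();
        }
    }

    // Sets the flag only while run() is inside its wait, to force the
    // interleaving described in the issue.
    void terminateDuringWait() {
        for (;;) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (waiting_) {
                    terminate_ = true;
                    cv_.notify_all();
                    return;
                }
            }
            std::this_thread::yield();
        }
    }

    int lateReports() const { return lateReports_; }

private:
    // Stand-in for reportMicMsg(): counts calls made after termination.
    void report() {
        if (terminate_) ++lateReports_;
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::atomic<bool> terminate_{false};
    bool waiting_ = false;  // guarded by mutex_
    std::atomic<int> lateReports_{0};
};

// Returns the number of reports that ran after termination was requested.
int runAndTerminate() {
    GuardedReporter reporter;
    std::thread worker(&GuardedReporter::run, &reporter);
    reporter.terminateDuringWait();
    worker.join();
    return reporter.lateReports();
}
```

Here `runAndTerminate()` returns 0, because a wake-up caused by termination exits before any report. The check only covers a terminate that arrives during the wait. If terminate arrives while a report is already in progress, the owner still has to join the thread before destroying anything the report uses.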

Aug 05 '21 11:08 wengang285

Strange. When Communicator terminates, it waits for the StatReport thread to finish (it joins the thread), so this problem should not be able to occur inside StatReport::run?

Aug 05 '21 13:08 ruanshudong

> Strange. When Communicator terminates, it waits for the StatReport thread to finish (it joins the thread), so this problem should not be able to occur inside StatReport::run?

After the StatReport thread wakes up, it goes on to execute reportMicMsg and calls Invoke, but by that time the _communicatorEpoll thread has already been deleted.
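
The thread does not show the order of operations inside Communicator::terminate, so the following is only a generic illustration of the ordering at stake, in standard C++: ask the worker to stop, join it, and only then free what it was using. `Channel`, `Owner` and `shutdownAfterJoin` are hypothetical names, not TarsCpp types.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>

// Hypothetical resource standing in for whatever the reporting thread uses.
struct Channel {
    std::atomic<int> sends{0};
    void send() { ++sends; }
};

// Hypothetical owner: the worker borrows channel_, so the worker must be
// joined before channel_ is destroyed.
class Owner {
public:
    Owner() : channel_(new Channel), worker_([this] { loop(); }) {}
    ~Owner() { shutdown(); }

    void shutdown() {
        stop_ = true;                            // 1. ask the worker to stop
        if (worker_.joinable()) worker_.join();  // 2. wait until it has stopped
        channel_.reset();                        // 3. only now free the resource
    }

    bool channelAlive() const { return channel_ != nullptr; }
    int sends() const { return channel_ ? channel_->sends.load() : 0; }

private:
    void loop() {
        while (!stop_) {
            channel_->send();
            std::this_thread::yield();
        }
    }

    // Declaration order matters: worker_ is constructed last, after the
    // members it reads already exist.
    std::atomic<bool> stop_{false};
    std::unique_ptr<Channel> channel_;
    std::thread worker_;
};

// Lets the worker send at least once, shuts down, and reports whether the
// channel was released.
bool shutdownAfterJoin() {
    Owner owner;
    while (owner.sends() == 0) std::this_thread::yield();
    owner.shutdown();
    return !owner.channelAlive();
}
```

Swapping steps 2 and 3 would recreate the situation described in this comment: the worker could still be inside `send()` while the object it calls is being freed.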

Aug 17 '21 07:08 wengang285

There is indeed a problem. I'll fix it.

Aug 19 '21 06:08 ruanshudong

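Hello, I'm gavin, a backend developer at 金腾科技. Our company has long used tars as its development framework. Along the way we have found and fixed some bugs in the tars framework and added some new features, such as lossless deployment and connection recycling. We would also like to take part in the open-source development of tars. How should we get in touch with you?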

Aug 31 '21 03:08 wengang285

Have you added me on WeChat? You can also email me at [email protected]

Sep 03 '21 02:09 ruanshudong