phxsql icon indicating copy to clipboard operation
phxsql copied to clipboard

agent_monitor线程死锁问题

Open weipatty opened this issue 7 years ago • 5 comments

大概情况为: 三台机子,一台由于末知原因一直追同步不上。然后剩下两台选不出主(主信息过期,一直选不出)。集群不工作。

调查发现剩下的两台机子的agent_monitor线程都hang住了 一台估计hang在

int Committer :: NewValueGetIDNoRetry(const std::string & sValue, uint64_t & llInstanceID, SMCtx * poSMCtx)
{
    LogStatus();

    int iLockUseTimeMs = 0;
    bool bHasLock = m_oWaitLock.Lock(m_iTimeoutMs, iLockUseTimeMs);
    if (!bHasLock)
    {
        if (iLockUseTimeMs > 0)
        {
            BP->GetCommiterBP()->NewValueGetLockTimeout();
            PLGErr("Try get lock, but timeout, lockusetime %dms", iLockUseTimeMs);
            return PaxosTryCommitRet_Timeout; 
        }
        else
        {
            BP->GetCommiterBP()->NewValueGetLockReject();
            PLGErr("Try get lock, but too many thread waiting, reject");
            return PaxosTryCommitRet_TooManyThreadWaiting_Reject;
        }
    }

    int iLeftTimeoutMs = -1;
    if (m_iTimeoutMs > 0)
    {
        iLeftTimeoutMs = m_iTimeoutMs > iLockUseTimeMs ? m_iTimeoutMs - iLockUseTimeMs : 0;
        if (iLeftTimeoutMs < 200)
        {
            PLGErr("Get lock ok, but lockusetime %dms too long, lefttimeout %dms", iLockUseTimeMs, iLeftTimeoutMs);

            BP->GetCommiterBP()->NewValueGetLockTimeout();

            m_oWaitLock.UnLock();
            return PaxosTryCommitRet_Timeout;
        }
    }

    PLGImp("GetLock ok, use time %dms", iLockUseTimeMs);
    
    BP->GetCommiterBP()->NewValueGetLockOK(iLockUseTimeMs);

    //pack smid to value
    int iSMID = poSMCtx != nullptr ? poSMCtx->m_iSMID : 0;
    
    string sPackSMIDValue = sValue;
    m_poSMFac->PackPaxosValue(sPackSMIDValue, iSMID);

    m_poCommitCtx->NewCommit(&sPackSMIDValue, poSMCtx, iLeftTimeoutMs);
    m_poIOLoop->AddNotify();

    int ret = m_poCommitCtx->GetResult(llInstanceID);

    m_oWaitLock.UnLock();
    return ret;
}

估计在bool bHasLock = m_oWaitLock.Lock(m_iTimeoutMs, iLockUseTimeMs);这一句 是取不到锁

hang住前日志如下:

I1115 04:23:19.002990 92537 phx_glog.cpp:86]  Process check running ret 0 
I1115 04:23:20.000658 92537 phx_glog.cpp:86]  DoQuery mysql_query show variables like 'super_read_only'; done 0,  
E1115 04:23:22.002826 92537 phx_glog.cpp:78]  DoQuery[mysql] mysql_query fail 1, Lost connection to MySQL server during query, command show status like 'Slave_running'; 
I1115 04:23:22.003125 92537 phx_glog.cpp:86]  DoQuery mysql_query show status like 'Slave_running'; done 5002, Lost connection to MySQL server during query 
I1115 04:23:22.003175 92537 phx_glog.cpp:86]  CheckRunning check super read only OFF is master 1 ret 0 
I1115 04:23:22.003180 92537 phx_glog.cpp:86]  Process check running ret 0 
I1115 04:23:22.003768 92537 phx_glog.cpp:86]  DoQuery mysql_query show global variables like 'gtid_executed'; done 0,  
I1115 04:23:22.003829 92537 phx_glog.cpp:86]  IsGTIDCompleted get gtid 61a0f7cd-c47b-11e7-996a-246e960fda64:1173213 max gtid 61a0f7cd-c47b-11e7-996a-246e960fda64:1173210, master start 

I1115 04:23:22.003849 92537 phx_glog.cpp:86]  SendMasterInfo ip xxx.xxx.xxx.xxx svr id -611884052 version 92025 expire time 0 admin username size 2 replica username size 2 
E1115 04:23:22.003861 92537 phx_glog.cpp:78] STATUS(0): PN8phxpaxos9CommitterE::LogStatus wait threads 0 avg thread wait ms 0 reject rate 0

hang住前有个 DoQuery[mysql] mysql_query fail 1, Lost connection to MySQL server during query, command show status like 'Slave_running';

在15号04:23,然后就没日志了,其中xxx.xxx.xxx.xxx是自己,然后15号白天重启过一次,mysql一直卡在so的2017-11-16 02:35:52 24033 [Note] get last send gtid failed, svrid [61a0f7cd-c47b-11e7-996a-246e960fda64] ret -1启不来, agent_monitor一直连不上mysql报错。然后是在1116 02:35:54左右可以了,选出了主。因为重启了hang后日志就不发了?

另一台hang在 Committer :: NewValueGetIDNoRetry的GetResult中

int CommitCtx :: GetResult(uint64_t & llSuccInstanceID)
{
    while (!m_bIsCommitEnd)
    {
        m_oSerialLock.WaitTime(1000);
    }

    if (m_iCommitRet == 0)
    {
        llSuccInstanceID = m_llInstanceID;
        PLGImp("commit success, instanceid %lu", llSuccInstanceID);
    }
    else
    {
        PLGErr("commit fail, ret %d", m_iCommitRet);
    }
    
    m_oSerialLock.UnLock();

    return m_iCommitRet;
}

是在等提交吧,

hang住前日志如下:

I1115 04:23:48.003327 28923 phx_glog.cpp:86]  SendMasterInfo ip xxx.xxx.xxx.xxx svr id -1222346155 v
```ersion 92025 expire time 0 admin username size 2 replica username size 2 
E1115 04:23:48.003347 28923 phx_glog.cpp:78] STATUS(0): PN8phxpaxos9CommitterE::LogStatus wait threads 0 avg thread wait `ms` 0 reject rate 0
I1115 04:23:48.003535 28923 phx_glog.cpp:86]  Showy(0): PN8phxpaxos9CommitterE::NewValueGetIDNoRetry GetLock ok, use time 0ms 
W1115 04:23:48.003551 28923 phx_glog.cpp:74]  Imp(0): PN8phxpaxos9CommitCtxE::NewCommit OK, valuesize 135 

然后是过了快一天, hang住后日志如下:

E1116 02:35:52.248410 28923 phx_glog.cpp:78]  ERR(0): PN8phxpaxos9CommitCtxE::GetResult commit fail, ret 14 
E1116 02:35:52.250871 28923 phx_glog.cpp:78] STATUS(0): PN8phxpaxos9CommitterE::LogStatus wait threads 0 avg thread wait ms 0 reject rate 0
I1116 02:35:52.256392 28923 phx_glog.cpp:86]  Showy(0): PN8phxpaxos9CommitterE::NewValueGetIDNoRetry GetLock ok, use time 0ms 
W1116 02:35:52.272595 28923 phx_glog.cpp:74]  Imp(0): PN8phxpaxos9CommitCtxE::NewCommit OK, valuesize 135 
E1116 02:35:52.282235 28923 phx_glog.cpp:78]  ERR(0): PN8phxpaxos9CommitCtxE::GetResult commit fail, ret 14 
I1116 02:35:52.282516 28923 phx_glog.cpp:86]  Showy(0): PN8phxpaxos9CommitterE::NewValueGetIDNoRetry GetLock ok, use time 0ms 
W1116 02:35:52.282620 28923 phx_glog.cpp:74]  Imp(0): PN8phxpaxos9CommitCtxE::NewCommit OK, valuesize 135 
E1116 02:35:52.293189 28923 phx_glog.cpp:78]  ERR(0): PN8phxpaxos9CommitCtxE::GetResult commit fail, ret 14 
E1116 02:35:52.293473 28923 phx_glog.cpp:78]  proposal fail, ret -2500 
E1116 02:35:52.293752 28923 phx_glog.cpp:78]  SendMasterInfo set master fail, ret -2500 
I1116 02:35:52.294033 28923 phx_glog.cpp:86]  GetCurrentInstanceInterval 0x2c2f4f0 get newest instance id 35503177 oldest instanceid 26270944 

卡了一天最终返回 14 PaxosTryCommitRet_Conflict 是一个冲突了

请问是什么原因会导致这种死锁?

weipatty avatar Nov 16 '17 09:11 weipatty

image

新跟踪发现: 8001的paxos协议端口,应该不可能有这么大的包吧?要分片了,有一台机子没认这个包,上层没收到,导致选主一直选不通过。

weipatty avatar Nov 20 '17 06:11 weipatty

请问这个问题有没有复现的方法呢?

wodesuck avatar Nov 20 '17 12:11 wodesuck

暂时没有重现方法,但我们用的是比较旧的版本,有没有可能是这个bug的原因? https://github.com/Tencent/phxpaxos/releases/tag/v1.1.1

修复了一个重大BUG,某些编译环境下会导致单调时间获取不正确,从而使得设置了CLOCK_MONOTONIC的condition工作不正常,表现为运行卡死,Master无法选举等。如出现类似现象的,建议立即更新此版本。

weipatty avatar Nov 20 '17 12:11 weipatty

我也不确定是不是版本问题,复现不了的话有点难查。。你现在用的版本是多少?

wodesuck avatar Nov 20 '17 12:11 wodesuck

具体哪个版本不清楚,是在phxpaxos更新这个bug之前的一个版本,

weipatty avatar Nov 20 '17 12:11 weipatty