phxsql
phxsql copied to clipboard
agent_monitor线程死锁问题
大概情况为: 三台机子,一台由于末知原因一直追同步不上。然后剩下两台选不出主(主信息过期,一直选不出)。集群不工作。
调查发现剩下的两台机子的agent_monitor线程都hang住了 一台估计hang在
int Committer :: NewValueGetIDNoRetry(const std::string & sValue, uint64_t & llInstanceID, SMCtx * poSMCtx)
{
LogStatus();
int iLockUseTimeMs = 0;
bool bHasLock = m_oWaitLock.Lock(m_iTimeoutMs, iLockUseTimeMs);
if (!bHasLock)
{
if (iLockUseTimeMs > 0)
{
BP->GetCommiterBP()->NewValueGetLockTimeout();
PLGErr("Try get lock, but timeout, lockusetime %dms", iLockUseTimeMs);
return PaxosTryCommitRet_Timeout;
}
else
{
BP->GetCommiterBP()->NewValueGetLockReject();
PLGErr("Try get lock, but too many thread waiting, reject");
return PaxosTryCommitRet_TooManyThreadWaiting_Reject;
}
}
int iLeftTimeoutMs = -1;
if (m_iTimeoutMs > 0)
{
iLeftTimeoutMs = m_iTimeoutMs > iLockUseTimeMs ? m_iTimeoutMs - iLockUseTimeMs : 0;
if (iLeftTimeoutMs < 200)
{
PLGErr("Get lock ok, but lockusetime %dms too long, lefttimeout %dms", iLockUseTimeMs, iLeftTimeoutMs);
BP->GetCommiterBP()->NewValueGetLockTimeout();
m_oWaitLock.UnLock();
return PaxosTryCommitRet_Timeout;
}
}
PLGImp("GetLock ok, use time %dms", iLockUseTimeMs);
BP->GetCommiterBP()->NewValueGetLockOK(iLockUseTimeMs);
//pack smid to value
int iSMID = poSMCtx != nullptr ? poSMCtx->m_iSMID : 0;
string sPackSMIDValue = sValue;
m_poSMFac->PackPaxosValue(sPackSMIDValue, iSMID);
m_poCommitCtx->NewCommit(&sPackSMIDValue, poSMCtx, iLeftTimeoutMs);
m_poIOLoop->AddNotify();
int ret = m_poCommitCtx->GetResult(llInstanceID);
m_oWaitLock.UnLock();
return ret;
}
估计在bool bHasLock = m_oWaitLock.Lock(m_iTimeoutMs, iLockUseTimeMs);这一句 是取不到锁
hang住前日志如下:
I1115 04:23:19.002990 92537 phx_glog.cpp:86] Process check running ret 0
I1115 04:23:20.000658 92537 phx_glog.cpp:86] DoQuery mysql_query show variables like 'super_read_only'; done 0,
E1115 04:23:22.002826 92537 phx_glog.cpp:78] DoQuery[mysql] mysql_query fail 1, Lost connection to MySQL server during query, command show status like 'Slave_running';
I1115 04:23:22.003125 92537 phx_glog.cpp:86] DoQuery mysql_query show status like 'Slave_running'; done 5002, Lost connection to MySQL server during query
I1115 04:23:22.003175 92537 phx_glog.cpp:86] CheckRunning check super read only OFF is master 1 ret 0
I1115 04:23:22.003180 92537 phx_glog.cpp:86] Process check running ret 0
I1115 04:23:22.003768 92537 phx_glog.cpp:86] DoQuery mysql_query show global variables like 'gtid_executed'; done 0,
I1115 04:23:22.003829 92537 phx_glog.cpp:86] IsGTIDCompleted get gtid 61a0f7cd-c47b-11e7-996a-246e960fda64:1173213 max gtid 61a0f7cd-c47b-11e7-996a-246e960fda64:1173210, master start
I1115 04:23:22.003849 92537 phx_glog.cpp:86] SendMasterInfo ip xxx.xxx.xxx.xxx svr id -611884052 version 92025 expire time 0 admin username size 2 replica username size 2
E1115 04:23:22.003861 92537 phx_glog.cpp:78] STATUS(0): PN8phxpaxos9CommitterE::LogStatus wait threads 0 avg thread wait ms 0 reject rate 0
hang住前有个 DoQuery[mysql] mysql_query fail 1, Lost connection to MySQL server during query, command show status like 'Slave_running';
在15号04:23,然后就没日志了,其中xxx.xxx.xxx.xxx是自己,然后15号白天重启过一次,mysql一直卡在so的2017-11-16 02:35:52 24033 [Note] get last send gtid failed, svrid [61a0f7cd-c47b-11e7-996a-246e960fda64] ret -1启不来, agent_monitor一直连不上mysql报错。然后是在1116 02:35:54左右可以了,选出了主。因为重启了hang后日志就不发了?
另一台hang在 Committer :: NewValueGetIDNoRetry的GetResult中
int CommitCtx :: GetResult(uint64_t & llSuccInstanceID)
{
while (!m_bIsCommitEnd)
{
m_oSerialLock.WaitTime(1000);
}
if (m_iCommitRet == 0)
{
llSuccInstanceID = m_llInstanceID;
PLGImp("commit success, instanceid %lu", llSuccInstanceID);
}
else
{
PLGErr("commit fail, ret %d", m_iCommitRet);
}
m_oSerialLock.UnLock();
return m_iCommitRet;
}
是在等提交吧,
hang住前日志如下:
I1115 04:23:48.003327 28923 phx_glog.cpp:86] SendMasterInfo ip xxx.xxx.xxx.xxx svr id -1222346155 v
```ersion 92025 expire time 0 admin username size 2 replica username size 2
E1115 04:23:48.003347 28923 phx_glog.cpp:78] STATUS(0): PN8phxpaxos9CommitterE::LogStatus wait threads 0 avg thread wait `ms` 0 reject rate 0
I1115 04:23:48.003535 28923 phx_glog.cpp:86] Showy(0): PN8phxpaxos9CommitterE::NewValueGetIDNoRetry GetLock ok, use time 0ms
W1115 04:23:48.003551 28923 phx_glog.cpp:74] Imp(0): PN8phxpaxos9CommitCtxE::NewCommit OK, valuesize 135
然后是过了快一天, hang住后日志如下:
E1116 02:35:52.248410 28923 phx_glog.cpp:78] ERR(0): PN8phxpaxos9CommitCtxE::GetResult commit fail, ret 14
E1116 02:35:52.250871 28923 phx_glog.cpp:78] STATUS(0): PN8phxpaxos9CommitterE::LogStatus wait threads 0 avg thread wait ms 0 reject rate 0
I1116 02:35:52.256392 28923 phx_glog.cpp:86] Showy(0): PN8phxpaxos9CommitterE::NewValueGetIDNoRetry GetLock ok, use time 0ms
W1116 02:35:52.272595 28923 phx_glog.cpp:74] Imp(0): PN8phxpaxos9CommitCtxE::NewCommit OK, valuesize 135
E1116 02:35:52.282235 28923 phx_glog.cpp:78] ERR(0): PN8phxpaxos9CommitCtxE::GetResult commit fail, ret 14
I1116 02:35:52.282516 28923 phx_glog.cpp:86] Showy(0): PN8phxpaxos9CommitterE::NewValueGetIDNoRetry GetLock ok, use time 0ms
W1116 02:35:52.282620 28923 phx_glog.cpp:74] Imp(0): PN8phxpaxos9CommitCtxE::NewCommit OK, valuesize 135
E1116 02:35:52.293189 28923 phx_glog.cpp:78] ERR(0): PN8phxpaxos9CommitCtxE::GetResult commit fail, ret 14
E1116 02:35:52.293473 28923 phx_glog.cpp:78] proposal fail, ret -2500
E1116 02:35:52.293752 28923 phx_glog.cpp:78] SendMasterInfo set master fail, ret -2500
I1116 02:35:52.294033 28923 phx_glog.cpp:86] GetCurrentInstanceInterval 0x2c2f4f0 get newest instance id 35503177 oldest instanceid 26270944
卡了一天最终返回 14 PaxosTryCommitRet_Conflict 是一个冲突了
请问是什么原因会导致这种死锁?
新跟踪发现: 8001的paxos协议端口,应该不可能有这么大的包吧?要分片了,有一台机子没认这个包,上层没收到,导致选主一直选不通过。
请问这个问题有没有复现的方法呢?
暂时没有重现方法,但我们用的是比较旧的版本,有没有可能是这个bug的原因? https://github.com/Tencent/phxpaxos/releases/tag/v1.1.1
修复了一个重大BUG,某些编译环境下会导致单调时间获取不正确,从而使得设置了CLOCK_MONOTONIC的condition工作不正常,表现为运行卡死,Master无法选举等。如出现类似现象的,建议立即更新此版本。
我也不确定是不是版本问题,复现不了的话有点难查。。你现在用的版本是多少?
具体哪个版本不清楚,是在phxpaxos更新这个bug之前的一个版本,