phxsql
phxsql copied to clipboard
coredump in ReplicationTransfer::CountCheckSum()
今天遇到执行ReplicationTransfer::CountCheckSum()的时候binlogsvr产生coredump 查看coredump和代码以及日志,发现是checksum不一致,然后assert()导致coredump
void ReplicationTransfer::CountCheckSum(const vector
请问什么情况下会导致checksum不一致,这种情况下可以通过屏蔽assert来避免coredump吗?如果这样做会有后果?是否会导致数据不一致?
这里通常有几种情况
- MySQL的数据乱了,与phxbinlogsvr的数据不一致
- 在一台机器上多个phxbinlogsvr同时启动
正常来说重启mysql或者检查是否有多个phxbinlogsvr实例同时运行,能解决该问题。
第二点不会发生,第一点mysql的数据乱了,这个在什么情况下会出现;
这是coredump的打印,能帮忙看下么
E1028 09:32:20.007217 12324 phx_glog.cpp:82] IsGTIDCompleted get gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088 max gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088, master start W1028 09:32:20.007452 12324 phx_glog.cpp:78] SetMaster set master from not a master node E1028 09:32:20.007571 12324 phx_glog.cpp:82] GetCurrentInstanceInterval 0x25aaa90 get newest instance id 97485 oldest instanceid 0 E1028 09:32:20.152326 12322 phx_glog.cpp:82] Showy: PN8phxpaxos12MessageEventE::OnRead read buflen 1456 small than except len 7712 E1028 09:32:20.152690 12315 phx_glog.cpp:82] Showy(0): PN8phxpaxos8InstanceE::OnReceivePaxosMsg Now.InstanceID 97492 Msg.InstanceID 97492 MsgType 3 Msg.from_nodeid 14370554362999086913 My.nodeid 6460931784547770177 Seen.LatestInstanceID 97492 I1028 09:32:20.152942 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8InstanceE::ChecksumLogic my last checksum 1999798557 other last checksum 1999798557 I1028 09:32:20.152981 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8AcceptorE::OnAccept START Msg.InstanceID 97492 Msg.from_nodeid 14370554362999086913 Msg.ProposalID 1 Msg.ValueLen 7657 E1028 09:32:20.153522 12315 phx_glog.cpp:82] Showy(0): PN8phxpaxos8LogStoreE::Append ok, offset 88980225 fileid 600 checksum 2531157007 instanceid 97492 buffer size 7696 usetime 0ms sync 1 E1028 09:32:20.153861 12315 phx_glog.cpp:82] Showy(0): PN8phxpaxos13AcceptorStateE::Persist GroupIdx 0 InstanceID 97492 PromiseID 1 PromiseNodeID 14370554362999086913 AccectpedID 1 AcceptedNodeID 14370554362999086913 ValueLen 7657 Checksum 2036450670 I1028 09:32:20.154088 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8AcceptorE::OnAccept END Now.InstanceID 97492 ReplyNodeID 14370554362999086913 E1028 09:32:20.156757 12315 phx_glog.cpp:82] Showy(0): PN8phxpaxos8InstanceE::OnReceivePaxosMsg Now.InstanceID 97492 Msg.InstanceID 97492 MsgType 7 Msg.from_nodeid 14370554362999086913 My.nodeid 6460931784547770177 Seen.LatestInstanceID 97492 I1028 09:32:20.156981 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8InstanceE::ChecksumLogic my last checksum 1999798557 other last checksum 1999798557 I1028 09:32:20.156997 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos7LearnerE::OnProposerSendSuccess START Msg.InstanceID 97492 Now.InstanceID 97492 Msg.ProposalID 1 State.AcceptedID 1 State.AcceptedNodeID 14370554362999086913, Msg.from_nodeid 14370554362999086913 I1028 09:32:20.157006 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos7LearnerE::OnProposerSendSuccess END Learn value OK, value 7657 E1028 09:32:20.157097 12315 phx_glog.cpp:82] SMExecute gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089 local checksum 14369328657726606551 new checksum 14369328657726606551, check done E1028 09:32:20.157438 12315 phx_glog.cpp:82] AddEvent write data gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089 instance id 97492 checksum 14369328657726606551 ret 0 file EVENTDATA-85592 run 0 ms E1028 09:32:20.157676 12315 phx_glog.cpp:82] save event send gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088 buffer gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089 value size 7544 check sum 14369328657726606551, new checksum 14369328657726606551 run 0 ms E1028 09:32:20.157697 6710 phx_glog.cpp:82] GetEvents no data E1028 09:32:20.157699 27130 phx_glog.cpp:82] GetEvents no data I1028 09:32:20.157915 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8InstanceE::ReceiveMsgForLearner [Learned] New paxos starting, Now.Proposer.InstanceID 97492 Now.Acceptor.InstanceID 97492 Now.Learner.InstanceID 97492 E1028 09:32:20.158164 6710 phx_glog.cpp:82] GetEvent get gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089 ret 0 I1028 09:32:20.158371 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8InstanceE::ReceiveMsgForLearner [Learned] Checksum change, last checksum 1999798557 new checksum 2036450670 I1028 09:32:20.158608 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8InstanceE::ReceiveMsgForLearner [Learned] New paxos instance has started, Now.Proposer.InstanceID 97493 Now.Acceptor.InstanceID 97493 Now.Learner.InstanceID 97493 E1028 09:32:20.158643 27130 phx_glog.cpp:82] GetEvent get gtid ret 1 E1028 09:32:20.158893 27130 phx_glog.cpp:82] no data wait E1028 09:32:20.159162 6710 phx_glog.cpp:82] GetEvent get gtid ret 1 E1028 09:32:20.159590 6710 phx_glog.cpp:82] CountCheckSum get data check sum 11906315221574228622, max gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089, gtid check sum 14369328657726606551 data size 7544 W1028 09:32:20.159844 6710 phx_glog.cpp:78] CountCheckSum get data check sum 11906315221574228622, check fail, should be 14369328657726606551 W1028 09:32:20.159975 6710 phx_glog.cpp:78] CountCheckSum get check sum 6186361405985353124 W1028 09:32:20.160091 6710 phx_glog.cpp:78] CountCheckSum get check sum 331030192843701277 W1028 09:32:20.160225 6710 phx_glog.cpp:78] CountCheckSum get check sum 14025773150915059130 W1028 09:32:20.160383 6710 phx_glog.cpp:78] CountCheckSum get check sum 14616285687051518798 W1028 09:32:20.160508 6710 phx_glog.cpp:78] CountCheckSum get check sum 11906315221574228622
@weipatty 重启phxbinlogsvr可以恢复正常运行吗?
@cjcchen phxbinlogsvr进程crash之后,手工拉起进程,看上去正常,但是偶尔还会发生crash。次数很少,目前正在尝试重现。还有就是发生crash的这个节点,和其他两个节点的网络延时大概为30ms,另外两个节点之间的网络延时为1ms多,会不会和网络延时以及timeout参数配置有关?
跟踪了下,这次的check_sum计算是和上次的check_sum有关的
uint64_t Utils::GetCheckSum(const uint64_t &old_checksum, const char *value, const size_t &value_size) { uint64_t checksum = old_checksum; for (size_t i = 0; i < value_size; ++i) { checksum = checksum * 257 + value[i]; } return checksum; }
然后ReplicationTransfer这个类的私有变量 checksum_不知什么原因和上次的不一样了,
E1028 09:31:39.145347 12315 phx_glog.cpp:82] save event send gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769087 buffer gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088 value size 346 check sum 1476019619464914925, new checksum 1476019619464914925 run 0 ms E1028 09:31:39.147161 27130 phx_glog.cpp:82] CountCheckSum get data check sum 1476019619464914925, max gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088, gtid check sum 1476019619464914925 data size 346 E1028 09:32:20.157676 12315 phx_glog.cpp:82] save event send gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088 buffer gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089 value size 7544 check sum 14369328657726606551, new checksum 14369328657726606551 run 0 ms E1028 09:32:20.159590 6710 phx_glog.cpp:82] CountCheckSum get data check sum 11906315221574228622, max gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089, gtid check sum 14369328657726606551 data size 7544 W1028 09:32:20.159844 6710 phx_glog.cpp:78] CountCheckSum get data check sum 11906315221574228622, check fail, should be 14369328657726606551 W1028 09:32:20.159975 6710 phx_glog.cpp:78] CountCheckSum get check sum 6186361405985353124 W1028 09:32:20.160091 6710 phx_glog.cpp:78] CountCheckSum get check sum 331030192843701277 W1028 09:32:20.160225 6710 phx_glog.cpp:78] CountCheckSum get check sum 14025773150915059130 W1028 09:32:20.160383 6710 phx_glog.cpp:78] CountCheckSum get check sum 14616285687051518798 W1028 09:32:20.160508 6710 phx_glog.cpp:78] CountCheckSum get check sum 11906315221574228622
如上
1086b42c-9b86-11e6-8901-38eaa715405c:769088, 是check sum 1476019619464914925
1086b42c-9b86-11e6-8901-38eaa715405c:769089, 是check sum 14369328657726606551
上一次的checksum放在checksum_for_log[0] 根据最后的日志打印是6186361405985353124, 按理应该是1476019619464914925吧?
不知什么原因 这个值被改了 @cjcchen