phxsql icon indicating copy to clipboard operation
phxsql copied to clipboard

coredump in ReplicationTransfer::CountCheckSum()

Open yurongliao opened this issue 8 years ago • 6 comments

今天遇到执行ReplicationTransfer::CountCheckSum()的时候binlogsvr产生coredump 查看coredump和代码以及日志,发现是checksum不一致,然后assert()导致coredump

void ReplicationTransfer::CountCheckSum(const vector &event_list, const string &max_gtid) { ..... if (checksum_ != info.checksum()) { ColorLogError("%s get data check sum %llu, check fail, should be %llu", func, checksum_, info.checksum()); for (size_t i = 0; i < checksum_for_log.size(); ++i) ColorLogError("%s get check sum %llu", func, checksum_for_log[i]); checksum_for_log.push_back(checksum_); STATISTICS(ReplCheckSumFail()); assert(checksum_ == info.checksum()); ...... }

请问什么情况下会导致checksum不一致,这种情况下可以通过屏蔽assert来避免coredump吗?如果这样做会有后果?是否会导致数据不一致?

yurongliao avatar Oct 28 '16 02:10 yurongliao

这里通常有几种情况

  1. MySQL的数据乱了,与phxbinlogsvr的数据不一致
  2. 在一台机器上多个phxbinlogsvr同时启动

正常来说重启mysql或者检查是否有多个phxbinlogsvr实例同时运行,能解决该问题。

cjcchen avatar Oct 28 '16 07:10 cjcchen

第二点不会发生,第一点mysql的数据乱了,这个在什么情况下会出现;

stevenwake avatar Oct 28 '16 10:10 stevenwake

这是coredump的打印,能帮忙看下么

E1028 09:32:20.007217 12324 phx_glog.cpp:82] IsGTIDCompleted get gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088 max gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088, master start W1028 09:32:20.007452 12324 phx_glog.cpp:78] SetMaster set master from not a master node E1028 09:32:20.007571 12324 phx_glog.cpp:82] GetCurrentInstanceInterval 0x25aaa90 get newest instance id 97485 oldest instanceid 0 E1028 09:32:20.152326 12322 phx_glog.cpp:82] Showy: PN8phxpaxos12MessageEventE::OnRead read buflen 1456 small than except len 7712 E1028 09:32:20.152690 12315 phx_glog.cpp:82] Showy(0): PN8phxpaxos8InstanceE::OnReceivePaxosMsg Now.InstanceID 97492 Msg.InstanceID 97492 MsgType 3 Msg.from_nodeid 14370554362999086913 My.nodeid 6460931784547770177 Seen.LatestInstanceID 97492 I1028 09:32:20.152942 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8InstanceE::ChecksumLogic my last checksum 1999798557 other last checksum 1999798557 I1028 09:32:20.152981 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8AcceptorE::OnAccept START Msg.InstanceID 97492 Msg.from_nodeid 14370554362999086913 Msg.ProposalID 1 Msg.ValueLen 7657 E1028 09:32:20.153522 12315 phx_glog.cpp:82] Showy(0): PN8phxpaxos8LogStoreE::Append ok, offset 88980225 fileid 600 checksum 2531157007 instanceid 97492 buffer size 7696 usetime 0ms sync 1 E1028 09:32:20.153861 12315 phx_glog.cpp:82] Showy(0): PN8phxpaxos13AcceptorStateE::Persist GroupIdx 0 InstanceID 97492 PromiseID 1 PromiseNodeID 14370554362999086913 AccectpedID 1 AcceptedNodeID 14370554362999086913 ValueLen 7657 Checksum 2036450670 I1028 09:32:20.154088 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8AcceptorE::OnAccept END Now.InstanceID 97492 ReplyNodeID 14370554362999086913 E1028 09:32:20.156757 12315 phx_glog.cpp:82] Showy(0): PN8phxpaxos8InstanceE::OnReceivePaxosMsg Now.InstanceID 97492 Msg.InstanceID 97492 MsgType 7 Msg.from_nodeid 14370554362999086913 My.nodeid 6460931784547770177 Seen.LatestInstanceID 97492 I1028 09:32:20.156981 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8InstanceE::ChecksumLogic my last checksum 1999798557 other last checksum 1999798557 I1028 09:32:20.156997 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos7LearnerE::OnProposerSendSuccess START Msg.InstanceID 97492 Now.InstanceID 97492 Msg.ProposalID 1 State.AcceptedID 1 State.AcceptedNodeID 14370554362999086913, Msg.from_nodeid 14370554362999086913 I1028 09:32:20.157006 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos7LearnerE::OnProposerSendSuccess END Learn value OK, value 7657 E1028 09:32:20.157097 12315 phx_glog.cpp:82] SMExecute gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089 local checksum 14369328657726606551 new checksum 14369328657726606551, check done E1028 09:32:20.157438 12315 phx_glog.cpp:82] AddEvent write data gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089 instance id 97492 checksum 14369328657726606551 ret 0 file EVENTDATA-85592 run 0 ms E1028 09:32:20.157676 12315 phx_glog.cpp:82] save event send gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088 buffer gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089 value size 7544 check sum 14369328657726606551, new checksum 14369328657726606551 run 0 ms E1028 09:32:20.157697 6710 phx_glog.cpp:82] GetEvents no data E1028 09:32:20.157699 27130 phx_glog.cpp:82] GetEvents no data I1028 09:32:20.157915 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8InstanceE::ReceiveMsgForLearner [Learned] New paxos starting, Now.Proposer.InstanceID 97492 Now.Acceptor.InstanceID 97492 Now.Learner.InstanceID 97492 E1028 09:32:20.158164 6710 phx_glog.cpp:82] GetEvent get gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089 ret 0 I1028 09:32:20.158371 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8InstanceE::ReceiveMsgForLearner [Learned] Checksum change, last checksum 1999798557 new checksum 2036450670 I1028 09:32:20.158608 12315 phx_glog.cpp:74] Imp(0): PN8phxpaxos8InstanceE::ReceiveMsgForLearner [Learned] New paxos instance has started, Now.Proposer.InstanceID 97493 Now.Acceptor.InstanceID 97493 Now.Learner.InstanceID 97493 E1028 09:32:20.158643 27130 phx_glog.cpp:82] GetEvent get gtid ret 1 E1028 09:32:20.158893 27130 phx_glog.cpp:82] no data wait E1028 09:32:20.159162 6710 phx_glog.cpp:82] GetEvent get gtid ret 1 E1028 09:32:20.159590 6710 phx_glog.cpp:82] CountCheckSum get data check sum 11906315221574228622, max gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089, gtid check sum 14369328657726606551 data size 7544 W1028 09:32:20.159844 6710 phx_glog.cpp:78] CountCheckSum get data check sum 11906315221574228622, check fail, should be 14369328657726606551 W1028 09:32:20.159975 6710 phx_glog.cpp:78] CountCheckSum get check sum 6186361405985353124 W1028 09:32:20.160091 6710 phx_glog.cpp:78] CountCheckSum get check sum 331030192843701277 W1028 09:32:20.160225 6710 phx_glog.cpp:78] CountCheckSum get check sum 14025773150915059130 W1028 09:32:20.160383 6710 phx_glog.cpp:78] CountCheckSum get check sum 14616285687051518798 W1028 09:32:20.160508 6710 phx_glog.cpp:78] CountCheckSum get check sum 11906315221574228622

weipatty avatar Oct 28 '16 10:10 weipatty

@weipatty 重启phxbinlogsvr可以恢复正常运行吗?

cjcchen avatar Oct 28 '16 22:10 cjcchen

@cjcchen phxbinlogsvr进程crash之后,手工拉起进程,看上去正常,但是偶尔还会发生crash。次数很少,目前正在尝试重现。还有就是发生crash的这个节点,和其他两个节点的网络延时大概为30ms,另外两个节点之间的网络延时为1ms多,会不会和网络延时以及timeout参数配置有关?

yurongliao avatar Oct 29 '16 03:10 yurongliao

跟踪了下,这次的check_sum计算是和上次的check_sum有关的 uint64_t Utils::GetCheckSum(const uint64_t &old_checksum, const char *value, const size_t &value_size) { uint64_t checksum = old_checksum; for (size_t i = 0; i < value_size; ++i) { checksum = checksum * 257 + value[i]; } return checksum; }

然后ReplicationTransfer这个类的私有变量 checksum_不知什么原因和上次的不一样了,

E1028 09:31:39.145347 12315 phx_glog.cpp:82] save event send gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769087 buffer gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088 value size 346 check sum 1476019619464914925, new checksum 1476019619464914925 run 0 ms E1028 09:31:39.147161 27130 phx_glog.cpp:82] CountCheckSum get data check sum 1476019619464914925, max gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088, gtid check sum 1476019619464914925 data size 346 E1028 09:32:20.157676 12315 phx_glog.cpp:82] save event send gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769088 buffer gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089 value size 7544 check sum 14369328657726606551, new checksum 14369328657726606551 run 0 ms E1028 09:32:20.159590 6710 phx_glog.cpp:82] CountCheckSum get data check sum 11906315221574228622, max gtid 1086b42c-9b86-11e6-8901-38eaa715405c:769089, gtid check sum 14369328657726606551 data size 7544 W1028 09:32:20.159844 6710 phx_glog.cpp:78] CountCheckSum get data check sum 11906315221574228622, check fail, should be 14369328657726606551 W1028 09:32:20.159975 6710 phx_glog.cpp:78] CountCheckSum get check sum 6186361405985353124 W1028 09:32:20.160091 6710 phx_glog.cpp:78] CountCheckSum get check sum 331030192843701277 W1028 09:32:20.160225 6710 phx_glog.cpp:78] CountCheckSum get check sum 14025773150915059130 W1028 09:32:20.160383 6710 phx_glog.cpp:78] CountCheckSum get check sum 14616285687051518798 W1028 09:32:20.160508 6710 phx_glog.cpp:78] CountCheckSum get check sum 11906315221574228622

如上

1086b42c-9b86-11e6-8901-38eaa715405c:769088, 是check sum 1476019619464914925

1086b42c-9b86-11e6-8901-38eaa715405c:769089, 是check sum 14369328657726606551

上一次的checksum放在checksum_for_log[0] 根据最后的日志打印是6186361405985353124, 按理应该是1476019619464914925吧?

不知什么原因 这个值被改了 @cjcchen

weipatty avatar Oct 29 '16 09:10 weipatty