orchestrator
orchestrator copied to clipboard
How does The DelayMasterPromotionIfSQLThreadNotUpToDate parameter work ?
Hi, I want to test how orchestrator works when the master is down and all the slaves lagged behand master;
I found that the orchestrator will stop io_thread and sql_thread on candidate replica,so it will never catch up. Could you tell me why, I would be very grateful!
here is my env:
version
orchestrator :3.2.6 MySQL:MySQL Community Server (GPL) 5.6.36-log(one master with two slaves)
orchestrator main config parameters
{
"ReplicationLagQuery": "",
"FailMasterPromotionOnLagMinutes": 0,
"FailMasterPromotionIfSQLThreadNotUpToDate": false,
"DelayMasterPromotionIfSQLThreadNotUpToDate": true,
"ReasonableReplicationLagSeconds": 300,
"RecoveryPeriodBlockSeconds": 3600,
"RecoveryIgnoreHostnameFilters": [],
"RecoverMasterClusterFilters": [
"*"
],
"RecoverIntermediateMasterClusterFilters": [
"*"
],
}
test steps
- use sysbench for stress testing;
- stop all slaves ( stop slave sql_thread);
- wait 3 minutes (within ReasonableReplicationLagSeconds)
- start all slaves(start slave sql_thread ,then all slaves was lagged behand master);
- shutdown master;
- orchestrator promotion one slave to be ideal candidate;
orchestrator recover log
2022-03-01 10:26:04 will handle DeadMaster event on dbpnew84v:3100
2022-03-01 10:26:04 Running 1 PreFailoverProcesses hooks
2022-03-01 10:26:04 Running PreFailoverProcesses hook 1 of 1:
2022-03-01 10:26:04 Completed PreFailoverProcesses hook 1 of 1 in 10.76651ms
2022-03-01 10:26:04 done running PreFailoverProcesses hooks
2022-03-01 10:26:04 RecoverDeadMaster: will recover dbpnew84v:3100
2022-03-01 10:26:04 RecoverDeadMaster: masterRecoveryType=MasterRecoveryGTID
2022-03-01 10:26:04 RecoverDeadMaster: regrouping replicas via GTID
2022-03-01 10:26:15 RecoverDeadMaster: promotedReplicaIsIdeal(dbpnew85v:3100)
2022-03-01 10:26:16 RecoverDeadMaster: found dbpnew85v:3100 to be ideal candidate; will optimize recovery
2022-03-01 10:26:16 RecoverDeadMaster: 1 postponed functions
2022-03-01 10:26:16 promoted replica: dbpnew85v:3100
2022-03-01 10:26:16 RecoverDeadMaster: promoted replica lag seconds: 0
2022-03-01 10:26:16 RecoverDeadMaster: promoted replica sql thread up-to-date: false
2022-03-01 10:26:16 DelayMasterPromotionIfSQLThreadNotUpToDate: waiting for SQL thread on dbpnew85v:3100
2022-03-01 10:31:16 DelayMasterPromotionIfSQLThreadNotUpToDate error: 2022-03-01 10:31:16 ERROR WaitForSQLThreadUpToDate stale coordinates timeout on dbpnew85v:3100 after duration 5m0s
2022-03-01 10:31:16 Running 1 PostUnsuccessfulFailoverProcesses hooks
2022-03-01 10:31:16 Running PostUnsuccessfulFailoverProcesses hook 1 of 1:
2022-03-01 10:31:16 Completed PostUnsuccessfulFailoverProcesses hook 1 of 1 in 14.579948ms
2022-03-01 10:31:16 done running PostUnsuccessfulFailoverProcesses hooks
2022-03-01 10:31:17 Waiting for 1 postponed functions
2022-03-01 10:31:17 Executed 1 postponed functions
2022-03-01 10:31:17 Executed postponed functions: regroup-replicas-gtid dbpnew85v:3100
dbpnew85v general log
866 Connect [email protected] on
866 Query stop slave io_thread
866 Query start slave sql_thread
866 Query stop slave
好像确实有bug
recoverDeadMaster 会调用 RegroupReplicasGTID, RegroupReplicasGTID-> GetCandidateReplica-> sortedReplicasDataCenterHint-> StopReplicas->StopReplicationNicely
StopReplicationNicely会先WaitForSQLThreadUpToDate. 这里WaitForSQLThreadUpToDate超时是InstanceBulkOperationsWaitTimeoutSeconds默认10s
所以你的情况是InstanceBulkOperationsWaitTimeoutSeconds秒超时了. 然后由于超时, err!=nil, 所以return
但是StopReplicas没做任何判断, 还是会执行StopReplication stop slave
到最后checkAndRecoverDeadMaster会执行下面的代码
if config.Config.DelayMasterPromotionIfSQLThreadNotUpToDate && !promotedReplica.SQLThreadUpToDate() {
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("DelayMasterPromotionIfSQLThreadNotUpToDate: waiting for SQL thread on %+v", promotedReplica.Key))
if _, err := inst.WaitForSQLThreadUpToDate(&promotedReplica.Key, 0, 0); err != nil {
return nil, fmt.Errorf("DelayMasterPromotionIfSQLThreadNotUpToDate error: %+v", err)
}
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("DelayMasterPromotionIfSQLThreadNotUpToDate: SQL thread caught up on %+v", promotedReplica.Key))
}
这就蛋疼了, sql_thread已经停了. 所以不管ReasonableReplicationLagSeconds设置多大这个 WaitForSQLThreadUpToDate 最终还是会超时. 这也是下面这条日志输出的原因
DelayMasterPromotionIfSQLThreadNotUpToDate error: 2022-03-01 10:31:16 ERROR WaitForSQLThreadUpToDate stale coordinates timeout on dbpnew85v:3100 after duration 5m0s
graph TD;
S(recoverDeadMaster) --> A
A(RegroupReplicasGTID) --> A1(GetCandidateReplica)
A1 --> B(sortedReplicasDataCenterHint)
B --> B1(StopReplicas<br><并发执行>)
B1 --> |<1>| C(StopReplicationNicely)
C --> C1(stop slave io_thread)
C --> C2(stop slave sql_thread)
C1 --> D
C2 --> D
D(instance.SQLDelay == 0<br><不是延时从库>) --> |timeout=InstanceBulkOperationsWaitTimeoutSeconds=10s| D1(WaitForSQLThreadUpToDate)
D1 --> |由于延时比较大,执行慢,在10s内无法触达SQLThreadUpToDate| E(case <-generalTimer.C<br>触发了10s超时)
E --> |<2>| B1
B1 --> |<3><br>此时io_thread和sql_thread均为OFF| F(StopReplication<br>当StopReplicationNicely返回error时,此时orc认为没关闭成功)
F --> F1(stop slave<br>对所有副本又执行了一遍)
S --> G(Expr:DelayMasterPromotionIfSQLThreadNotUpToDate&&!promotedReplica.SQLThreadUpToDate)
G --> G1(此时io_thread和sql_thread均为OFF)
G1 --> |原实现| H(WaitForSQLThreadUpToDate)
H --> H1(触发ReasonableReplicationLagSeconds超时,无论该参数设置多大)
G1 --> |新实现| I(StartAndWaitForSQLThreadUpToDate)
I --> |start slave sql_thread| I1(WaitForSQLThreadUpToDate)
我用的orchestrator-3.2.6 版本, 在代码里没有搜到你说的 "新实现 StartAndWaitForSQLThreadUpToDate" 这个函数, 只有 原实现WaitForSQLThreadUpToDate函数, 你是用的什么版本? 怎么解决的啊?
我用的orchestrator-3.2.6 版本, 在代码里没有搜到你说的 "新实现 StartAndWaitForSQLThreadUpToDate" 这个函数, 只有 原实现WaitForSQLThreadUpToDate函数, 你是用的什么版本? 怎么解决的啊?
调整逻辑
在执行WaitForSQLThreadUpToDate时,先执行start slave sql_thread
代码贡献
这是我们内部的实现,代码来自@Fanduzi 贡献
改动代码如下:
新封装一个方法StartAndWaitForSQLThreadUpToDate
// StartAndWaitForSQLThreadUpToDate starts a replica
// It will actually START the sql_thread even if the replica is completely stopped.
// And will wait for sql thread up to date
func StartAndWaitForSQLThreadUpToDate(instanceKey *InstanceKey, overallTimeout time.Duration, staleCoordinatesTimeout time.Duration) (*Instance, error) {
instance, err := ReadTopologyInstance(instanceKey)
if err != nil {
return instance, log.Errore(err)
}
if !instance.ReplicationThreadsExist() {
return instance, fmt.Errorf("instance is not a replica: %+v", instanceKey)
}
// start sql_thread but catch any errors
for _, cmd := range []string{`start slave sql_thread`} {
if _, err := ExecInstance(instanceKey, cmd); err != nil {
return nil, log.Errorf("%+v: StartReplicationAndWaitForSQLThreadUpToDate: '%q' failed: %+v", *instanceKey, cmd, err)
}
}
if instance.SQLDelay == 0 {
// Otherwise we don't bother.
if instance, err = WaitForSQLThreadUpToDate(instanceKey, overallTimeout, staleCoordinatesTimeout); err != nil {
return instance, err
}
}
instance, err = ReadTopologyInstance(instanceKey)
log.Infof("Started replication and wait for sql_thread up to date on %+v, Read:%+v, Exec:%+v", *instanceKey, instance.ReadBinlogCoordinates, instance.ExecBinlogCoordinates)
return instance, err
}
引用替换
if config.Config.DelayMasterPromotionIfSQLThreadNotUpToDate && !promotedReplica.SQLThreadUpToDate() {
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("DelayMasterPromotionIfSQLThreadNotUpToDate: waiting for SQL thread on %+v", promotedReplica.Key))
if _, err := inst.StartAndWaitForSQLThreadUpToDate(&promotedReplica.Key, 0, 0); err != nil {
return nil, fmt.Errorf("DelayMasterPromotionIfSQLThreadNotUpToDate error: %+v", err)
}
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("DelayMasterPromotionIfSQLThreadNotUpToDate: SQL thread caught up on %+v", promotedReplica.Key))
}
if config.Config.DelayMasterPromotionIfSQLThreadNotUpToDate {
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("Waiting to ensure the SQL thread catches up on %+v", promotedReplica.Key))
if _, err := inst.StartAndWaitForSQLThreadUpToDate(&promotedReplica.Key, 0, 0); err != nil {
return promotedReplica, lostReplicas, err
}
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("SQL thread caught up on %+v", promotedReplica.Key))
}
完事后,你自己在重新打包下即可
我用的orchestrator-3.2.6 版本, 在代码里没有搜到你说的 "新实现 StartAndWaitForSQLThreadUpToDate" 这个函数, 只有 原实现WaitForSQLThreadUpToDate函数, 你是用的什么版本? 怎么解决的啊?
调整逻辑
在执行WaitForSQLThreadUpToDate时,先执行start slave sql_thread
代码贡献
这是我们内部的实现,代码来自@Fanduzi 贡献
改动代码如下:
新封装一个方法
StartAndWaitForSQLThreadUpToDate
// StartAndWaitForSQLThreadUpToDate starts a replica // It will actually START the sql_thread even if the replica is completely stopped. // And will wait for sql thread up to date func StartAndWaitForSQLThreadUpToDate(instanceKey *InstanceKey, overallTimeout time.Duration, staleCoordinatesTimeout time.Duration) (*Instance, error) { instance, err := ReadTopologyInstance(instanceKey) if err != nil { return instance, log.Errore(err) } if !instance.ReplicationThreadsExist() { return instance, fmt.Errorf("instance is not a replica: %+v", instanceKey) } // start sql_thread but catch any errors for _, cmd := range []string{`start slave sql_thread`} { if _, err := ExecInstance(instanceKey, cmd); err != nil { return nil, log.Errorf("%+v: StartReplicationAndWaitForSQLThreadUpToDate: '%q' failed: %+v", *instanceKey, cmd, err) } } if instance.SQLDelay == 0 { // Otherwise we don't bother. if instance, err = WaitForSQLThreadUpToDate(instanceKey, overallTimeout, staleCoordinatesTimeout); err != nil { return instance, err } } instance, err = ReadTopologyInstance(instanceKey) log.Infof("Started replication and wait for sql_thread up to date on %+v, Read:%+v, Exec:%+v", *instanceKey, instance.ReadBinlogCoordinates, instance.ExecBinlogCoordinates) return instance, err }
引用替换
if config.Config.DelayMasterPromotionIfSQLThreadNotUpToDate && !promotedReplica.SQLThreadUpToDate() { AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("DelayMasterPromotionIfSQLThreadNotUpToDate: waiting for SQL thread on %+v", promotedReplica.Key)) if _, err := inst.StartAndWaitForSQLThreadUpToDate(&promotedReplica.Key, 0, 0); err != nil { return nil, fmt.Errorf("DelayMasterPromotionIfSQLThreadNotUpToDate error: %+v", err) } AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("DelayMasterPromotionIfSQLThreadNotUpToDate: SQL thread caught up on %+v", promotedReplica.Key)) } if config.Config.DelayMasterPromotionIfSQLThreadNotUpToDate { AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("Waiting to ensure the SQL thread catches up on %+v", promotedReplica.Key)) if _, err := inst.StartAndWaitForSQLThreadUpToDate(&promotedReplica.Key, 0, 0); err != nil { return promotedReplica, lostReplicas, err } AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("SQL thread caught up on %+v", promotedReplica.Key)) }
完事后,你自己在重新打包下即可
测试了,没有问题, 感谢大佬 !
@Fanduzi 你好,请教您关于StartAndWaitForSQLThreadUpToDate 部分, if instance.SQLDelay == 0 { // Otherwise we don't bother. if instance, err = WaitForSQLThreadUpToDate(instanceKey, overallTimeout, staleCoordinatesTimeout); err != nil { return instance, err } }
为什么是instance.SQLDelay == 0 延迟等于0 的时候执行呢?我理解的这样在追延迟的时候说不起作用的?
SQLDelay
这个SQLDelay我没记错的话指的是CHANGE MASTER TO中的MASTER_DELAY
是的,感谢老哥