replication-manager
replication-manager copied to clipboard
Unable to trigger the "replication-error-script"
I have configured the replication-error-script. When I execute 'stop slave' on the slave, repman changes the slave's state to "Slave Error", but it does not trigger the replication-error-script,how I can trigger it?
From the code:
conn, err = server.GetNewDBConn()
It was only processed when the replication manager is not able to connect to the slave
When I execute "systemctl stop mysql" on slave ,it still doesn't work.
can you try verbosing?
hi @caffeinated92 You mentioned above that when repman can't connect to the slave, it would trigger the "replication-error-script".
When I execute "systemctl stop mysql" on the slave, repman changes the slave's state to Failed, at which point repman can't connect to the slave. However, the "replication-error-script" still does not work.
When the slave service is "Failed" or "Slave Error", I need to trigger a script to modify the slave to prevent it from continuing to provide service. Is this possible?
If you need more information, please let me know.
I mean we have verbose switch within log settings. Can you turn it on and let us see what happens when it was failing? Thank you.
@caffeinated92
The following shows the detailed log:
stop slave -> repman status 'Slave Error'
time="2024-05-16 17:06:07" level=info msg="Server 172.24.47.187:3306 state transition from Slave changed to: SlaveErr" cluster=l-ptest-lab
time="2024-05-16 17:06:07" level=info msg="Set state called from /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/srv_chk.go#107\n" cluster=l-ptest-lab
time="2024-05-16 17:06:07" level=info msg="Server 172.24.47.187:3306 previous state set to: SlaveErr" cluster=l-ptest-lab
time="2024-05-16 17:06:07" level=warning msg="Skip slave in election 172.24.47.187:3306 SQL Thread is stopped" channel=StdOut cluster=l-ptest-lab code=ERR00042 status=OPENED type=state
time="2024-05-16 17:06:07" level=warning msg="No replica found for routing reads" channel=StdOut cluster=l-ptest-lab code=ERR00085 status=OPENED type=state
time="2024-05-16 17:06:07" level=warning msg="No candidates found in slaves list" channel=StdOut cluster=l-ptest-lab code=ERR00032 status=OPENED type=state
time="2024-05-16 17:06:07" level=warning msg="Skip slave in election 172.24.47.187:3306 IO Thread is stopped with valid leader" channel=StdOut cluster=l-ptest-lab code=ERR00087 status=OPENED type=state
start slave -> repman status 'slave'
time="2024-05-16 17:15:21" level=info msg="Server 172.24.47.187:3306 state transition from SlaveErr changed to: Slave" cluster=l-ptest-lab
time="2024-05-16 17:15:21" level=info msg="Set state called from /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/srv_chk.go#151\n" cluster=l-ptest-lab
time="2024-05-16 17:15:21" level=info msg="Server 172.24.47.187:3306 previous state set to: Slave" cluster=l-ptest-lab
time="2024-05-16 17:15:21" level=warning msg="No replica found for routing reads" channel=StdOut cluster=l-ptest-lab code=ERR00085 status=RESOLV type=state
time="2024-05-16 17:15:21" level=warning msg="No candidates found in slaves list" channel=StdOut cluster=l-ptest-lab code=ERR00032 status=RESOLV type=state
time="2024-05-16 17:15:21" level=warning msg="Skip slave in election 172.24.47.187:3306 IO Thread is stopped with valid leader" channel=StdOut cluster=l-ptest-lab code=ERR00087 status=RESOLV type=state
time="2024-05-16 17:15:21" level=warning msg="Skip slave in election 172.24.47.187:3306 SQL Thread is stopped" channel=StdOut cluster=l-ptest-lab code=ERR00042 status=RESOLV type=state
'systemctl stop mariadb' on slave -> repman status 'Failed'
time="2024-05-16 17:17:24" level=info msg="Server 172.24.47.187:3306 state transition from Slave changed to: Suspect" cluster=l-ptest-lab
time="2024-05-16 17:17:24" level=info msg="Set state called from /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/srv.go#425\n" cluster=l-ptest-lab
time="2024-05-16 17:17:24" level=info msg="Server 172.24.47.187:3306 previous state set to: Suspect" cluster=l-ptest-lab
time="2024-05-16 17:17:24" level=warning msg="No candidates found in slaves list" channel=StdOut cluster=l-ptest-lab code=ERR00032 status=OPENED type=state
time="2024-05-16 17:17:24" level=warning msg="Skip slave in election 172.24.47.187:3306 does not ping or has no binlogs" channel=StdOut cluster=l-ptest-lab code=ERR00040 status=OPENED type=state
time="2024-05-16 17:17:32" level=info msg="Declaring replica 172.24.47.187:3306 as failed" cluster=l-ptest-lab
time="2024-05-16 17:17:32" level=info msg="Server 172.24.47.187:3306 state transition from Suspect changed to: Failed" cluster=l-ptest-lab
time="2024-05-16 17:17:32" level=info msg="Set state called from /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/srv.go#412\n" cluster=l-ptest-lab
time="2024-05-16 17:17:32" level=error msg="Server 172.24.47.187:3306 state changed from Suspect to Failed" channel=StdOut cluster=l-ptest-lab type=alert
time="2024-05-16 17:17:32" level=info msg="Server 172.24.47.187:3306 previous state set to: Failed" cluster=l-ptest-lab
time="2024-05-16 17:17:32" level=warning msg="No replica found for routing reads" channel=StdOut cluster=l-ptest-lab code=ERR00085 status=OPENED type=state
time="2024-05-16 17:17:32" level=warning msg="Could not find a slave in topology" channel=StdOut cluster=l-ptest-lab code=ERR00010 status=OPENED type=state
time="2024-05-16 17:17:34" level=warning msg="No compression of binlog on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0056 status=RESOLV type=state
time="2024-05-16 17:17:34" level=warning msg="No semisync settings on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0048 status=RESOLV type=state
time="2024-05-16 17:17:34" level=warning msg="Skip slave in election 172.24.47.187:3306 does not ping or has no binlogs" channel=StdOut cluster=l-ptest-lab code=ERR00040 status=RESOLV type=state
time="2024-05-16 17:17:34" level=warning msg="No GTID strict mode on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0058 status=RESOLV type=state
'systemctl start mariadb' on slave -> repman status 'slave'
time="2024-05-16 17:18:03" level=warning msg="Variable diff:\n [\n \t{\n \t\t\"variableName\": \"EVENT_SCHEDULER\",\n \t\t\"diffValues\": [\n \t\t\t{\n \t\t\t\t\"serverName\": \"172.24.47.186:3306\",\n \t\t\t\t\"variableValue\": \"ON\"\n \t\t\t},\n \t\t\t{\n \t\t\t\t\"serverName\": \"172.24.47.187:3306\",\n \t\t\t\t\"variableValue\": \"OFF\"\n \t\t\t}\n \t\t]\n \t}\n ]" channel=StdOut cluster=l-ptest-lab code=WARN0084 status=RESOLV type=state
time="2024-05-16 17:18:29" level=error msg="Failed to parse binary log datetime string. . Str: . Host: 172.24.47.187:3306 - mysql-bin.000001" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=info msg="Server 172.24.47.187:3306 state transition from Failed changed to: Slave" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=info msg="Set state called from /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/srv_chk.go#151\n" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=info msg="Cancel rejoin, found same leader already from replication 172.24.47.186:3306\t" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=info msg="Server 172.24.47.187:3306 previous state set to: Slave" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=info msg="Can DB Connect creating cookie state:Slave" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=warning msg="No replica found for routing reads" channel=StdOut cluster=l-ptest-lab code=ERR00085 status=RESOLV type=state
time="2024-05-16 17:18:29" level=warning msg="No candidates found in slaves list" channel=StdOut cluster=l-ptest-lab code=ERR00032 status=RESOLV type=state
time="2024-05-16 17:18:29" level=warning msg="Could not find a slave in topology" channel=StdOut cluster=l-ptest-lab code=ERR00010 status=RESOLV type=state
time="2024-05-16 17:18:29" level=warning msg="No semisync settings on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0048 status=OPENED type=state
time="2024-05-16 17:18:29" level=warning msg="No GTID strict mode on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0058 status=OPENED type=state
time="2024-05-16 17:18:29" level=warning msg="No compression of binlog on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0056 status=OPENED type=state
I think it's not verbose. Can you click the verbose mode to ON?
It should show debug message when verbose is on
Hi @caffeinated92
There is too much information, so I provide an attachment. 20240517_test_repman_V1.md
If you need more information please let me know. Thanks.
Hi @peterpan104 thank you.
What is the value of replication-error-script in toml?
repman parameter settings are as follows:
replication-error-script = "/data/script/test_rep_err.sh"
Hi @peterpan104 Thanks for hinting. I am going to debug this and see if I can fix this.
Weird. I got nothing about previous state in your full debug log. While in 16th, there are previous state change. Still on checking.
Hi @caffeinated92 Because I use "grep 'debug'". A log with all the info is provided below,thanks. 20240517_test_repman_V2.md
Hi @peterpan104 , we dig into the code and it appears that this script is for specific case that when a slave is restarted and is in replication error it call a user script that may provide reseed backup.
I would suggest to look at alert script that may best feet your need? I'm curious to know what is the use case for you automation in such hook?
Hi @svaroqui
Thanks for your reply,I have tried the alert script, but it does not work.
I have two machines, and they are each assigned a writer_VIP and reader_VIP to achieve read/write separation.
## status hostname host VIP
Master Server1 172.24.44.10 172.24.46.10(writer_VIP)
Slave Server2 172.24.45.10 172.24.47.10(reader_VIP)
During failover, the failover_script is triggered to add writer_VIP to Server2.
## status hostname host VIP
Failed Server1 172.24.44.10
Master Server2 172.24.45.10 172.24.46.10(writer_VIP)、172.24.47.10(reader_VIP)
When Server1 recovers, the rejoin_script is triggered to add the reader_VIP to Server1.
## status hostname host VIP
Slave Server1 172.24.44.10 172.24.47.10(reader_VIP)
Master Server2 172.24.45.10 172.24.46.10(writer_VIP)
When the slave replication error or the service goes down, I need a script that can move the reader_VIP to the Master to avoid having no readable machine, and when the slave recovers, the reader_VIP can be moved back. Is this possible?
That's what happen when we control haproxy or proxysql . We will work on that like script for transition slave -> (slave_err| salve_late) and (slave_err| later) to slave
"Thanks for your reply,I have tried the alert script, but it does not work."
cluster.LogModulePrintf(cluster.Conf.Verbose, config.ConstLogModGeneral, "INFO", "Calling alert script")
out, err := exec.Command(cluster.Conf.AlertScript, alert.Cluster, alert.Host, alert.PrevState, alert.State).CombinedOutput()
Can you specify what does not work in your case so far in possible scenario i see a way to move read vip to the new master by testing alert.State and we could pobably extend such alert script for a solution with an open_state_script and close_state_script that would work the same
Ok i see why it is not trigger
func (cluster *Cluster) CheckAlert(state state.State) {
if cluster.Conf.MonitoringAlertTrigger == "" {
return
}
// exit even earlier
if cluster.Conf.MailTo == "" && cluster.Conf.AlertScript == "" {
return
}
if strings.Contains(cluster.Conf.MonitoringAlertTrigger, state.ErrKey) {
Hello pushed in v2.3.29
./replication-manager-osc --config=etc/opensvc/cluster-api/cluster-demo/stephane.toml --monitoring-open-state-script="./share/scripts/openstate.sh" --monitoring-close-state-script="./share/scripts/closestate.sh" --log-level=3 monitor
Note: default log level 2 does not log script execution. We need to improve on this as this is pretty important for post mortem analyze of an issue
The error code you should try to catch is ERR00032 this code would let you know that no slave is available for election this also means that no slave is usable like "late io or sql error , failed, miss configured etc ...."
Hi @svaroqui Thanks for your help,I'll try it. Can you update the Signal18 repositories? Because currently only v2.3.26 can be installed.
Hi @svaroqui Thanks for your help,I'll try it. Can you update the Signal18 repositories? Because currently only v2.3.26 can be installed.
Should be good for for deb rpm is building