replication-manager icon indicating copy to clipboard operation
replication-manager copied to clipboard

Unable to trigger the "replication-error-script"

Open peterpan104 opened this issue 9 months ago • 6 comments

I have configured the replication-error-script. When I execute 'stop slave' on the slave, repman changes the slave's state to "Slave Error", but it does not trigger the replication-error-script,how I can trigger it?

peterpan104 avatar May 16 '24 05:05 peterpan104

From the code:

conn, err = server.GetNewDBConn()

It was only processed when the replication manager is not able to connect to the slave

caffeinated92 avatar May 16 '24 07:05 caffeinated92

When I execute "systemctl stop mysql" on slave ,it still doesn't work.

peterpan104 avatar May 16 '24 08:05 peterpan104

can you try verbosing?

caffeinated92 avatar May 16 '24 08:05 caffeinated92

hi @caffeinated92 You mentioned above that when repman can't connect to the slave, it would trigger the "replication-error-script".

When I execute "systemctl stop mysql" on the slave, repman changes the slave's state to Failed, at which point repman can't connect to the slave. However, the "replication-error-script" still does not work.

When the slave service is "Failed" or "Slave Error", I need to trigger a script to modify the slave to prevent it from continuing to provide service. Is this possible?

If you need more information, please let me know.

peterpan104 avatar May 16 '24 08:05 peterpan104

I mean we have verbose switch within log settings. Can you turn it on and let us see what happens when it was failing? Thank you.

caffeinated92 avatar May 16 '24 08:05 caffeinated92

@caffeinated92

The following shows the detailed log:

stop slave -> repman status 'Slave Error'

time="2024-05-16 17:06:07" level=info msg="Server 172.24.47.187:3306 state transition from Slave changed to: SlaveErr" cluster=l-ptest-lab
time="2024-05-16 17:06:07" level=info msg="Set state called from /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/srv_chk.go#107\n" cluster=l-ptest-lab
time="2024-05-16 17:06:07" level=info msg="Server 172.24.47.187:3306 previous state set to: SlaveErr" cluster=l-ptest-lab
time="2024-05-16 17:06:07" level=warning msg="Skip slave in election 172.24.47.187:3306 SQL Thread is stopped" channel=StdOut cluster=l-ptest-lab code=ERR00042 status=OPENED type=state
time="2024-05-16 17:06:07" level=warning msg="No replica found for routing reads" channel=StdOut cluster=l-ptest-lab code=ERR00085 status=OPENED type=state
time="2024-05-16 17:06:07" level=warning msg="No candidates found in slaves list" channel=StdOut cluster=l-ptest-lab code=ERR00032 status=OPENED type=state
time="2024-05-16 17:06:07" level=warning msg="Skip slave in election 172.24.47.187:3306 IO Thread is stopped with valid leader" channel=StdOut cluster=l-ptest-lab code=ERR00087 status=OPENED type=state


start slave -> repman status 'slave'

time="2024-05-16 17:15:21" level=info msg="Server 172.24.47.187:3306 state transition from SlaveErr changed to: Slave" cluster=l-ptest-lab
time="2024-05-16 17:15:21" level=info msg="Set state called from /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/srv_chk.go#151\n" cluster=l-ptest-lab
time="2024-05-16 17:15:21" level=info msg="Server 172.24.47.187:3306 previous state set to: Slave" cluster=l-ptest-lab
time="2024-05-16 17:15:21" level=warning msg="No replica found for routing reads" channel=StdOut cluster=l-ptest-lab code=ERR00085 status=RESOLV type=state
time="2024-05-16 17:15:21" level=warning msg="No candidates found in slaves list" channel=StdOut cluster=l-ptest-lab code=ERR00032 status=RESOLV type=state
time="2024-05-16 17:15:21" level=warning msg="Skip slave in election 172.24.47.187:3306 IO Thread is stopped with valid leader" channel=StdOut cluster=l-ptest-lab code=ERR00087 status=RESOLV type=state
time="2024-05-16 17:15:21" level=warning msg="Skip slave in election 172.24.47.187:3306 SQL Thread is stopped" channel=StdOut cluster=l-ptest-lab code=ERR00042 status=RESOLV type=state

'systemctl stop mariadb' on slave -> repman status 'Failed'

time="2024-05-16 17:17:24" level=info msg="Server 172.24.47.187:3306 state transition from Slave changed to: Suspect" cluster=l-ptest-lab
time="2024-05-16 17:17:24" level=info msg="Set state called from /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/srv.go#425\n" cluster=l-ptest-lab
time="2024-05-16 17:17:24" level=info msg="Server 172.24.47.187:3306 previous state set to: Suspect" cluster=l-ptest-lab
time="2024-05-16 17:17:24" level=warning msg="No candidates found in slaves list" channel=StdOut cluster=l-ptest-lab code=ERR00032 status=OPENED type=state
time="2024-05-16 17:17:24" level=warning msg="Skip slave in election 172.24.47.187:3306 does not ping or has no binlogs" channel=StdOut cluster=l-ptest-lab code=ERR00040 status=OPENED type=state
time="2024-05-16 17:17:32" level=info msg="Declaring replica 172.24.47.187:3306 as failed" cluster=l-ptest-lab
time="2024-05-16 17:17:32" level=info msg="Server 172.24.47.187:3306 state transition from Suspect changed to: Failed" cluster=l-ptest-lab
time="2024-05-16 17:17:32" level=info msg="Set state called from /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/srv.go#412\n" cluster=l-ptest-lab
time="2024-05-16 17:17:32" level=error msg="Server 172.24.47.187:3306 state changed from Suspect to Failed" channel=StdOut cluster=l-ptest-lab type=alert
time="2024-05-16 17:17:32" level=info msg="Server 172.24.47.187:3306 previous state set to: Failed" cluster=l-ptest-lab
time="2024-05-16 17:17:32" level=warning msg="No replica found for routing reads" channel=StdOut cluster=l-ptest-lab code=ERR00085 status=OPENED type=state
time="2024-05-16 17:17:32" level=warning msg="Could not find a slave in topology" channel=StdOut cluster=l-ptest-lab code=ERR00010 status=OPENED type=state
time="2024-05-16 17:17:34" level=warning msg="No compression of binlog on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0056 status=RESOLV type=state
time="2024-05-16 17:17:34" level=warning msg="No semisync settings on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0048 status=RESOLV type=state
time="2024-05-16 17:17:34" level=warning msg="Skip slave in election 172.24.47.187:3306 does not ping or has no binlogs" channel=StdOut cluster=l-ptest-lab code=ERR00040 status=RESOLV type=state
time="2024-05-16 17:17:34" level=warning msg="No GTID strict mode on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0058 status=RESOLV type=state

'systemctl start mariadb' on slave -> repman status 'slave'

time="2024-05-16 17:18:03" level=warning msg="Variable diff:\n [\n \t{\n \t\t\"variableName\": \"EVENT_SCHEDULER\",\n \t\t\"diffValues\": [\n \t\t\t{\n \t\t\t\t\"serverName\": \"172.24.47.186:3306\",\n \t\t\t\t\"variableValue\": \"ON\"\n \t\t\t},\n \t\t\t{\n \t\t\t\t\"serverName\": \"172.24.47.187:3306\",\n \t\t\t\t\"variableValue\": \"OFF\"\n \t\t\t}\n \t\t]\n \t}\n ]" channel=StdOut cluster=l-ptest-lab code=WARN0084 status=RESOLV type=state
time="2024-05-16 17:18:29" level=error msg="Failed to parse binary log datetime string. . Str: . Host: 172.24.47.187:3306 - mysql-bin.000001" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=info msg="Server 172.24.47.187:3306 state transition from Failed changed to: Slave" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=info msg="Set state called from /var/jenkins_home/workspace/go/src/github.com/signal18/replication-manager/cluster/srv_chk.go#151\n" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=info msg="Cancel rejoin, found same leader already from replication 172.24.47.186:3306\t" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=info msg="Server 172.24.47.187:3306 previous state set to: Slave" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=info msg="Can DB Connect creating cookie state:Slave" cluster=l-ptest-lab
time="2024-05-16 17:18:29" level=warning msg="No replica found for routing reads" channel=StdOut cluster=l-ptest-lab code=ERR00085 status=RESOLV type=state
time="2024-05-16 17:18:29" level=warning msg="No candidates found in slaves list" channel=StdOut cluster=l-ptest-lab code=ERR00032 status=RESOLV type=state
time="2024-05-16 17:18:29" level=warning msg="Could not find a slave in topology" channel=StdOut cluster=l-ptest-lab code=ERR00010 status=RESOLV type=state
time="2024-05-16 17:18:29" level=warning msg="No semisync settings on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0048 status=OPENED type=state
time="2024-05-16 17:18:29" level=warning msg="No GTID strict mode on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0058 status=OPENED type=state
time="2024-05-16 17:18:29" level=warning msg="No compression of binlog on slave 172.24.47.187:3306" channel=StdOut cluster=l-ptest-lab code=WARN0056 status=OPENED type=state

peterpan104 avatar May 16 '24 10:05 peterpan104

I think it's not verbose. Can you click the verbose mode to ON?

It should show debug message when verbose is on

caffeinated92 avatar May 16 '24 10:05 caffeinated92

Hi @caffeinated92

There is too much information, so I provide an attachment. 20240517_test_repman_V1.md

If you need more information please let me know. Thanks.

peterpan104 avatar May 17 '24 07:05 peterpan104

Hi @peterpan104 thank you.

What is the value of replication-error-script in toml?

caffeinated92 avatar May 17 '24 07:05 caffeinated92

repman parameter settings are as follows:

replication-error-script = "/data/script/test_rep_err.sh"

peterpan104 avatar May 17 '24 07:05 peterpan104

Hi @peterpan104 Thanks for hinting. I am going to debug this and see if I can fix this.

caffeinated92 avatar May 17 '24 09:05 caffeinated92

Weird. I got nothing about previous state in your full debug log. While in 16th, there are previous state change. Still on checking.

caffeinated92 avatar May 19 '24 06:05 caffeinated92

Hi @caffeinated92 Because I use "grep 'debug'". A log with all the info is provided below,thanks. 20240517_test_repman_V2.md

peterpan104 avatar May 20 '24 02:05 peterpan104

Hi @peterpan104 , we dig into the code and it appears that this script is for specific case that when a slave is restarted and is in replication error it call a user script that may provide reseed backup.

I would suggest to look at alert script that may best feet your need? I'm curious to know what is the use case for you automation in such hook?

svaroqui avatar May 21 '24 08:05 svaroqui

Hi @svaroqui

Thanks for your reply,I have tried the alert script, but it does not work.

I have two machines, and they are each assigned a writer_VIP and reader_VIP to achieve read/write separation.

## status  hostname	     host  		    VIP
   Master  Server1		 172.24.44.10	172.24.46.10(writer_VIP)
   Slave   Server2		 172.24.45.10	172.24.47.10(reader_VIP)

During failover, the failover_script is triggered to add writer_VIP to Server2.

## status  hostname	     host  		    VIP
   Failed  Server1		 172.24.44.10	
   Master  Server2		 172.24.45.10	172.24.46.10(writer_VIP)、172.24.47.10(reader_VIP)

When Server1 recovers, the rejoin_script is triggered to add the reader_VIP to Server1.

## status  hostname	     host  		    VIP
   Slave   Server1		 172.24.44.10	172.24.47.10(reader_VIP)
   Master  Server2		 172.24.45.10	172.24.46.10(writer_VIP)

When the slave replication error or the service goes down, I need a script that can move the reader_VIP to the Master to avoid having no readable machine, and when the slave recovers, the reader_VIP can be moved back. Is this possible?

peterpan104 avatar May 21 '24 09:05 peterpan104

That's what happen when we control haproxy or proxysql . We will work on that like script for transition slave -> (slave_err| salve_late) and (slave_err| later) to slave

svaroqui avatar May 21 '24 12:05 svaroqui

"Thanks for your reply,I have tried the alert script, but it does not work."

cluster.LogModulePrintf(cluster.Conf.Verbose, config.ConstLogModGeneral, "INFO", "Calling alert script")
out, err := exec.Command(cluster.Conf.AlertScript, alert.Cluster, alert.Host, alert.PrevState, alert.State).CombinedOutput()

Can you specify what does not work in your case so far in possible scenario i see a way to move read vip to the new master by testing alert.State and we could pobably extend such alert script for a solution with an open_state_script and close_state_script that would work the same

svaroqui avatar May 21 '24 13:05 svaroqui

Ok i see why it is not trigger

func (cluster *Cluster) CheckAlert(state state.State) {
	if cluster.Conf.MonitoringAlertTrigger == "" {
		return
	}

	// exit even earlier
	if cluster.Conf.MailTo == "" && cluster.Conf.AlertScript == "" {
		return
	}

	if strings.Contains(cluster.Conf.MonitoringAlertTrigger, state.ErrKey) {

svaroqui avatar May 21 '24 13:05 svaroqui

Hello pushed in v2.3.29 ./replication-manager-osc --config=etc/opensvc/cluster-api/cluster-demo/stephane.toml --monitoring-open-state-script="./share/scripts/openstate.sh" --monitoring-close-state-script="./share/scripts/closestate.sh" --log-level=3 monitor

Note: default log level 2 does not log script execution. We need to improve on this as this is pretty important for post mortem analyze of an issue

The error code you should try to catch is ERR00032 this code would let you know that no slave is available for election this also means that no slave is usable like "late io or sql error , failed, miss configured etc ...."

svaroqui avatar May 21 '24 17:05 svaroqui

Hi @svaroqui Thanks for your help,I'll try it. Can you update the Signal18 repositories? Because currently only v2.3.26 can be installed.

peterpan104 avatar May 22 '24 02:05 peterpan104

Hi @svaroqui Thanks for your help,I'll try it. Can you update the Signal18 repositories? Because currently only v2.3.26 can be installed.

Should be good for for deb rpm is building

svaroqui avatar May 22 '24 07:05 svaroqui