replication-manager
switchover with a child cluster sets the read_only flag on the child cluster's primary server
We have two different clusters. The first cluster, 'French' (FR1 & FR2), is a primary/replica cluster. The second cluster, 'Europe' (EU1 & EU2), is also a primary/replica cluster, but the 'Europe' primary is additionally a replica of 'French', only for the 'French' subset of data.
When we switch over the French cluster from primary to secondary (from FR1 to FR2, for example) in order to perform maintenance on FR1, the switchover sets the read_only flag on EU1.
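For context, here is a minimal sketch of how such a topology can be wired with MariaDB named multi-source replication; the hostnames, credentials and the connection name 'FR' are assumptions for illustration, not taken from the real setup.

// topology_sketch.go - illustrative only; hostnames, credentials and the
// connection name 'FR' are assumed.
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Connect to EU1, the 'Europe' primary, which must stay read-write.
	db, err := sql.Open("mysql", "repl_admin:secret@tcp(eu1.example.com:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// EU1 additionally pulls the 'French' subset of data from FR1 through a
	// named replication connection (MariaDB multi-source replication).
	if _, err := db.Exec(`CHANGE MASTER 'FR' TO
	    MASTER_HOST='fr1.example.com',
	    MASTER_USER='repl',
	    MASTER_PASSWORD='replpass',
	    MASTER_USE_GTID=slave_pos`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec("START SLAVE 'FR'"); err != nil {
		log.Fatal(err)
	}
}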
replication-manager logs:
time="2024-04-24 07:34:46" level=info msg="Starting master switchover" cluster=cluster_FR_masterslave
time="2024-04-24 07:34:46" level=info msg="Freezing writes set read only on FR-01" cluster=cluster_FR_masterslave
time="2024-04-24 07:34:48" level=info msg="Switching other slaves to the new master" cluster=cluster_FR_masterslave
time="2024-04-24 07:34:48" level=info msg="Waiting for slave EU-01 to sync" cluster=cluster_FR_masterslave
time="2024-04-24 07:34:48" level=info msg="Change master on slave EU-01" cluster=cluster_FR_masterslave
time="2024-04-24 07:34:50" level=info msg="Server EU-01 disable read only as last non slave" cluster=cluster_EU_masterslave
I can't see any message about setting the read_only flag on EU-01 in these logs.
ProxySQL logs:
2024-04-24 07:34:48 [INFO] Server 'EU-01' found with 'read_only=1', but not found as reader
2024-04-24 07:34:50 [INFO] Server 'EU-01' found with 'read_only=0', but not found as writer
The flag is removed a few seconds later, but during that window ProxySQL sees the read_only flag and moves the EU1 server out of the writer role, generating errors for the many applications that try to write to it.
Setting read_only and removing it again so quickly also triggers a race condition on older ProxySQL (2.0), breaking its state machine and forcing a restart of the ProxySQL engine.
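To make the ProxySQL side observable, here is a small sketch that polls the ProxySQL admin interface and prints which hostgroup EU-01 currently sits in; with mysql_replication_hostgroups configured, the monitor moves a server out of the writer hostgroup as soon as it reports read_only=1. The admin address, credentials and the poll interval are assumptions.

// proxysql_watch.go - illustrative sketch, not part of replication-manager.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// ProxySQL admin interface (default port 6032); credentials are assumed.
	admin, err := sql.Open("mysql", "admin:admin@tcp(proxysql.example.com:6032)/")
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	// Poll the runtime server table: a read_only=1 flip on EU-01 shows up here
	// as a move from the writer hostgroup to the reader hostgroup.
	for i := 0; i < 30; i++ {
		rows, err := admin.Query("SELECT hostgroup_id, status FROM runtime_mysql_servers WHERE hostname = 'EU-01'")
		if err != nil {
			log.Fatal(err)
		}
		for rows.Next() {
			var hg int
			var status string
			if err := rows.Scan(&hg, &status); err != nil {
				log.Fatal(err)
			}
			fmt.Printf("%s EU-01 hostgroup=%d status=%s\n", time.Now().Format("15:04:05"), hg, status)
		}
		rows.Close()
		time.Sleep(time.Second)
	}
}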
Perhaps the relevant code is in utils/dbhelper/dbhelper.go:
if cluster.Conf.ReadOnly && cluster.Conf.MxsBinlogOn == false && !cluster.IsInIgnoredReadonly(sl) {
logs, err = sl.SetReadOnly()
cluster.LogSQL(logs, err, sl.URL, "MasterFailover", LvlErr, "Could not set slave %s as read-only, %s", sl.URL, err)
}
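Purely as an illustrative sketch of where a guard could go (IsMasterOfChildCluster is a hypothetical helper, not an existing replication-manager API; the actual changes are in the pull requests referenced at the end of this thread):

// Hypothetical guard: never force read_only on a replica that is itself the
// primary of a child cluster (like EU-01). IsMasterOfChildCluster does not
// exist in replication-manager and only stands in for that check here.
if cluster.Conf.ReadOnly && cluster.Conf.MxsBinlogOn == false && !cluster.IsInIgnoredReadonly(sl) && !sl.IsMasterOfChildCluster() {
	logs, err = sl.SetReadOnly()
	cluster.LogSQL(logs, err, sl.URL, "MasterFailover", LvlErr, "Could not set slave %s as read-only, %s", sl.URL, err)
}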
Can you explain your topology?
time="2024-04-24 07:34:50" level=info msg="Server EU-01 disable read only as last non slave" cluster=cluster_EU_masterslave
If you can, please explain your topology and your expected behavior. Please send us your configurations and logs for further examination. It will help us in assessing the situation. Thank you
The FR cluster is the "First cluster" in the picture; the EU cluster is the second one in the picture.
During a switchover on the FR cluster (which is one of the replication sources for EU), Replication Manager sets the read_only flag on the EU cluster's primary. But the EU primary NEEDS to be read-write at all times.
After the switchover, Replication Manager (for the EU cluster) sees that EU-01 is not read-write and sets it back to read-write. But EU-01 was forced (by mistake) into read_only for about 2 seconds.
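A minimal sketch to measure that window from the outside, assuming a monitoring account and the EU-01 hostname; it polls @@global.read_only during the FR switchover and prints how long the server stayed read-only:

// readonly_window.go - illustrative sketch to measure the read_only window on EU-01.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "monitor:secret@tcp(eu1.example.com:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var start time.Time
	for {
		var ro int
		if err := db.QueryRow("SELECT @@global.read_only").Scan(&ro); err != nil {
			log.Fatal(err)
		}
		switch {
		case ro == 1 && start.IsZero():
			start = time.Now()
			fmt.Println("EU-01 switched to read_only=1")
		case ro == 0 && !start.IsZero():
			fmt.Printf("EU-01 back to read_only=0 after %s\n", time.Since(start).Round(time.Millisecond))
			return
		}
		time.Sleep(200 * time.Millisecond)
	}
}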
Can you show the GUI of the topology? What topology is it detected as? If it was detected as master-slave, then it will try to force EU to read-only.
EU1 is detected as a secondary, but an ignored one.
It is not in the configuration of FR1, but all the replication flows are named, and EU1 is a replica of the FR cluster (flow named 'FR'), so Replication Manager detects it as an ignored secondary.
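From EU1's side, the named flow can be listed with SHOW ALL SLAVES STATUS; this sketch (hostname and credentials assumed) prints the Connection_name and Master_Host of each replication stream, which is presumably what Replication Manager sees when it classifies EU1 as an ignored secondary of the FR cluster.

// show_flows.go - illustrative sketch listing EU1's named replication streams.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "monitor:secret@tcp(eu1.example.com:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// MariaDB returns one row per named replication connection.
	rows, err := db.Query("SHOW ALL SLAVES STATUS")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		log.Fatal(err)
	}
	for rows.Next() {
		// Scan every column as a nullable string, then pick the ones we care about.
		vals := make([]sql.NullString, len(cols))
		ptrs := make([]interface{}, len(cols))
		for i := range vals {
			ptrs[i] = &vals[i]
		}
		if err := rows.Scan(ptrs...); err != nil {
			log.Fatal(err)
		}
		byName := map[string]string{}
		for i, c := range cols {
			byName[c] = vals[i].String
		}
		fmt.Printf("connection=%q master=%s io=%s sql=%s\n",
			byName["Connection_name"], byName["Master_Host"],
			byName["Slave_IO_Running"], byName["Slave_SQL_Running"])
	}
}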
Reconfiguring EU1 to be a replica of FR2 at switchover is the normal way (the child cluster SHOULD be a replica of the FR primary).
The FR cluster is the cluster at the top of this picture; the EU cluster is the cluster at the bottom.
Please show the screenshot of the dashboard of replication manager.
It's not easy: the names are rewritten for obfuscation reasons, and a screenshot would not be obfuscated :'(
We will check on this.
Thank you for your patience.
Some patches have already been made in #569 and #570.
Please reopen if this is not fixed.