replication-manager
Make Semisync plugins state failable
MySQL: 8.0.35, Replication-Manager: docker signal18/replication-manager:2.3
May I ask why an errant transaction warning appears when writing data to the master (192.168.66.166)? I looped writes to the master and saw the following alarms:
OPENED WARN0091 : Server as errant transaction 192.168.66.167:33001 (this is a slave)
RESOLV WARN0091 : Server as errant transaction 192.168.66.167:33001
How can this problem be solved? Thanks.
cat /app/replication_manager/config/cluster.d/cluster1.toml
[cluster1]
title = "cluster1"
prov-orchestrator = "onpremise"
prov-db-tags = "innodb,noquerycache,slow,pfs,pkg,linux,smallredolog,logtotable"
prov-db-memory = "256"
prov-db-memory-shared-pct = "threads:16,innodb:60,myisam:10,aria:10,rocksdb:1,tokudb:1,s3:1,archive:1,querycache:0"
prov-db-disk-size = "1"
prov-db-cpu-cores = "1"
prov-db-disk-iops = "300"
db-servers-hosts = "192.168.66.166:33001,192.168.66.167:33001"
db-servers-prefered-master = "192.168.66.166:33001"
db-servers-credential = "ha:123456"
db-servers-connect-timeout = 1
replication-credential = "ha:123456"
verbose = false
log-failed-election = true
log-level = 1
log-rotate-max-age = 7
log-rotate-max-backup = 7
log-rotate-max-size = 5
log-sql-in-monitoring = true
log-sst = true
##############
## TOPOLOGY ##
##############
replication-multi-master = false
replication-multi-tier-slave = false
############
# BACKUPS ##
###########
backup-streaming = false
backup-streaming-aws-access-key-id = "admin"
backup-streaming-aws-access-secret = "xxxx"
backup-streaming-endpoint= "https://s3.signal18.io/"
backup-streaming-region= "fr-1"
backup-streaming-bucket= "repman"
backup-restic = false
backup-restic-aws = false
backup-physical-type = "mariabackup"
backup-logical-type = "mysqldump"
backup-restic-aws-access-secret = "xxxx"
backup-restic-password = "xxxx"
backup-restic-binary-path = "/usr/local/bin/restic"
monitoring-scheduler = false
scheduler-db-servers-logical-backup = false
scheduler-db-servers-logical-backup-cron= "0 0 1 * * 6"
scheduler-db-servers-logs = false
scheduler-db-servers-logs-cron = "0 0 * * * *"
scheduler-db-servers-logs-table-keep = 4
scheduler-db-servers-logs-table-rotate = false
scheduler-db-servers-logs-table-rotate-cron = "0 0 0/6 * * *"
scheduler-db-servers-optimize = false
scheduler-db-servers-optimize-cron = "0 0 3 1 * 5"
scheduler-db-servers-physical-backup = false
scheduler-db-servers-physical-backup-cron = "0 0 0 * * *"
##############
## FAILOVER ##
##############
failover-mode = "automatic"
failover-pre-script = "/etc/replication-manager/script/change_old_master"
failover-post-script = "/etc/replication-manager/script/change_new_master"
## Slaves will re enter with read-only
failover-readonly-state = true
failover-event-scheduler = false
failover-event-status = false
## Failover after N failures detection
failover-falsepositive-ping-counter = 5
## Cancel failover if already N failover
## Cancel failover if last failover was N seconds before
## Cancel failover in semi-sync when one slave is not in sync
## Cancel failover if one slave receive master heartbeat
## Cancel failover when replication delay is more than N seconds
failover-limit = 0
failover-time-limit = 300
failover-at-sync = true
failover-max-slave-delay = 30
failover-restart-unsafe = false
# failover-falsepositive-heartbeat = true
# failover-falsepositive-heartbeat-timeout = 3
# failover-falsepositive-maxscale = false
# failover-falsepositive-maxscale-timeout = 14
# failover-falsepositive-external = false
# failover-falsepositive-external-port = 80
################
## SWITCHOVER ##
################
## In switchover Wait N milliseconds before killing long running transactions
## Cancel switchover if transaction running more than N seconds
## Cancel switchover if write query running more than N seconds
## Cancel switchover if one of the slaves is not synced based on GTID equality
switchover-wait-kill = 5000
switchover-wait-trx = 10
switchover-wait-write-query = 10
switchover-at-equal-gtid = false
switchover-at-sync = true
switchover-max-slave-delay = 30
############
## REJOIN ##
############
autorejoin = true
autorejoin-script = ""
autorejoin-semisync = true
autorejoin-backup-binlog = true
autorejoin-flashback = false
autorejoin-mysqldump = false
####################
## CHECKS & FORCE ##
####################
check-replication-filters = true
check-binlog-filters = true
check-replication-state = true
force-slave-heartbeat= false
force-slave-heartbeat-retry = 5
force-slave-heartbeat-time = 3
force-slave-gtid-mode = false
force-slave-semisync = false
force-slave-failover-readonly-state = false
force-binlog-row = false
force-binlog-annotate = false
force-binlog-slowqueries = false
force-binlog-compress = false
force-binlog-checksum = false
force-inmemory-binlog-cache-size = false
force-disk-relaylog-size-limit = false
force-disk-relaylog-size-limit-size = 1000000000
force-sync-binlog = false
force-sync-innodb = false
##############
## MAXSCALE ##
##############
## for 2 nodes cluster maxscale can be driven by replication manager
maxscale = false
maxscale-binlog = false
maxscale-servers = "192.168.0.201"
maxscale-port = 4003
maxscale-user = "admin"
maxscale-pass = "mariadb"
## When true replication manager drive maxscale server state
## Not required unless multiple maxscale or release does not support detect_stale_slave
maxscale-disable-monitor = false
## maxinfo|maxadmin
maxscale-get-info-method = "maxadmin"
maxscale-maxinfo-port = 4002
maxscale-write-port = 4007
maxscale-read-port = 4008
maxscale-read-write-port = 4006
maxscale-binlog-port = 4000
#############
## HAPROXY ##
#############
## Wrapper mode unless maxscale or proxysql required to be located with replication-manager
haproxy = false
haproxy-binary-path = "/usr/sbin/haproxy"
## Read write traffic
## Read only load balance least connection traffic
haproxy-write-port = 3306
haproxy-read-port = 3307
####################
## SHARDING PROXY ##
####################
mdbshardproxy = false
mdbshardproxy-hosts = "127.0.0.1:3306"
mdbshardproxy-user = "root:mariadb"
Hello, we do:

```go
query := "select gtid_subset('" + gtidMaster + "','" + gtidSlave + "') as slave_is_subset"
err := db.QueryRowx(query).Scan(&count)
if err != nil {
	return false, query, err
}
if count == 0 {
	return true, query, nil
}
```

with the parameters cluster.master.Variables["GTID_EXECUTED"] and s.Variables["GTID_EXECUTED"].
Thank you for your reply. This means that when the master performs a large number of writes and the slave is still catching up, even a single delayed transaction makes replication-manager detect different values, and this log is output continuously. That is actually a normal situation; there is only a real problem when the slave has executed transactions that the master has not.
mysql> select gtid_subset('6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488','6ac3854e-d6d5-11ee-8e77-000c294448db:1-80487');
+------------------------------------------------------------------------------------------------------------+
| gtid_subset('6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488','6ac3854e-d6d5-11ee-8e77-000c294448db:1-80487') |
+------------------------------------------------------------------------------------------------------------+
| 0 |
+------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select gtid_subset('6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488','6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488');
+------------------------------------------------------------------------------------------------------------+
| gtid_subset('6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488','6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488') |
+------------------------------------------------------------------------------------------------------------+
| 1 |
+------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
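In other words, gtid_subset(master, slave) returns 0 whenever the slave merely lags behind, while a genuine errant transaction is one the slave executed that the master never did. Below is a small SQL sketch of the check in that direction, reusing the GTID sets from the examples above; it is only an illustration of the idea, not necessarily the exact change made in 2.3.16.

```sql
-- The slave lags the master by one transaction: normal catch-up, not errant.
SELECT GTID_SUBSET('6ac3854e-d6d5-11ee-8e77-000c294448db:1-80487',   -- slave GTID_EXECUTED
                   '6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488')   -- master GTID_EXECUTED
       AS slave_is_subset_of_master;
-- returns 1: every slave transaction also exists on the master

-- List the transactions present on the slave but absent from the master.
SELECT GTID_SUBTRACT('6ac3854e-d6d5-11ee-8e77-000c294448db:1-80487',
                     '6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488')
       AS errant_gtids;
-- returns '' (empty set): nothing errant; a non-empty result would indicate a true errant transaction
```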
I made a tentative fix in 2.3.16. Please let me know if it fixes the issue, and whether the warning correctly appears if you insert on a slave (the purpose of errant trx detection). Can you also confirm you are running repman on Docker, and that you have a way to test this release without the repository?
Also, I have added a parameter to disable the errant trx check; it is enabled by default.
@svaroqui I will test with the rpm and get back to you the day after tomorrow, thanks.
But I discovered a new problem: the semi-synchronous status display is abnormal in 2.3.16. @svaroqui
The correct display should be as shown in the following figure (2.3.14):
mysql> select @@hostname;
+------------+
| @@hostname |
+------------+
| test167 |
+------------+
1 row in set (0.00 sec)
mysql> show status like '%semi%';
+--------------------------------------------+-----------+
| Variable_name | Value |
+--------------------------------------------+-----------+
| Rpl_semi_sync_replica_status | OFF | # Currently OFF, but displayed as ON
| Rpl_semi_sync_source_clients | 1 |
| Rpl_semi_sync_source_net_avg_wait_time | 0 |
| Rpl_semi_sync_source_net_wait_time | 0 |
| Rpl_semi_sync_source_net_waits | 216415 |
| Rpl_semi_sync_source_no_times | 1 |
| Rpl_semi_sync_source_no_tx | 1 |
| Rpl_semi_sync_source_status | ON |
| Rpl_semi_sync_source_timefunc_failures | 0 |
| Rpl_semi_sync_source_tx_avg_wait_time | 1017 |
| Rpl_semi_sync_source_tx_wait_time | 220268301 |
| Rpl_semi_sync_source_tx_waits | 216414 |
| Rpl_semi_sync_source_wait_pos_backtraverse | 0 |
| Rpl_semi_sync_source_wait_sessions | 0 |
| Rpl_semi_sync_source_yes_tx | 216414 |
+--------------------------------------------+-----------+
15 rows in set (0.00 sec)
mysql> show variables like '%semi%';
+---------------------------------------------+------------+
| Variable_name | Value |
+---------------------------------------------+------------+
| rpl_semi_sync_replica_enabled | OFF |
| rpl_semi_sync_replica_trace_level | 32 |
| rpl_semi_sync_source_enabled | ON |
| rpl_semi_sync_source_timeout | 3000 |
| rpl_semi_sync_source_trace_level | 32 |
| rpl_semi_sync_source_wait_for_replica_count | 1 |
| rpl_semi_sync_source_wait_no_replica | ON |
| rpl_semi_sync_source_wait_point | AFTER_SYNC |
+---------------------------------------------+------------+
8 rows in set (0.02 sec)
Do you think it matters that both plugins should be on, or do we have to manage it during failover? Is this a special case for MySQL semisync? Or do you think we wrongly report the state?
I think the semi-synchronous status display is abnormal. The master only has rpl_semi_sync_source_enabled in the ON state, and rpl_semi_sync_replica_enabled = OFF. In 2.3.16 the RepSyn status should not be displayed as enabled on the master, but currently both MstSyn and RepSyn are shown as enabled there. The master should display MstSyn enabled and RepSyn disabled.
I agree, but why don't you enable both plugins on each node so that all nodes can be treated equally and become leader without any orchestration taking place? We could improve this, but it would be a cosmetic change and make more work in failover and rejoin. If we force replication-manager to manage the state of the plugins, we also need to declare which nodes have to use semisync, as a mixed topology with a single candidate leader is very common.
Because some environments require a strict semi-synchronous state, with the slave in a normal semi-sync state, rpl_semi_sync_source_timeout may be set very large to prevent accidents. If the plugins are all turned on, writes may get stuck after switching.
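For illustration, here is a hedged sketch of the per-role plugin state described above, using the standard MySQL 8.0 semisync variable and REPLICA statement names; when and how replication-manager would flip these during failover and rejoin is exactly the open question in this thread.

```sql
-- On the current master: wait for semisync acks, do not act as a replica.
SET GLOBAL rpl_semi_sync_source_enabled  = ON;
SET GLOBAL rpl_semi_sync_replica_enabled = OFF;

-- On the current slave: act as a semisync replica only.
SET GLOBAL rpl_semi_sync_source_enabled  = OFF;
SET GLOBAL rpl_semi_sync_replica_enabled = ON;
STOP REPLICA IO_THREAD;    -- restart the I/O thread so the replica-side change takes effect
START REPLICA IO_THREAD;
```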
OK, thanks for the clarification; it will be done that way.
There is already an existing flag, FailoverSemiSyncState (failover-semisync-state). Did you test it?