
Make Semisync plugins state failabled

Open dibrother opened this issue 11 months ago • 12 comments

[image]

MySQL: 8.0.35, Replication-Manager: docker signal18/replication-manager:2.3

May I ask why there is an errant-transaction warning when writing data to the main database (192.168.66.166)? I looped writes to the main database and saw the following alarm: `OPENED WARN0091 : Server as errant transaction 192.168.66.167:33001` (this is a slave), then `RESOLV WARN0091 : Server as errant transaction 192.168.66.167:33001`.

May I ask how to solve this problem? Thanks.

cat /app/replication_manager/config/cluster.d/cluster1.toml 
[cluster1]
title = "cluster1"
prov-orchestrator = "onpremise"
prov-db-tags = "innodb,noquerycache,slow,pfs,pkg,linux,smallredolog,logtotable"
prov-db-memory = "256"
prov-db-memory-shared-pct = "threads:16,innodb:60,myisam:10,aria:10,rocksdb:1,tokudb:1,s3:1,archive:1,querycache:0"
prov-db-disk-size = "1"
prov-db-cpu-cores = "1"
prov-db-disk-iops = "300"

db-servers-hosts = "192.168.66.166:33001,192.168.66.167:33001"
db-servers-prefered-master = "192.168.66.166:33001"
db-servers-credential = "ha:123456"
db-servers-connect-timeout = 1
replication-credential = "ha:123456"

verbose = false
log-failed-election  = true
log-level = 1
log-rotate-max-age = 7
log-rotate-max-backup = 7
log-rotate-max-size = 5
log-sql-in-monitoring   = true
log-sst = true

##############
## TOPOLOGY ##
##############

replication-multi-master = false
replication-multi-tier-slave = false

#############
## BACKUPS ##
#############


backup-streaming = false
backup-streaming-aws-access-key-id = "admin"
backup-streaming-aws-access-secret = "xxxx"
backup-streaming-endpoint= "https://s3.signal18.io/"
backup-streaming-region= "fr-1"
backup-streaming-bucket= "repman"

backup-restic = false
backup-restic-aws =  false
backup-physical-type = "mariabackup"
backup-logical-type = "mysqldump"
backup-restic-aws-access-secret = "xxxx"
backup-restic-password = "xxxx"
backup-restic-binary-path = "/usr/local/bin/restic"

monitoring-scheduler = false
scheduler-db-servers-logical-backup  = false
scheduler-db-servers-logical-backup-cron= "0 0 1 * * 6"
scheduler-db-servers-logs   =  false
scheduler-db-servers-logs-cron = "0 0 * * * *"
scheduler-db-servers-logs-table-keep = 4
scheduler-db-servers-logs-table-rotate  = false
scheduler-db-servers-logs-table-rotate-cron = "0 0 0/6 * * *"
scheduler-db-servers-optimize  = false
scheduler-db-servers-optimize-cron = "0 0 3 1 * 5"
scheduler-db-servers-physical-backup = false
scheduler-db-servers-physical-backup-cron = "0 0 0 * * *"

##############
## FAILOVER ##
##############

failover-mode = "automatic"
failover-pre-script = "/etc/replication-manager/script/change_old_master"
failover-post-script = "/etc/replication-manager/script/change_new_master"

## Slaves will re enter with read-only

failover-readonly-state = true
failover-event-scheduler = false
failover-event-status = false

## Failover after N failures detection

failover-falsepositive-ping-counter = 5

## Cancel failover if already N failover
## Cancel failover if last failover was N seconds before
## Cancel failover in semi-sync when one slave is not in sync
## Cancel failover if one slave receive master heartbeat
## Cancel failover when replication delay is more than N seconds

failover-limit = 0
failover-time-limit = 300
failover-at-sync = true
failover-max-slave-delay = 30
failover-restart-unsafe = false

# failover-falsepositive-heartbeat = true
# failover-falsepositive-heartbeat-timeout = 3
# failover-falsepositive-maxscale = false
# failover-falsepositive-maxscale-timeout = 14
# failover-falsepositive-external = false
# failover-falsepositive-external-port = 80

################
## SWITCHOVER ##
################

## In switchover Wait N milliseconds before killing long running transactions
## Cancel switchover if transaction running more than N seconds
## Cancel switchover if write query running more than N seconds
## Cancel switchover if one of the slaves is not synced based on GTID equality

switchover-wait-kill = 5000
switchover-wait-trx = 10
switchover-wait-write-query = 10
switchover-at-equal-gtid = false
switchover-at-sync = true
switchover-max-slave-delay = 30

############
## REJOIN ##
############

autorejoin = true
autorejoin-script = ""
autorejoin-semisync = true
autorejoin-backup-binlog = true
autorejoin-flashback = false
autorejoin-mysqldump = false

####################
## CHECKS & FORCE ##
####################

check-replication-filters = true
check-binlog-filters = true
check-replication-state = true

force-slave-heartbeat= false
force-slave-heartbeat-retry = 5
force-slave-heartbeat-time = 3
force-slave-gtid-mode = false
force-slave-semisync = false
force-slave-failover-readonly-state = false
force-binlog-row = false
force-binlog-annotate = false
force-binlog-slowqueries = false
force-binlog-compress = false
force-binlog-checksum = false
force-inmemory-binlog-cache-size = false
force-disk-relaylog-size-limit = false
force-disk-relaylog-size-limit-size = 1000000000
force-sync-binlog = false
force-sync-innodb = false

##############
## MAXSCALE ##
##############

## for 2 nodes cluster maxscale can be driven by replication manager

maxscale = false
maxscale-binlog = false
maxscale-servers = "192.168.0.201"
maxscale-port = 4003
maxscale-user = "admin"
maxscale-pass = "mariadb"

## When true replication manager drive maxscale server state
## Not required unless multiple maxscale or release does not support detect_stale_slave

maxscale-disable-monitor = false

## maxinfo|maxadmin

maxscale-get-info-method = "maxadmin"
maxscale-maxinfo-port = 4002

maxscale-write-port = 4007
maxscale-read-port = 4008
maxscale-read-write-port = 4006
maxscale-binlog-port = 4000

#############
## HAPROXY ##
#############

## Wrapper mode unless maxscale or proxysql required to be located with replication-manager

haproxy = false
haproxy-binary-path = "/usr/sbin/haproxy"

## Read write traffic
## Read only load balance least connection traffic
haproxy-write-port = 3306
haproxy-read-port = 3307

####################
## SHARDING PROXY ##
####################

mdbshardproxy = false
mdbshardproxy-hosts = "127.0.0.1:3306"
mdbshardproxy-user = "root:mariadb"

dibrother avatar Mar 01 '24 08:03 dibrother

Hello, we do:

```go
query := "select gtid_subset('" + gtidMaster + "','" + gtidSlave + "') as slave_is_subset"

err := db.QueryRowx(query).Scan(&count)
if err != nil {
	return false, query, err
}

if count == 0 {
	return true, query, nil
}
```

With the parameters `cluster.master.Variables["GTID_EXECUTED"]` and `s.Variables["GTID_EXECUTED"]`.

svaroqui avatar Mar 01 '24 08:03 svaroqui

Thank you for your reply. This means that when the master performs a large number of writes and the slave is still catching up, even if just one transaction is delayed, replication-manager detects different values and this log is continuously output. Actually this is a normal situation; there is only a real problem when the slave has more transactions than the master.

mysql> select gtid_subset('6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488','6ac3854e-d6d5-11ee-8e77-000c294448db:1-80487');
+------------------------------------------------------------------------------------------------------------+
| gtid_subset('6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488','6ac3854e-d6d5-11ee-8e77-000c294448db:1-80487') |
+------------------------------------------------------------------------------------------------------------+
|                                                                                                          0 |
+------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select gtid_subset('6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488','6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488');
+------------------------------------------------------------------------------------------------------------+
| gtid_subset('6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488','6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488') |
+------------------------------------------------------------------------------------------------------------+
|                                                                                                          1 |
+------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
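The false positive above can be illustrated without a server. The sketch below reimplements the GTID_SUBSET semantics just for the single-UUID, single-interval sets shown in the SELECTs; `parseGTID` and `isSubset` are hypothetical helpers for illustration, not repman code. It shows that a lagging slave is still a subset of the master, while the master is not a subset of the lagging slave, so the argument order of the check decides whether a merely delayed slave is flagged as errant.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGTID splits a simplified GTID set of the form "uuid:start-end"
// (single UUID, single interval) into its components. Real GTID sets can
// contain many UUIDs and intervals; this only covers the case shown in
// the SELECT examples above.
func parseGTID(s string) (uuid string, start, end int64) {
	parts := strings.SplitN(s, ":", 2)
	uuid = parts[0]
	rng := strings.SplitN(parts[1], "-", 2)
	start, _ = strconv.ParseInt(rng[0], 10, 64)
	end, _ = strconv.ParseInt(rng[1], 10, 64)
	return
}

// isSubset reports whether set a is contained in set b, mirroring what
// MySQL's GTID_SUBSET(a, b) returns for this simplified format.
func isSubset(a, b string) bool {
	ua, sa, ea := parseGTID(a)
	ub, sb, eb := parseGTID(b)
	return ua == ub && sa >= sb && ea <= eb
}

func main() {
	master := "6ac3854e-d6d5-11ee-8e77-000c294448db:1-80488"
	slave := "6ac3854e-d6d5-11ee-8e77-000c294448db:1-80487"

	// A lagging slave is still a subset of the master: not errant.
	fmt.Println(isSubset(slave, master)) // true

	// The master is NOT a subset of the lagging slave, so checking in
	// this direction raises a false positive while the slave catches up.
	fmt.Println(isSubset(master, slave)) // false
}
```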

dibrother avatar Mar 01 '24 10:03 dibrother

I made a tentative fix in 2.3.16; please let me know if it fixes the issue, and whether, when you insert on a slave, the warning correctly appears (the purpose of errant-trx detection). Can you also confirm that you run repman on Docker, and that you have a way to test this release without a repository?

svaroqui avatar Mar 01 '24 11:03 svaroqui

Also, I have added a parameter to disable the errant-trx check; it is enabled by default.

svaroqui avatar Mar 01 '24 12:03 svaroqui

@svaroqui I will test with the rpm from testing and get back to you the day after tomorrow. Thanks.

dibrother avatar Mar 01 '24 13:03 dibrother

[image] After setting the parameter check replication error trx=false, this alarm no longer appeared.

But I discovered a new problem: the semi-synchronous status display is abnormal in [2.3.16]. @svaroqui

The correct display should be as shown in the following figure [2.3.14]: [image]

mysql> select @@hostname;
+------------+
| @@hostname |
+------------+
| test167    |
+------------+
1 row in set (0.00 sec)

mysql> show status like '%semi%';
+--------------------------------------------+-----------+
| Variable_name                              | Value     |
+--------------------------------------------+-----------+
| Rpl_semi_sync_replica_status               | OFF       |                                       #  Currently OFF, but displayed as ON
| Rpl_semi_sync_source_clients               | 1         |
| Rpl_semi_sync_source_net_avg_wait_time     | 0         |
| Rpl_semi_sync_source_net_wait_time         | 0         |
| Rpl_semi_sync_source_net_waits             | 216415    |
| Rpl_semi_sync_source_no_times              | 1         |
| Rpl_semi_sync_source_no_tx                 | 1         |
| Rpl_semi_sync_source_status                | ON        |
| Rpl_semi_sync_source_timefunc_failures     | 0         |
| Rpl_semi_sync_source_tx_avg_wait_time      | 1017      |
| Rpl_semi_sync_source_tx_wait_time          | 220268301 |
| Rpl_semi_sync_source_tx_waits              | 216414    |
| Rpl_semi_sync_source_wait_pos_backtraverse | 0         |
| Rpl_semi_sync_source_wait_sessions         | 0         |
| Rpl_semi_sync_source_yes_tx                | 216414    |
+--------------------------------------------+-----------+
15 rows in set (0.00 sec)

mysql> show variables like '%semi%';
+---------------------------------------------+------------+
| Variable_name                               | Value      |
+---------------------------------------------+------------+
| rpl_semi_sync_replica_enabled               | OFF        |
| rpl_semi_sync_replica_trace_level           | 32         |
| rpl_semi_sync_source_enabled                | ON         |
| rpl_semi_sync_source_timeout                | 3000       |
| rpl_semi_sync_source_trace_level            | 32         |
| rpl_semi_sync_source_wait_for_replica_count | 1          |
| rpl_semi_sync_source_wait_no_replica        | ON         |
| rpl_semi_sync_source_wait_point             | AFTER_SYNC |
+---------------------------------------------+------------+
8 rows in set (0.02 sec)
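The display bug can be summarized as deriving RepSyn from the wrong variables. The sketch below computes what the UI should show for a node from the MySQL 8.0 status/variable names that appear in the SHOW output above; `semisyncState` is a hypothetical helper (not repman code), and the map stands in for a real SHOW GLOBAL STATUS/VARIABLES result.

```go
package main

import "fmt"

// semisyncState derives the per-role semisync state a monitor should
// display. A node is a semisync master only if the source plugin is
// enabled AND its status is ON; same logic on the replica side.
// Reporting RepSyn from the source-side values would show both as
// enabled on the master, which is the symptom described above.
func semisyncState(vars map[string]string) (mstSyn, repSyn bool) {
	mstSyn = vars["rpl_semi_sync_source_enabled"] == "ON" &&
		vars["Rpl_semi_sync_source_status"] == "ON"
	repSyn = vars["rpl_semi_sync_replica_enabled"] == "ON" &&
		vars["Rpl_semi_sync_replica_status"] == "ON"
	return
}

func main() {
	// Values copied from the master (test167) output above.
	master := map[string]string{
		"rpl_semi_sync_source_enabled":  "ON",
		"Rpl_semi_sync_source_status":   "ON",
		"rpl_semi_sync_replica_enabled": "OFF",
		"Rpl_semi_sync_replica_status":  "OFF",
	}
	mst, rep := semisyncState(master)
	fmt.Printf("MstSyn=%v RepSyn=%v\n", mst, rep)
}
```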

dibrother avatar Mar 03 '24 16:03 dibrother

Do you think it matters that both plugins are on, or do we have to manage it during failover? Is this a special case for MySQL semisync? Or do you think we wrongly report the state?

svaroqui avatar Mar 06 '24 17:03 svaroqui

I think the semi-synchronous status display is abnormal. On the master, only rpl_semi_sync_source_enabled is in the ON state, and rpl_semi_sync_replica_enabled = OFF. In [2.3.16] the RepSyn status should not be displayed as enabled on the master, yet currently both MstSyn and RepSyn are shown as enabled there. The master should display MstSyn enabled and RepSyn disabled.

dibrother avatar Mar 07 '24 00:03 dibrother

I agree, but why don't you enable both plugins on each node, so that all nodes can be treated as equals and become leader without any orchestration taking place? We could improve this, but it would be a cosmetic change and create more work in failover and rejoin. If we force replication-manager to manage the state of the plugins, we also need to declare which nodes have to use semisync, as mixed topologies with a single candidate leader are very common.

svaroqui avatar Mar 07 '24 07:03 svaroqui

Because some environments require a semi-synchronous master together with a slave in a normal semi-sync state, and to prevent accidents rpl_semi_sync_source_timeout may be set very large. If both plugins are turned on everywhere, writes may get stuck after a switchover.
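One way a manager could handle this per role: toggle the plugins at failover time so that only the new master waits for acks and only replicas run the replica plugin. The SET GLOBAL statements below use the standard MySQL 8.0 semisync variables shown earlier in this thread; the `semisyncSwitchSQL` helper itself is a hypothetical sketch, not repman code.

```go
package main

import "fmt"

// semisyncSwitchSQL returns the statements a manager could run on a node
// during failover, depending on its new role. Only the new master keeps
// the source plugin enabled; demoted or joining nodes keep only the
// replica plugin, avoiding the "both enabled" state discussed above.
func semisyncSwitchSQL(isNewMaster bool) []string {
	if isNewMaster {
		return []string{
			"SET GLOBAL rpl_semi_sync_replica_enabled = OFF",
			"SET GLOBAL rpl_semi_sync_source_enabled = ON",
		}
	}
	return []string{
		"SET GLOBAL rpl_semi_sync_source_enabled = OFF",
		"SET GLOBAL rpl_semi_sync_replica_enabled = ON",
	}
}

func main() {
	// Statements for the promoted node.
	for _, q := range semisyncSwitchSQL(true) {
		fmt.Println(q)
	}
}
```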

dibrother avatar Mar 11 '24 02:03 dibrother

Ok, thanks for the clarification, it will be done.

svaroqui avatar Mar 11 '24 09:03 svaroqui

There is already an existing flag FailoverSemiSyncState (failover-semisync-state). Did you test it?

svaroqui avatar Mar 11 '24 09:03 svaroqui