DM: MariaDB master-slave switched twice, sync failed
What did you do?
"Mariadb binlog gtid format: {domain_id}{}{server_id}{}{seq_no}
gtid_strict_mode = off (gtid_strict_mode behavior validation: http://youdidwhatwithtsql.com/behavior-gtidstrictmode-mariadb/2089/)
Assumed Example: Data Sync Chain: A -> B -> DM -> TiDB
At time t0:
A's current latest binlog gtid: 0_101_100 B syncs A's latest binlog gtid: 0_101_100 At time t1:
B's slave writes binlog with gtid changes: 0_102_101, 0_102_102 At time t2:
A writes with gtid change: 0_101_101 At time t2:
B backs up gtid sequence: 0_101_100 0_102_101 0_102_102 0_101_101 DM synchronization reports an error: 'less than global checkpoint position.'
At time t3 (Mariadb behavior):
Master-slave switch B's gtid (A consumption scenario):
0_101_100 0_102_101 (A consumes B) 0_102_102 (A consumes B) 0_101_101 (A Skip) From verification tests, in mariadb gtid_strict_mode = off mode, when there is a master-slave switch between A and B, B promotes to master, and Mariadb pulls binlog from the master based on its own gtid seq, ignoring seq_no comparisons. In the above example, it consumes data from t1, skipping data already present in 0_101_101."
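For illustration, here is a minimal, self-contained Go sketch (not DM's actual checkpoint code, only an assumption about the shape of the check) showing why the last event trips a "position must move forward" rule: it tracks the highest seq_no seen per domain_id, the way a naive per-domain checkpoint would, and rejects the replayed 0-101-101.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gtid is a parsed MariaDB GTID: domain_id-server_id-seq_no.
type gtid struct {
	domain, server, seq uint64
}

func parseGTID(s string) gtid {
	parts := strings.Split(s, "-")
	d, _ := strconv.ParseUint(parts[0], 10, 32)
	sv, _ := strconv.ParseUint(parts[1], 10, 32)
	sq, _ := strconv.ParseUint(parts[2], 10, 64)
	return gtid{d, sv, sq}
}

func main() {
	// B's binlog at t2, in the order DM receives it.
	stream := []string{"0-101-100", "0-102-101", "0-102-102", "0-101-101"}

	// Naive checkpoint: highest seq_no seen so far per domain_id,
	// ignoring server_id, so gtid_strict_mode=off interleaving is lost.
	checkpoint := map[uint64]uint64{}

	for _, s := range stream {
		g := parseGTID(s)
		if g.seq <= checkpoint[g.domain] {
			fmt.Printf("%s: less than global checkpoint position (domain %d already at seq_no %d)\n",
				s, g.domain, checkpoint[g.domain])
			continue
		}
		checkpoint[g.domain] = g.seq
		fmt.Printf("%s: applied\n", s)
	}
}
```

The first three events are applied and 0-101-101 is rejected, even though the upstream MariaDB servers themselves (with gtid_strict_mode = off) accept it.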
What did you expect to see?
sync normally
What did you see instead?
'less than global checkpoint position.'
Versions of the cluster
DM version (run `dmctl -V` or `dm-worker -V` or `dm-master -V`):
6.5.3
Upstream MySQL/MariaDB server version:
10.1.9
Downstream TiDB cluster version (execute `SELECT tidb_version();` in a MySQL client):
(paste TiDB cluster version here)
How did you deploy DM: tiup or manually?
(leave TiUP or manually here)
Other interesting information (system version, hardware config, etc):
Current status of DM cluster (execute `query-status <task-name>` in dmctl):
No response
Is this a case of multiple master (writer/primary) nodes? We should set up a separate domain ID for each of them.
https://mariadb.com/kb/en/gtid/#use-with-multi-source-replication-and-other-multi-primary-setups
However, if the multiple masters only arise from failover and never write simultaneously, DM may need to support a gtid_strict_mode = off behaviour 🤔. I don't know whether that behaviour would cause more trouble, like data loss.
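Since separate domain IDs are the recommended setup, it may help to check what each upstream node is actually configured with. A small sketch, assuming the go-sql-driver/mysql driver and a hypothetical DSN (run it against A and B in turn):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Hypothetical DSN; point it at each upstream (A and B) in turn.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// gtid_domain_id and gtid_strict_mode are MariaDB system variables.
	var domainID, strictMode int64
	err = db.QueryRow("SELECT @@gtid_domain_id, @@gtid_strict_mode").
		Scan(&domainID, &strictMode)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("gtid_domain_id=%d gtid_strict_mode=%d\n", domainID, strictMode)
	// A multi-primary setup should report a different gtid_domain_id on
	// each writable node; gtid_strict_mode=0 matches the report above.
}
```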
Yes, the best way is to set separate domain IDs.
But this is also a headache. I learned today that the users who hit this problem are using TDSQL. Internally, TDSQL relies on MariaDB's gtid_strict_mode=off and a shared domain ID to implement master-slave switching. It is difficult for us to push users to change TDSQL's implementation.
Now I have added a server-id mapping in MariadbGTIDSet in go-mysql: https://github.com/go-mysql-org/go-mysql/pull/852. Then I successfully ran my local test case (https://github.com/pingcap/tiflow/pull/10753).
A question: since it can be executed successfully at the source, should we also be able to synchronize it when the domain-id and server-id are the same 🤔? @GMHDBJD @lance6716
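For context, here is a rough, self-contained sketch of the per-server-id bookkeeping idea; the actual MariadbGTIDSet change in the go-mysql PR may differ in detail. Within one domain, the highest seq_no is kept per server_id, so A's 0-101-x events and B's 0-102-x events no longer invalidate each other:

```go
package main

import "fmt"

// perServerSet keeps, for each domain_id, the highest seq_no seen per
// server_id. This is an illustrative sketch of a "server-id mapping"
// inside a MariaDB GTID set, not go-mysql's actual implementation.
type perServerSet map[uint64]map[uint64]uint64 // domain -> server -> max seq_no

func (s perServerSet) update(domain, server, seq uint64) bool {
	servers, ok := s[domain]
	if !ok {
		servers = map[uint64]uint64{}
		s[domain] = servers
	}
	// Only reject events that move backwards for the *same* server_id;
	// with gtid_strict_mode=off, different servers in one domain may
	// interleave seq_no values out of global order.
	if seq <= servers[server] {
		return false
	}
	servers[server] = seq
	return true
}

func main() {
	set := perServerSet{}
	events := []struct{ domain, server, seq uint64 }{
		{0, 101, 100}, {0, 102, 101}, {0, 102, 102}, {0, 101, 101},
	}
	for _, e := range events {
		fmt.Printf("%d-%d-%d accepted=%v\n", e.domain, e.server, e.seq,
			set.update(e.domain, e.server, e.seq))
	}
}
```

With this bookkeeping, all four events from the example above are accepted, matching what the upstream MariaDB chain itself does.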
> DM may need to support a gtid_strict_mode = off behaviour 🤔. I don't know whether that behaviour would cause more trouble, like data loss.

I was worried about this too, so I tried to find other ways.