
DM: MariaDB master-slave switched twice, sync failed

okJiang opened this issue 11 months ago · 1 comment

What did you do?

MariaDB binlog GTID format: {domain_id}-{server_id}-{seq_no}

gtid_strict_mode = off (gtid_strict_mode behavior validation: http://youdidwhatwithtsql.com/behavior-gtidstrictmode-mariadb/2089/)
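For clarity, here is a minimal Go sketch of that GTID format. The `GTID` struct and `parseGTID` function are illustrative names for this discussion, not DM or go-mysql APIs:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// GTID models a MariaDB GTID of the form {domain_id}-{server_id}-{seq_no}.
type GTID struct {
	DomainID uint32
	ServerID uint32
	SeqNo    uint64
}

// parseGTID splits e.g. "0-101-100" into its three numeric parts.
func parseGTID(s string) (GTID, error) {
	parts := strings.Split(s, "-")
	if len(parts) != 3 {
		return GTID{}, fmt.Errorf("invalid MariaDB GTID: %q", s)
	}
	domain, err := strconv.ParseUint(parts[0], 10, 32)
	if err != nil {
		return GTID{}, err
	}
	server, err := strconv.ParseUint(parts[1], 10, 32)
	if err != nil {
		return GTID{}, err
	}
	seq, err := strconv.ParseUint(parts[2], 10, 64)
	if err != nil {
		return GTID{}, err
	}
	return GTID{uint32(domain), uint32(server), seq}, nil
}

func main() {
	g, _ := parseGTID("0-101-100")
	fmt.Printf("domain=%d server=%d seq=%d\n", g.DomainID, g.ServerID, g.SeqNo)
}
```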

Assumed Example: Data Sync Chain: A -> B -> DM -> TiDB

At time t0:

- A's current latest binlog GTID: 0-101-100
- B has synced A's latest binlog GTID: 0-101-100

At time t1:

- Slave B writes binlog with GTID changes: 0-102-101, 0-102-102

At time t2:

- A writes with GTID change: 0-101-101

At time t2:

- B's backed-up GTID sequence: 0-101-100, 0-102-101, 0-102-102, 0-101-101
- DM synchronization reports an error: 'less than global checkpoint position.'

At time t3 (MariaDB behavior):

After the master-slave switch, B's GTID sequence, from A's consumption perspective:

- 0-101-100
- 0-102-101 (A consumes from B)
- 0-102-102 (A consumes from B)
- 0-101-101 (A skips)

From verification tests, with MariaDB's gtid_strict_mode = off, when a master-slave switch happens between A and B and B is promoted to master, MariaDB pulls binlog from the new master based on its own GTID state, ignoring seq_no comparisons. In the example above, A consumes the data from t1 onward and skips 0-101-101, which it already has.

What did you expect to see?

Sync proceeds normally.

What did you see instead?

'less than global checkpoint position.'

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

6.5.3

Upstream MySQL/MariaDB server version:

10.1.9

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)

Other interesting information (system version, hardware config, etc):


current status of DM cluster (execute query-status <task-name> in dmctl)

No response

okJiang avatar Mar 08 '24 04:03 okJiang

Is this the case of multiple master (writer/primary) nodes? We should set up a separate domain ID for each of them.

https://mariadb.com/kb/en/gtid/#use-with-multi-source-replication-and-other-multi-primary-setups

However, if the multiple masters are caused by failover and never exist simultaneously, DM may need to implement a gtid_strict_mode = off behaviour 🤔. I don't know whether that behaviour would cause more trouble, such as data loss.

lance6716 avatar Mar 09 '24 15:03 lance6716

> Is this the case of multiple master (writer/primary) nodes? We should set up a separate domain ID for each of them.
>
> https://mariadb.com/kb/en/gtid/#use-with-multi-source-replication-and-other-multi-primary-setups
>
> However, if the multiple masters are caused by failover and never exist simultaneously, DM may need to implement a gtid_strict_mode = off behaviour 🤔. I don't know whether that behaviour would cause more trouble, such as data loss.

Yes, the best way is to set a separate domain ID.

But this is also a headache. I learned today that the users who hit this problem are using TDSQL. Internally, TDSQL relies on MariaDB's gtid_strict_mode=off and a shared domain ID to implement master-slave switching, so it is hard for us to push users to change TDSQL's implementation.

I have now added a server-id mapping to MariadbGTIDSet in go-mysql (https://github.com/go-mysql-org/go-mysql/pull/852), and with it my local test case passes (https://github.com/pingcap/tiflow/pull/10753).

A question: since this can be handled successfully at the source, shouldn't we also be able to synchronize it even with the same domain-id and server-id 🤔? @GMHDBJD @lance6716

okJiang avatar Mar 11 '24 11:03 okJiang

> DM may need to implement a gtid_strict_mode = off behaviour 🤔. I don't know whether that behaviour would cause more trouble, such as data loss.

I was worried about this too, so I tried to find another way.

okJiang avatar Mar 12 '24 07:03 okJiang