
DM: MariaDB master-slave switched twice, sync failed

okJiang opened this issue 11 months ago · 1 comment

What did you do?

MariaDB binlog GTID format: {domain_id}-{server_id}-{seq_no}

gtid_strict_mode = off (gtid_strict_mode behavior validation: http://youdidwhatwithtsql.com/behavior-gtidstrictmode-mariadb/2089/)
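For clarity, here is a minimal Go sketch of that GTID format. The `GTID` struct and `parseGTID` function are illustrative names for this discussion, not DM or go-mysql APIs:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// GTID models a MariaDB GTID of the form {domain_id}-{server_id}-{seq_no}.
type GTID struct {
	DomainID uint32
	ServerID uint32
	SeqNo    uint64
}

// parseGTID splits e.g. "0-101-100" into its three numeric parts.
func parseGTID(s string) (GTID, error) {
	parts := strings.Split(s, "-")
	if len(parts) != 3 {
		return GTID{}, fmt.Errorf("invalid MariaDB GTID: %q", s)
	}
	domain, err := strconv.ParseUint(parts[0], 10, 32)
	if err != nil {
		return GTID{}, err
	}
	server, err := strconv.ParseUint(parts[1], 10, 32)
	if err != nil {
		return GTID{}, err
	}
	seq, err := strconv.ParseUint(parts[2], 10, 64)
	if err != nil {
		return GTID{}, err
	}
	return GTID{uint32(domain), uint32(server), seq}, nil
}

func main() {
	g, _ := parseGTID("0-101-100")
	fmt.Printf("domain=%d server=%d seq=%d\n", g.DomainID, g.ServerID, g.SeqNo)
}
```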

Assumed Example: Data Sync Chain: A -> B -> DM -> TiDB

At time t0:

- A's current latest binlog GTID: 0-101-100
- B has synced A's latest binlog GTID: 0-101-100

At time t1:

- Slave B writes binlog with GTID changes: 0-102-101, 0-102-102

At time t2:

- A writes with GTID change: 0-101-101

At time t2:

- B's backed-up GTID sequence: 0-101-100, 0-102-101, 0-102-102, 0-101-101
- DM synchronization reports an error: 'less than global checkpoint position.'

At time t3 (MariaDB behavior):

After the master-slave switch, B's GTID sequence, from A's consumption perspective:

- 0-101-100
- 0-102-101 (A consumes from B)
- 0-102-102 (A consumes from B)
- 0-101-101 (A skips)

From verification tests, with MariaDB's gtid_strict_mode = off, when a master-slave switch happens between A and B and B is promoted to master, MariaDB pulls binlog from the new master based on its own GTID state, ignoring seq_no comparisons. In the example above, A consumes the data from t1 onward and skips 0-101-101, which it already has.

What did you expect to see?

Sync proceeds normally.

What did you see instead?

'less than global checkpoint position.'

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

6.5.3

Upstream MySQL/MariaDB server version:

10.1.9

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)

Other interesting information (system version, hardware config, etc):


current status of DM cluster (execute query-status <task-name> in dmctl)

No response

okJiang avatar Mar 08 '24 04:03 okJiang

Is this the case of multiple master (writer/primary) nodes? We should set up a separate domain ID for each of them.

https://mariadb.com/kb/en/gtid/#use-with-multi-source-replication-and-other-multi-primary-setups

However, if the multiple masters are caused by failover and never exist simultaneously, DM may need to implement a gtid_strict_mode = off behaviour 🤔. I don't know whether that behaviour would cause more trouble, such as data loss.

lance6716 avatar Mar 09 '24 15:03 lance6716

> Is this the case of multiple master (writer/primary) nodes? We should set up a separate domain ID for each of them.
>
> https://mariadb.com/kb/en/gtid/#use-with-multi-source-replication-and-other-multi-primary-setups
>
> However, if the multiple masters are caused by failover and never exist simultaneously, DM may need to implement a gtid_strict_mode = off behaviour 🤔. I don't know whether that behaviour would cause more trouble, such as data loss.

Yes, the best way is to set a separate domain ID.

But this is also a headache. I learned today that the users who hit this problem are using TDSQL. Internally, TDSQL relies on MariaDB's gtid_strict_mode=off and a shared domain ID to implement master-slave switching, so it is hard for us to push users to change TDSQL's implementation.

I have now added a server-id mapping to MariadbGTIDSet in go-mysql (https://github.com/go-mysql-org/go-mysql/pull/852), and with it my local test case passes (https://github.com/pingcap/tiflow/pull/10753).

A question: since this can be handled successfully at the source, shouldn't we also be able to synchronize it even with the same domain-id and server-id 🤔? @GMHDBJD @lance6716

okJiang avatar Mar 11 '24 11:03 okJiang

> DM may need to implement a gtid_strict_mode = off behaviour 🤔. I don't know whether that behaviour would cause more trouble, such as data loss.

I was worried about this too, so I tried to find another way.

okJiang avatar Mar 12 '24 07:03 okJiang