
online replication checksum

Open csuzhangxc opened this issue 5 years ago • 16 comments

Feature Request

Is your feature request related to a problem? Please describe:

Currently, we use sync-diff-inspector to check data consistency between the upstream MySQL and the downstream TiDB, but it can only check data that will not be updated during the checking process (that data may be whole MySQL instances, a database, a table, or part of a table with a range specified in the sync-diff-inspector config).

another similar issue is #688.

Describe the feature you'd like:

Provide a complete online replication checksum feature.

NOTE: a solution that does not need to write any extra data would be better.

Describe alternatives you've considered:

  • Data checksum: compare data between the source and target databases
  • Log checksum: calculate a checksum over the log to ensure that every log entry is applied, regardless of the application result; suitable for log backup and distribution
  • Implement the data checksum and the log checksum in separate programs
    • implement the data checksum in the data validation program
    • implement the log checksum in the log replication program
  • data checksum
    • Divide the data into chunks for checksumming
    • Control the checksumming speed
    • Do the checksum (see the query sketch after this list)
      • write helper data in the source database
      • lock the chunk in the source database, or checksum it multiple times to avoid locking chunks
      • cooperate with the data replication program to achieve an incremental data checksum
    • The data checksum program can run independently, similar to sync-diff-inspector
    • Design interfaces to adapt to multiple source databases
  • log checksum
    • Divide chunks according to log position
    • Calculate the raw log checksum in a separate process
      • MySQL binlog
        • binlog event checksum
        • binlog event count
      • TiCDC/TiKV change log
        • how to do it?
  • additional features
    • Specify the maximum number of records that can fail validation before validation is suspended for the task
    • Specify the delay before reporting any validation failure details
    • Show suspended records: some records in the table can't be validated; for example, if a record at the source is constantly being updated, we can't compare the source and the target
    • Incremental data validation: identify ongoing change logs and validate them row by row on the source and target databases
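
As a concrete illustration of the chunk-based data checksum above, here is a minimal Go sketch (not DM's actual implementation). The DSN, schema, table, columns, and chunk range are hypothetical placeholders; XOR-ing per-row CRC32 values is just one common way to build an order-independent chunk checksum.

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

// chunkChecksum computes a checksum for one chunk of rows by XOR-ing the
// CRC32 of each row's concatenated column values. Schema, table, and column
// names are hypothetical placeholders.
func chunkChecksum(db *sql.DB, lowID, highID int64) (uint64, error) {
	query := `SELECT COALESCE(BIT_XOR(CAST(CRC32(CONCAT_WS(',', id, name, updated_at)) AS UNSIGNED)), 0)
	          FROM test.t WHERE id >= ? AND id < ?`
	var sum uint64
	err := db.QueryRow(query, lowID, highID).Scan(&sum)
	return sum, err
}

func main() {
	// Placeholder DSN; point it at the upstream MySQL or the downstream TiDB.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/test")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// A filter such as `updated_at > ?` could be appended to the WHERE clause
	// to check only part of the table, as discussed later in this thread.
	sum, err := chunkChecksum(db, 0, 10000)
	if err != nil {
		panic(err)
	}
	fmt.Println("chunk checksum:", sum)
}

Running the same query with the same chunk boundaries on the source and the target lets the two results be compared directly.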

Teachability, Documentation, Adoption, Migration Strategy:

csuzhangxc avatar Sep 24 '20 10:09 csuzhangxc

I think the best way to implement this is to use FLUSH TABLES WITH READ LOCK on the source mysql database and then stop replication at the correct GTID in the DM binlog player.

It's a bit fiddly but you can use a process that goes something like this:

  1. Stop the mysql->tidb binlog player
  2. Run FLUSH TABLES WITH READ LOCK on the source mysql and make a note of the GTID
  3. Create N connections (each worker thread needs its own connection) and start a transaction in each
  4. UNLOCK TABLES
  5. Restart the mysql->tidb binlog player, stop it again once it reaches the GTID from step 2 above, and make a note of the TiDB timestamp that this GTID resulted in
  6. Restart the mysql->tidb binlog player
  7. You can now run N parallel checksumming workers, each using one of the connections created in step 3, to diff against a TiDB connection synced to the TiDB timestamp from step 5

This is what Vitess does to implement checksumming during a shard split.
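
To illustrate steps 2-4, here is a minimal Go sketch (not DM or Vitess code) that holds FLUSH TABLES WITH READ LOCK only long enough to record the GTID set and open consistent-snapshot worker connections. The DSN, worker count, and error handling are simplified assumptions.

package main

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	ctx := context.Background()
	// Placeholder DSN for the source MySQL.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Step 2: take the global read lock on a dedicated connection and record
	// the GTID set at the lock point.
	lockConn, err := db.Conn(ctx)
	if err != nil {
		panic(err)
	}
	defer lockConn.Close()
	if _, err := lockConn.ExecContext(ctx, "FLUSH TABLES WITH READ LOCK"); err != nil {
		panic(err)
	}
	var gtid string
	if err := lockConn.QueryRowContext(ctx, "SELECT @@GLOBAL.gtid_executed").Scan(&gtid); err != nil {
		panic(err)
	}

	// Step 3: open N worker connections and pin each one to a consistent
	// snapshot while the lock is still held.
	const n = 4
	workers := make([]*sql.Conn, 0, n)
	for i := 0; i < n; i++ {
		c, err := db.Conn(ctx)
		if err != nil {
			panic(err)
		}
		if _, err := c.ExecContext(ctx, "START TRANSACTION WITH CONSISTENT SNAPSHOT"); err != nil {
			panic(err)
		}
		workers = append(workers, c)
	}

	// Step 4: release the lock; the worker snapshots stay at the lock point.
	if _, err := lockConn.ExecContext(ctx, "UNLOCK TABLES"); err != nil {
		panic(err)
	}
	fmt.Println("snapshot GTID set:", gtid)

	// Steps 5-7 (run the binlog player to this GTID, note the TiDB timestamp,
	// then checksum in parallel over the worker connections) are omitted here.
	_ = workers
}

Opening the snapshot transactions before UNLOCK TABLES is what pins each worker to exactly the data that existed at the recorded GTID.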

tirsen avatar Sep 29 '20 07:09 tirsen

@tirsen 👍 we are planning to implement "stop replication at the correct GTID" (#348 is in our roadmap)

csuzhangxc avatar Sep 29 '20 07:09 csuzhangxc

@tirsen we think your above proposal is really good! We only have a few details to append:

  • may need to use FLUSH TABLES tbl_name [, tbl_name] ... WITH READ LOCK instead of FLUSH TABLES WITH READ LOCK to lock only the tables we are migrating, or even flush & lock only part of them in each checking round to reduce the load on MySQL
  • may need to consider the GC life time of TiDB and abort the check after some timeout
  • when merging shard tables from multiple MySQL instances into one table in TiDB, we may need to stop the binlog player for the first MySQL instance when it reaches its GTID and not re-start it until the binlog player for the last MySQL instance reaches its GTID, so that data in TiDB matches all shards in MySQL
  • may need to abort the check if there is a large replication lag between TiDB and MySQL
  • may need to abort the check if acquiring FTWRL takes too long
  • may still need to support checking only part of the data in a table, filtered by update_time or a similar column

csuzhangxc avatar Sep 30 '20 03:09 csuzhangxc

@tirsen we think your above proposal is really good! We only have a few details to append:

  • may need to use FLUSH TABLES tbl_name [, tbl_name] ... WITH READ LOCK instead of FLUSH TABLES WITH READ LOCK to lock only the tables we are migrating, or even flush & lock only part of them in each checking round to reduce the load on MySQL

Ah yes good point.

Btw you will also need to have an option to use LOCK TABLES ... WITH READ LOCK because Aurora doesn't support FTWRL.

  • may need to consider the GC life time of TiDB and abort the check after some timeout

Yeah, ideally we would increase the GC life time if necessary.

  • when merging shard tables from multiple MySQL instances into one table in TiDB, we may need to stop the binlog player for the first MySQL instance when it reaches its GTID and not re-start it until the binlog player for the last MySQL instance reaches its GTID, so that data in TiDB matches all shards in MySQL

I think each shard needs to be checksummed separately? With a separate GTID/transaction timestamp, separate checksum process and separate checksumming connection.

  • may need to abort the check if there is a large replication lag between TiDB and MySQL

Yes.

  • may need to abort the check if acquiring FTWRL takes too long

Yes. You need to run this during low traffic so that FTWRL completes for all tables. There will be a brief outage during FTWRL so that's not something you want during peak traffic.

  • may still need to support checking only part of the data in a table, filtered by update_time or a similar column

Not sure what you mean here. :-) You mean that we can add a configurable WHERE updated_at > ? clause to the comparison to decrease the set of rows to checksum? Yeah that's a nice feature.

tirsen avatar Sep 30 '20 07:09 tirsen

I think each shard needs to be checksummed separately? With a separate GTID/transaction timestamp, separate checksum process and separate checksumming connection.

when MySQL-1-GTID-A == TiDB-TSO-A, there are rows in TiDB like:

1, a  # from MySQL-1
2, b  # from MySQL-2
3, c  # from MySQL-1

but when MySQL-2-GTID-B == TiDB-TSO-B, there are rows in TiDB like:

1, a  # from MySQL-1
2, b  # from MySQL-2
3, c  # from MySQL-1
4, d  # from MySQL-1
5, e  # from MySQL-2

it's a bit hard to check data consistency for both MySQL-1-GTID-A and MySQL-2-GTID-B by checksum, because there isn't a clear WHERE clause that can match data from MySQL-1 or MySQL-2 in TiDB.

so we may need to block replication for MySQL-1 at GTID-A until MySQL-2 reaches GTID-B, so that rows in TiDB look like:

1, a  # from MySQL-1
2, b  # from MySQL-2
3, c  # from MySQL-1
# no 4, d here
5, e  # from MySQL-2

then we can calculate checksums for MySQL-1 and MySQL-2 separately and XOR them (or combine them in some other way) to compare with the checksum calculated in TiDB.
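
A rough Go sketch of that combine step, assuming each shard's checksum (and TiDB's) is computed over the same columns with the same per-chunk query; the checksum values and helper names below are illustrative only.

package main

import "fmt"

// combineChecksums XORs per-shard checksums so the result can be compared
// with a single checksum computed over the merged table in TiDB. XOR works
// here because the row-level CRC32 values are already combined with BIT_XOR,
// which is order-independent; other commutative combinations would also work.
func combineChecksums(shardSums ...uint64) uint64 {
	var combined uint64
	for _, s := range shardSums {
		combined ^= s
	}
	return combined
}

func main() {
	// Hypothetical checksums computed on MySQL-1 at GTID-A, MySQL-2 at GTID-B,
	// and on TiDB at the TSO where replication was paused for both shards.
	mysql1Sum := uint64(0x1a2b3c4d)
	mysql2Sum := uint64(0x99887766)
	tidbSum := uint64(0x1a2b3c4d ^ 0x99887766)

	if combineChecksums(mysql1Sum, mysql2Sum) == tidbSum {
		fmt.Println("shards and TiDB are consistent at this snapshot")
	} else {
		fmt.Println("checksum mismatch: need row-level comparison")
	}
}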

csuzhangxc avatar Sep 30 '20 08:09 csuzhangxc

You mean that we can add a configurable WHERE updated_at > ? clause to the comparison to decrease the set of rows to checksum?

YES

csuzhangxc avatar Sep 30 '20 08:09 csuzhangxc

Btw you will also need to have an option to use LOCK TABLES ... WITH READ LOCK because Aurora doesn't support FTWRL.

It seems we can't FLUSH TABLES WITH READ LOCK in Aurora (it requires the SUPER privilege), but we can FLUSH TABLES tbl_name [, tbl_name] ... WITH READ LOCK in Aurora (it only needs the RELOAD privilege, not SUPER).

csuzhangxc avatar Sep 30 '20 09:09 csuzhangxc

I saw the above discussion on how to do the data checksum: use FTWRL/LTWRL to get consistent snapshots, and if we need to migrate data from multiple shard tables to one TiDB table, then multiple consistent snapshots are required.

This scheme is very good, but I have some concerns about FTWRL: it has a relatively large impact on the database, and some DBAs I know dare not use it on the master database. This is just a concern, but if data comparisons are frequent, we need to pay attention to it.

In addition, my understanding of a real-time/online replication checksum is that it is an incremental checksum rather than a full data checksum mechanism; the closest approach may be to choose only some chunks to checksum. How to choose chunks is a question worth considering.

Based on the above ideas, I propose an optimistic data checksum scheme: assuming that a chunk will not be updated for a period of time (or longer), we can verify it multiple times.

  • If the verification is consistent, the verification passes
  • If the results are still inconsistent after the maximum number of retries, an error is reported, the data is treated as inconsistent, and we need to use another data checksum method, such as the scheme discussed above

In this way, we may be able to avoid locking tables and controlling the DM replication task
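
A minimal sketch of that optimistic retry loop in Go, assuming a per-chunk checksum function (like the earlier query sketch) exists for both the source and the target; the retry count, wait interval, and function names are hypothetical.

package main

import (
	"errors"
	"time"
)

// checksumFn computes the checksum of one chunk on one database.
type checksumFn func(lowID, highID int64) (uint64, error)

// optimisticCheck compares a chunk on the source and target up to maxRetries
// times. If any round matches, the chunk passes; if every round mismatches,
// the caller should fall back to a locking scheme such as the FTWRL-based one
// discussed earlier in this thread.
func optimisticCheck(source, target checksumFn, lowID, highID int64,
	maxRetries int, wait time.Duration) error {
	for i := 0; i < maxRetries; i++ {
		srcSum, err := source(lowID, highID)
		if err != nil {
			return err
		}
		tgtSum, err := target(lowID, highID)
		if err != nil {
			return err
		}
		if srcSum == tgtSum {
			return nil // chunk verified without any locking
		}
		// The chunk may simply not have been replicated yet; wait and retry.
		time.Sleep(wait)
	}
	return errors.New("chunk still inconsistent after retries; fall back to a locking checksum")
}

func main() {
	// Wire up real checksum functions (e.g. the chunk query sketched above)
	// for the upstream MySQL and downstream TiDB before using this.
	_ = optimisticCheck
}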

IANTHEREAL avatar Oct 09 '20 07:10 IANTHEREAL

I saw the above discussion on how to do the data checksum: use FTWRL/LTWRL to get consistent snapshots, and if we need to migrate data from multiple shard tables to one TiDB table, then multiple consistent snapshots are required.

This scheme is very good, but I have some concerns about FTWRL: it has a relatively large impact on the database, and some DBAs I know dare not use it on the master database. This is just a concern, but if data comparisons are frequent, we need to pay attention to it.

In addition, my understanding of a real-time/online replication checksum is that it is an incremental checksum rather than a full data checksum mechanism; the closest approach may be to choose only some chunks to checksum. How to choose chunks is a question worth considering.

Based on the above ideas, I propose an optimistic data checksum scheme: assuming that a chunk will not be updated for a period of time (or longer), we can verify it multiple times.

  • If the verification is consistent, the verification passes
  • If the results are still inconsistent after the maximum number of retries, an error is reported, the data is treated as inconsistent, and we need to use another data checksum method, such as the scheme discussed above

In this way, we may be able to avoid locking tables and controlling the DM replication task

I love this two-level checksum proposal, what's your opinion? @tirsen

csuzhangxc avatar Oct 09 '20 07:10 csuzhangxc

If we don't lock tables, the upstream and downstream may have some differing data, but not all of the data will differ. So if we split the data into chunks with a good WHERE clause, chunks that contain only cold data could pass verification without locking tables in the "incremental" checksum. And if we switch to the "lock table" checksum and some chunks are still not touched during binlog replication, those chunks could be skipped to shorten the comparison time.

lance6716 avatar Oct 09 '20 08:10 lance6716

Yeah we only run these diffs against standby replicas so FTWRL is not a problem.

Doesn't chunk checksumming require a special updated_at column maintained by the application? That makes it less generally useful...

tirsen avatar Oct 09 '20 10:10 tirsen

Oh I see you just check it a few times until it succeeds? Yeah that might work! That's really nice!

tirsen avatar Oct 09 '20 10:10 tirsen

@tirsen sorry for the late reply! Yep, the optimistic data checksum scheme would try to check it a few times until it succeeds.

I have a project implementation question: do you want to implement it as a more general library, e.g. expose upstream data fetching, chunk splitting, data checksum & comparison, and waiting-for-verification timing as interfaces?
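
If it does become a general library, one possible shape for those interfaces (names and signatures are purely illustrative, not an agreed design) might be:

package validator

import (
	"context"
	"time"
)

// Chunk describes one range of rows to validate; the fields are illustrative.
type Chunk struct {
	Table string
	Where string
	Args  []interface{}
}

// ChunkSplitter divides a table into chunks to be validated.
type ChunkSplitter interface {
	Split(ctx context.Context, table string) ([]Chunk, error)
}

// Fetcher reads a checksum (or rows) for one chunk from one data source,
// e.g. the upstream MySQL or the downstream TiDB.
type Fetcher interface {
	Checksum(ctx context.Context, chunk Chunk) (uint64, error)
}

// Comparer decides whether a chunk is consistent between source and target.
type Comparer interface {
	Compare(ctx context.Context, chunk Chunk, source, target Fetcher) (bool, error)
}

// RetryPolicy controls the "verify multiple times" timing of the optimistic
// data checksum scheme.
type RetryPolicy interface {
	// Next reports whether another verification round should run after the
	// given number of failed attempts, and how long to wait before it.
	Next(failedAttempts int) (retry bool, backoff time.Duration)
}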

IANTHEREAL avatar Oct 13 '20 01:10 IANTHEREAL

There are two ideas proposed above. If other people have no other new ideas, I think we can choose a plan. What do you think? @lance6716 @tirsen @csuzhangxc

IANTHEREAL avatar Oct 13 '20 01:10 IANTHEREAL

I think we need both of them for different scenarios, but we may choose to implement optimistic data checksum first.

csuzhangxc avatar Oct 15 '20 03:10 csuzhangxc

optimistic data checksum sounds like it would certainly work very well for us.

tirsen avatar Jul 27 '21 07:07 tirsen