flink-cdc icon indicating copy to clipboard operation
flink-cdc copied to clipboard

[FLINK-37065]: MySQL cdc fix method 'fixRestoredGtidSet' to support gt…

Open mielientiev opened this issue 11 months ago • 3 comments

This PR closes: FLINK-37065 ticket.

The new fixRestoredGtidSet implementation preserves GTID gaps by merging the server and restored GTID sets interval-by-interval, ensuring no unprocessed transactions are skipped or flattened.

mielientiev avatar Jan 08 '25 12:01 mielientiev

@ruanhang1993 Would you like to help review this PR?

leonardBang avatar Jan 08 '25 12:01 leonardBang

Hey @ruanhang1993 Just wanted to ask if you could review my PR. Thanks

mielientiev avatar Jan 14 '25 10:01 mielientiev

This pull request has been automatically marked as stale because it has not had recent activity for 120 days. It will be closed in 60 days if no further activity occurs.

github-actions[bot] avatar May 15 '25 00:05 github-actions[bot]

@leonardBang @xiaom @ruanhang1993 Can you please review it? I think this is a very important PR since MySQLCDC can lose/skip data during recovering from the checkpoint.

mielientiev avatar Jun 21 '25 08:06 mielientiev

@leonardBang @xiaom @ruanhang1993 Could you please review this PR?

mielientiev avatar Aug 02 '25 13:08 mielientiev

@mielientiev Thanks for your contribution, I'll review this PR ASAP. Before the review work, could you kindly rebase to latest master branch to fix current CI failed cases?

leonardBang avatar Aug 04 '25 01:08 leonardBang

@leonardBang done - rebased! but now I don't see status of pipelines

mielientiev avatar Aug 05 '25 21:08 mielientiev

@leonardBang could you please review it as well?

mielientiev avatar Aug 06 '25 21:08 mielientiev

Hi @mielientiev, I could understand the problem you are hoping to solve, but I don't understand why https://github.com/apache/flink-cdc/pull/2220 not addressing this issue, as it said:

When MySQL is under a dual-master architecture, GTIDs may have gaps, similar to A:1-102, 105-150. 
Such gaps are temporary but will eventually be consistent. 
But if the CDC happens to recover from the checkpoint when there are gaps in MySQL's GTID, it may access non-existent transactions, leading to recovery failure.

Is it because the previous analysis was not entirely accurate?

lvyanquan avatar Aug 07 '25 12:08 lvyanquan

@lvyanquan The analysis is partially correct. MySQL servers can have gaps in their executed transactions (e.g., A:1-102,105-150), and these gaps may or may not get filled later. This isn't a problem.

However, there's something the analysis missed. When MySQL processes transactions in parallel, they can appear out of order in the binlog. For example, even though the server executed transactions A:1-10, the binlog might show them like this when sending to Flink CDC:

-Tx, A:1
-Tx, A:2
-Tx, A:4
-Tx, A:6
 
-Tx, A:5
-Tx, A:3
-Tx, A:7
-Tx, A:8
-Tx, A:10
-Tx, A:9

Now imagine our application stopped and saved its state after processing transaction A:6. The restored GTID set would be A:1-2:4:6 (we've seen 1, 2, 4, and 6, but not 3 or 5 yet).

When we restart, we need to tell MySQL which transactions we've already processed so it knows what to send next. The problem is that the current code merges these gaps and produces A:1-6 as restored value. This tells MySQL we've processed ALL transactions from 1 to 6, including 3 and 5. MySQL then skips transactions 3 and 5, causing data loss.

I hope it's clear now

Also @lzshlzsh did a great description for this issue here https://issues.apache.org/jira/browse/FLINK-38183

mielientiev avatar Aug 07 '25 17:08 mielientiev

Can we add another test to demonstrate that such GTIDs can be used to establish a binary log connection and read the intermediate transactions to avoid data loss?

lvyanquan avatar Aug 10 '25 02:08 lvyanquan

@lvyanquan sure. Just added the test with binlog connector. Please review it again and let me know what do you think

mielientiev avatar Aug 10 '25 21:08 mielientiev

LGTM. A small problem, you need to add copyright to the file MysqlGtidRecoveryTest.java

Error:  Failed to execute goal org.apache.rat:apache-rat-plugin:0.12:check (default) on project flink-cdc-parent: Too many files with unapproved license: 1 See RAT report in: /home/runner/work/flink-cdc/flink-cdc/target/rat.txt -> [Help 1]

lzshlzsh avatar Aug 11 '25 12:08 lzshlzsh

@lvyanquan Do you know how to retrigger pipeline? seems like false failure

mielientiev avatar Aug 13 '25 18:08 mielientiev

Hi @ruanhang1993 could you take a look at this?

lvyanquan avatar Sep 18 '25 02:09 lvyanquan

Thanks @mielientiev @lzshlzsh for the fix.

lvyanquan avatar Sep 18 '25 09:09 lvyanquan

The remaining unhandled edge cases will be traced in https://issues.apache.org/jira/browse/FLINK-38380.

lvyanquan avatar Sep 18 '25 12:09 lvyanquan