[FLINK-37065]: MySQL cdc fix method 'fixRestoredGtidSet' to support gt…
This PR closes the FLINK-37065 ticket.
The new fixRestoredGtidSet implementation preserves GTID gaps by merging the server and restored GTID sets interval-by-interval, ensuring no unprocessed transactions are skipped or flattened.
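To make the interval-by-interval idea concrete, here is a minimal sketch for a single server UUID. It is not the code in this PR: the class `GtidIntervalMergeSketch` and its helpers are hypothetical, intervals are plain strings such as `1-2:4:6`, and the rule it assumes (keep only the part of the server set that lies before the earliest restored transaction, then append the restored intervals unchanged) is an illustration of gap-preserving merging rather than a statement of the exact algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink CDC: models one UUID's intervals only.
final class GtidIntervalMergeSketch {

    /** Parses "1-2:4:6" into closed [start, end] pairs; assumes sorted, non-overlapping input. */
    static List<long[]> parse(String intervals) {
        List<long[]> result = new ArrayList<>();
        for (String part : intervals.split(":")) {
            String[] bounds = part.split("-");
            long start = Long.parseLong(bounds[0]);
            long end = bounds.length > 1 ? Long.parseLong(bounds[1]) : start;
            result.add(new long[] {start, end});
        }
        return result;
    }

    /** Formats intervals back into the "1-2:4:6" notation. */
    static String format(List<long[]> intervals) {
        StringBuilder sb = new StringBuilder();
        for (long[] interval : intervals) {
            if (sb.length() > 0) {
                sb.append(':');
            }
            sb.append(interval[0]);
            if (interval[1] > interval[0]) {
                sb.append('-').append(interval[1]);
            }
        }
        return sb.toString();
    }

    /** Assumed rule: server intervals before the earliest restored tx, then restored intervals as-is. */
    static String merge(String server, String restored) {
        List<long[]> restoredIntervals = parse(restored);
        long earliestRestored = restoredIntervals.get(0)[0];
        List<long[]> merged = new ArrayList<>();
        for (long[] s : parse(server)) {
            if (s[0] < earliestRestored) {
                // Clip the server interval so it never covers restored territory.
                merged.add(new long[] {s[0], Math.min(s[1], earliestRestored - 1)});
            }
        }
        for (long[] r : restoredIntervals) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && last[1] + 1 == r[0]) {
                last[1] = r[1]; // adjacent intervals collapse into one
            } else {
                merged.add(new long[] {r[0], r[1]}); // gaps in the restored set are kept
            }
        }
        return format(merged);
    }
}
```

Under these assumptions, `merge("1-10", "1-2:4:6")` keeps the gaps at 3 and 5 instead of flattening the result to `1-6`.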
@ruanhang1993 Would you like to help review this PR?
Hey @ruanhang1993 Just wanted to ask if you could review my PR. Thanks
This pull request has been automatically marked as stale because it has not had recent activity for 120 days. It will be closed in 60 days if no further activity occurs.
@leonardBang @xiaom @ruanhang1993 Can you please review it? I think this is a very important PR, since MySQL CDC can lose or skip data when recovering from a checkpoint.
@leonardBang @xiaom @ruanhang1993 Could you please review this PR?
@mielientiev Thanks for your contribution, I'll review this PR ASAP. Before I start the review, could you kindly rebase onto the latest master branch to fix the currently failing CI cases?
@leonardBang Done, rebased! But now I don't see the status of the pipelines.
@leonardBang could you please review it as well?
Hi @mielientiev, I understand the problem you are trying to solve, but I don't understand why https://github.com/apache/flink-cdc/pull/2220 does not address this issue. As it said:
When MySQL is under a dual-master architecture, GTIDs may have gaps, similar to A:1-102, 105-150.
Such gaps are temporary but will eventually be consistent.
But if the CDC happens to recover from the checkpoint when there are gaps in MySQL's GTID, it may access non-existent transactions, leading to recovery failure.
Is it because the previous analysis was not entirely accurate?
@lvyanquan The analysis is partially correct. MySQL servers can have gaps in their executed transactions (e.g., A:1-102,105-150), and these gaps may or may not get filled later. This isn't a problem.
However, there's something the analysis missed. When MySQL processes transactions in parallel, they can appear out of order in the binlog. For example, even though the server executed transactions A:1-10, the binlog might show them like this when sending to Flink CDC:
-Tx, A:1
-Tx, A:2
-Tx, A:4
-Tx, A:6
-Tx, A:5
-Tx, A:3
-Tx, A:7
-Tx, A:8
-Tx, A:10
-Tx, A:9
Now imagine our application stopped and saved its state after processing transaction A:6. The restored GTID set would be A:1-2:4:6 (we've seen 1, 2, 4, and 6, but not 3 or 5 yet).
When we restart, we need to tell MySQL which transactions we've already processed so it knows what to send next. The problem is that the current code merges these gaps and produces A:1-6 as the restored value. This tells MySQL we've processed ALL transactions from 1 to 6, including 3 and 5. MySQL then skips transactions 3 and 5, causing data loss.
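The loss can be reproduced on paper with a small sketch. The class `FlattenedGtidLossDemo` and its method are hypothetical and model only the sequence numbers of a single UUID: the method takes the binlog order from the example, treats the first `processedCount` entries as checkpointed, and lists the transactions that a flattened range (lowest to highest processed number) would wrongly claim as processed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration, not Flink CDC code.
final class FlattenedGtidLossDemo {

    /**
     * Returns the transaction numbers that fall inside the flattened range
     * [min processed, max processed] but were never actually processed.
     */
    static List<Long> lostAfterFlattening(List<Long> binlogOrder, int processedCount) {
        // Transactions seen before the checkpoint, in binlog (not numeric) order.
        TreeSet<Long> processed = new TreeSet<>(binlogOrder.subList(0, processedCount));
        List<Long> lost = new ArrayList<>();
        for (long tx = processed.first(); tx <= processed.last(); tx++) {
            if (!processed.contains(tx)) {
                lost.add(tx); // reported as processed by the flattened set, so MySQL skips it
            }
        }
        return lost;
    }
}
```

With the binlog order above (1, 2, 4, 6, 5, 3, 7, 8, 10, 9) and a checkpoint after the fourth event, the method returns 3 and 5, the two transactions that a restored value of A:1-6 would cause MySQL to skip.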
I hope it's clear now
Also, @lzshlzsh wrote a great description of this issue here: https://issues.apache.org/jira/browse/FLINK-38183
Can we add another test to demonstrate that such GTIDs can be used to establish a binary log connection and read the intermediate transactions to avoid data loss?
@lvyanquan Sure. I just added a test that uses the binlog connector. Please review it again and let me know what you think.
LGTM. One small problem: you need to add the copyright header to the file MysqlGtidRecoveryTest.java
Error: Failed to execute goal org.apache.rat:apache-rat-plugin:0.12:check (default) on project flink-cdc-parent: Too many files with unapproved license: 1 See RAT report in: /home/runner/work/flink-cdc/flink-cdc/target/rat.txt -> [Help 1]
@lvyanquan Do you know how to retrigger the pipeline? This looks like a false failure.
Hi @ruanhang1993 could you take a look at this?
Thanks @mielientiev @lzshlzsh for the fix.
The remaining unhandled edge cases will be tracked in https://issues.apache.org/jira/browse/FLINK-38380.