
[FLINK-38218] Fix MySQL CDC binlog split metadata split transmission

Open morozov opened this issue 7 months ago • 4 comments

The root cause is that the binlog split metadata transfer protocol relies on the order of finished snapshot split infos being stable and matching the order of split assignment (the infos of newly added/snapshotted tables are appended to the end of the list). However, when MySqlSnapshotSplitAssigner is restored from state, assignedSplits is reordered, which breaks this assumption.
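To illustrate the ordering assumption, here is a minimal sketch, not the actual Flink CDC code. The class name, the `restore` helper, and the split ID strings are hypothetical. It only shows that rebuilding the map with an insertion-ordered type keeps splits in assignment order, which a plain `HashMap` does not guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AssignedSplitOrderSketch {

    /**
     * Hypothetical stand-in for restoring assigned splits from state.
     * Each entry is {splitId, tableId}, listed in original assignment order.
     */
    public static Map<String, String> restore(List<String[]> serializedEntries) {
        // LinkedHashMap iterates in insertion order, so the restored map
        // lists splits in the order they were assigned. A plain HashMap
        // makes no such guarantee and may iterate in a different order.
        Map<String, String> assigned = new LinkedHashMap<>();
        for (String[] entry : serializedEntries) {
            assigned.put(entry[0], entry[1]);
        }
        return assigned;
    }

    public static void main(String[] args) {
        List<String[]> state = List.of(
                new String[] {"db.orders:0", "db.orders"},
                new String[] {"db.orders:1", "db.orders"},
                new String[] {"db.users:0", "db.users"}); // newly added table, appended last
        // prints [db.orders:0, db.orders:1, db.users:0]
        System.out.println(new ArrayList<>(restore(state).keySet()));
    }
}
```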

Change summary

  1. Require assigned snapshot splits to be ordered. This isn't strictly necessary to fix the bug but follows directly from the JavaDoc I added to MySqlSnapshotSplitAssigner#assignedSplits. If the order is important, the type should guarantee that it's preserved. Note the changes in the deserialization code: not using an ordered map there while the order is important may cause other hard-to-diagnose issues.
  2. Rely on stable order of assigned splits. Instead of identifying duplicate received split infos by split ID, ignore the first N elements that we know we already have.
  3. Eliminate code duplication in MySqlBinlogSplit constructors. There are currently two constructors where one doesn't call the other. The subsequent commit adds a check that needs to be enforced regardless of which of the constructors was used, so I'm combining them.
  4. Enforce no duplicate finished snapshot split infos in MySqlBinlogSplit. By design, a binlog split cannot contain duplicate finished snapshot split infos. If it does, it was constructed incorrectly. That is a bug, and we want to fail as early as possible.
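Items 2 to 4 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `BinlogSplitSketch`, `appendNew`, and the `batchStart` parameter are hypothetical, and split infos are reduced to plain ID strings. The batch is assumed to start at a known position in the stable global order, so the overlap with what the reader already holds can be skipped by count rather than matched by ID.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BinlogSplitSketch {
    private final List<String> finishedSplitIds;
    private final boolean suspended;

    // Convenience constructor delegates to the canonical one,
    // so the duplicate check below runs on every construction path.
    public BinlogSplitSketch(List<String> finishedSplitIds) {
        this(finishedSplitIds, false);
    }

    public BinlogSplitSketch(List<String> finishedSplitIds, boolean suspended) {
        // Fail fast: duplicates mean the split was assembled incorrectly.
        Set<String> seen = new HashSet<>();
        for (String id : finishedSplitIds) {
            if (!seen.add(id)) {
                throw new IllegalArgumentException(
                        "Duplicate finished snapshot split info: " + id);
            }
        }
        this.finishedSplitIds = List.copyOf(finishedSplitIds);
        this.suspended = suspended;
    }

    public List<String> finishedSplitIds() {
        return finishedSplitIds;
    }

    public boolean isSuspended() {
        return suspended;
    }

    /**
     * Appends only the part of a received batch that is not held yet.
     * batchStart is the assumed position of the batch's first element
     * in the global order; the first (existing.size() - batchStart)
     * elements of the batch are already known and are skipped.
     */
    public static List<String> appendNew(
            List<String> existing, List<String> batch, int batchStart) {
        int skip = Math.max(0, Math.min(batch.size(), existing.size() - batchStart));
        List<String> merged = new ArrayList<>(existing);
        merged.addAll(batch.subList(skip, batch.size()));
        return merged;
    }

    public static void main(String[] args) {
        List<String> have = List.of("t1:0", "t1:1", "t2:0");
        List<String> batch = List.of("t2:0", "t2:1"); // starts at global position 2
        // prints [t1:0, t1:1, t2:0, t2:1]
        System.out.println(appendNew(have, batch, 2));
    }
}
```

Skipping by count is only correct while the order is stable, which is why the sketch pairs it with a constructor that rejects duplicates instead of silently tolerating them.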

morozov avatar Aug 08 '25 21:08 morozov

I'm not sure how to test this. The issue is reproducible if a source is restarted mid-snapshot of a newly added table and then has to consume the changes in the new table from the binlog. Could maintainers recommend an existing test on top of which I could build this?

morozov avatar Aug 08 '25 21:08 morozov

@leonardBang could you restart the tests? The logs are no longer available.

morozov avatar Dec 10 '25 05:12 morozov

Hey @morozov, it looks like Azure cannot re-trigger expired CI. Could you rebase your PR onto the latest master branch to trigger a new CI run?

leonardBang avatar Dec 10 '25 05:12 leonardBang

@leonardBang, after the rebase, the build is green.

morozov avatar Dec 10 '25 16:12 morozov