
[mysql-cdc] Fix the snapshot phase hang when BinaryLogClient is reused

Open lzshlzsh opened this issue 2 years ago • 5 comments

Because the callbacks (eventListeners and lifecycleListeners) of BinaryLogClient are kept in lists, and a BinaryLogClient may be reused (see MySqlSplitReader#checkSplitOrStartNext), when multiple snapshot splits are submitted to a SnapshotSplitReader, the callback list still contains the MySqlBinlogSplitReadTask#handleEvent callbacks of snapshot splits that have already been processed. When a binlog event arrives, those stale callbacks are invoked as well, which causes the execute function of the current snapshot split's BackfillBinlogReadTask to end before it receives the BINLOG_END watermark event. As a result, the snapshot phase hangs.
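To illustrate the accumulation described above, here is a minimal, self-contained sketch. It does not use the real BinaryLogClient; `FakeBinlogClient`, `processSplits`, and the `split-N` names are hypothetical stand-ins invented for this example. It only models a reused client whose listeners live in a list, and shows one possible remedy (unregistering the previous split's listener before registering the next), which is an assumption and not necessarily what this PR does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ListenerLeakDemo {

    /** Hypothetical stand-in for a reusable client that stores its callbacks in a list. */
    static class FakeBinlogClient {
        private final List<Consumer<String>> eventListeners = new ArrayList<>();

        void register(Consumer<String> listener) { eventListeners.add(listener); }

        void unregister(Consumer<String> listener) { eventListeners.remove(listener); }

        void dispatch(String event) {
            // Iterate over a copy so listeners may be changed during dispatch.
            for (Consumer<String> listener : new ArrayList<>(eventListeners)) {
                listener.accept(event);
            }
        }
    }

    /**
     * Registers one listener per split on a single reused client, then dispatches
     * one event. Returns the names of the splits whose listeners were invoked.
     * With cleanUp == false, listeners of finished splits stay registered.
     */
    public static List<String> processSplits(int splits, boolean cleanUp) {
        FakeBinlogClient client = new FakeBinlogClient();
        List<String> fired = new ArrayList<>();
        Consumer<String> previous = null;
        for (int i = 0; i < splits; i++) {
            if (cleanUp && previous != null) {
                // Assumed remedy: drop the finished split's listener before reuse.
                client.unregister(previous);
            }
            final String name = "split-" + i;
            Consumer<String> listener = event -> fired.add(name);
            client.register(listener);
            previous = listener;
        }
        client.dispatch("binlog-event");
        return fired;
    }

    public static void main(String[] args) {
        System.out.println(processSplits(3, false)); // [split-0, split-1, split-2]
        System.out.println(processSplits(3, true));  // [split-2]
    }
}
```

Without cleanup, a single event reaches the listeners of all three splits, including the two that have already finished. With cleanup, only the current split's listener runs.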

The following log is from our production environment. It shows multiple MySqlStreamingChangeEventSource (the superclass of MySqlBinlogSplitReadTask) callbacks that belong to different snapshot splits.

io.debezium.connector.mysql.MySqlStreamingChangeEventSource - XXX: eventListeners(7): com.github.shyiko.mysql.binlog.jmx.BinaryLogClientStatistics@61540cca,com.github.shyiko.mysql.binlog.jmx.BinaryLogClientStatistics@352b5758,io.debezium.connector.mysql.MySqlStreamingChangeEventSource$$Lambda$1014/1247290871@703f0cf,io.debezium.connector.mysql.MySqlStreamingChangeEventSource$$Lambda$1015/190751860@5a253136,io.debezium.connector.mysql.MySqlStreamingChangeEventSource$$Lambda$1016/10641269@12fef255,com.github.shyiko.mysql.binlog.jmx.BinaryLogClientStatistics@18c84a61,com.github.shyiko.mysql.binlog.jmx.BinaryLogClientStatistics@55443f, lifecycleListeners(5): com.github.shyiko.mysql.binlog.jmx.BinaryLogClientStatistics@61540cca,com.github.shyiko.mysql.binlog.jmx.BinaryLogClientStatistics@352b5758,io.debezium.connector.mysql.MySqlStreamingChangeEventSource$ReaderThreadLifecycleListener@730a6982,com.github.shyiko.mysql.binlog.jmx.BinaryLogClientStatistics@18c84a61,com.github.shyiko.mysql.binlog.jmx.BinaryLogClientStatistics@55443f

We believe the improper use of the MySQL BinaryLogClient is the root cause of some task hang issues, such as #1156.

lzshlzsh avatar Feb 14 '23 03:02 lzshlzsh

@leonardBang @kylemeow @minchowang Would you help take a look at this problem?

lzshlzsh avatar Feb 14 '23 03:02 lzshlzsh

We just encountered this problem in production. The snapshot stage was stuck, and the problem was resolved after applying this fix. @minchowang

lzshlzsh avatar Feb 22 '23 08:02 lzshlzsh

Thanks @lzshlzsh for the detailed report and fix! I'll review this PR ASAP.

leonardBang avatar Feb 22 '23 13:02 leonardBang

Hi @lzshlzsh, thanks for your contribution! Before this PR can be merged, could you please rebase it onto the latest master branch?

cc @leonardBang @PatrickRen

yuxiqian avatar Apr 26 '24 02:04 yuxiqian

This pull request has been automatically marked as stale because it has not had recent activity for 60 days. It will be closed in 30 days if no further activity occurs.

github-actions[bot] avatar Jul 17 '24 00:07 github-actions[bot]