flink-cdc icon indicating copy to clipboard operation
flink-cdc copied to clipboard

[mysql]fix duplicate split request for newly added table

Open yujiaxinlong opened this issue 2 years ago • 5 comments

related to #1149

I debugged for a while and find the reason of stucking at the second split(not 100% percent second , could be third or later but always stuck) is that it triggers two split requests on one subtask at the same time.

When adding a new table to an existed flinkcdc task , if the old table has finished snapshot phase and has been consuming binlog. It will read some binlog splits before finding there is new table during task restart.

Here when a binlog split finished in MysqlSourceReader#onSplitFinished(...), it actually activate next split twice.

  1. SuspendBinlogReaderAckEvent -> WakeupReaderEvent -> context.sendSplitRequest()
  2. directly context.sendSplitRequest().

Which lead to two snapshot splits handled by one subtask , one split fetcher and one same debezium BinaryLogClient at the same time.

This BinaryLogClient mostly reach an EOF before the high watermark of the second split here, so the binlogSplitReadTask for the second split will never trigger an binlog end watermark event, which eventually lead to snapshotreader hangs forever.

The WakeupReaderEvent for snapshotsplit seems not needed. Currently I just removed the context.sendSplitRequest() here and keep this event for future usage. And this works fine in my case.

yujiaxinlong avatar May 07 '22 09:05 yujiaxinlong

@leonardBang @PatrickRen there should be something wrong with oracle connector or it's test case, past several unrelated PRs all failed on this.

yujiaxinlong avatar May 07 '22 11:05 yujiaxinlong

+1, the fix solved our online problem

lzshlzsh avatar May 19 '22 03:05 lzshlzsh

Would you explain how to reproduce the bug more detailed ? I have tried tens times according to #1149 , but could not reproduce.

You say: "Which lead to two snapshot splits handled by one subtask , one split fetcher and one same debezium BinaryLogClient at the same time."

There is no problem that multiple snapshot splits handled by one subtask, which happens when decrease parallelism(e.g 4 to 1) in snapshot stage.

lzshlzsh avatar May 19 '22 17:05 lzshlzsh

@lzshlzsh
I think multiple splits indeed can be handled by one subtask, but should not at the same time.

The problem here is that it starts to handle the second split before finishing first one. What I observe here is that the second split keeps waiting for it's watermark end event , which only happens when currentBinlogOffset.isAtOrAfter(binlogSplit.getEndingOffset()) see MySqlBinlogSplitReadTask#handleEvent(Event event). But it never get it.

I haven't read the code of BinaryLogClient very carefully, but I believe the reason is that BinaryLogClient doesn't read binlog endlessly. it read exactly amount of binlog that has been decided when it connects. so when the second split also use this client, it most likely reach an eof and jump out before reach the high watermark of second split.

I haven't reproduce this problem except in our online database, I think the core here is the table A I mentioned in #1149 must has finished snapshot reading and keeps having new DML sql.

yujiaxinlong avatar May 20 '22 04:05 yujiaxinlong

@leonardBang Hi, I'm wandering how is the review going, I checked code of BinaryLogClient recently and pretty sure the reason lead to this bug is what I described before. The BinaryLogClient for the first split won't read new log for the second.

yujiaxinlong avatar Sep 23 '22 10:09 yujiaxinlong