flink-cdc icon indicating copy to clipboard operation
flink-cdc copied to clipboard

[FLINK-35674][cdc-connector][mysql]Fix blocking caused by searching for timestamp in binlog file

Open ThorneANN opened this issue 1 year ago • 5 comments

ThorneANN avatar Jun 24 '24 07:06 ThorneANN

@yuxiqian PTAL

ThorneANN avatar Jun 26 '24 11:06 ThorneANN

Thanks for the update, now we know the blocking is caused by two reasons:

  1. We did not deal with empty queue.
  2. The BinaryLogClient is missing LifecycleListener, or more specifically, missing throwing exception in onCommunicationFailure.

To fix it, I prefer to add a anonymous LifecycleListener which throw the exception directly, and for empty queue, just return null and remove the current binlog file from the list. This fix could be straightforward and easy to test (using binlog file that only contains rotate events and using same server id). What do you think?

whhe avatar Jun 26 '24 13:06 whhe

Thanks for the update, now we know the blocking is caused by two reasons:

  1. We did not deal with empty queue.
  2. The BinaryLogClient is missing LifecycleListener, or more specifically, missing throwing exception in onCommunicationFailure.

To fix it, I prefer to add a anonymous LifecycleListener which throw the exception directly, and for empty queue, just return null and remove the current binlog file from the list. This fix could be straightforward and easy to test (using binlog file that only contains rotate events and using same server id). What do you think?

I agree most of those views ,but why throw the exception to end a job instead of retrying . In my tests, blocking task just retry once and it work normally .Here are some of my views: 1、 If a flink-cdc job always start with exception to end a job ,because BinaryLogClient listener registration failed , what we should do for this situation? 2、we only can remove the current binlog file ,because it only contains rotate events . 3、A new flink-cdc job will take some times to real work .

ThorneANN avatar Jun 26 '24 14:06 ThorneANN

Directly throwing the exception is the simplest solution, to deal with it better, you can also retry connecting for some specific exceptions, but it should not retry for all exceptions. I think it's not safe to directly go to the retry logic here.

By the way, there should be similar implementations in debezium, I'm not sure. I will check it out later.

whhe avatar Jun 26 '24 15:06 whhe

How can we retry when I fail to register the listener communication? By the way, I also looking forword to find similar implementations in debezium

ThorneANN avatar Jun 26 '24 15:06 ThorneANN

The reason for the registration failure of BinaryLogClient is that if the server id is not set, BinaryLogClient will use the default 65535 as server-id, so we need to register the server id in each source.

ThorneANN avatar Jul 15 '24 09:07 ThorneANN

Binlog file must has timestamp in v4 ,so we should don't consider the empty timestamp.

ThorneANN avatar Jul 15 '24 09:07 ThorneANN

If i flush log with mysql client twice and the binlog file must has FormatDescriptionEventData event and timestamp.

ThorneANN avatar Jul 15 '24 09:07 ThorneANN

@whhe PTAL

ThorneANN avatar Jul 22 '24 07:07 ThorneANN

I think adding a server id can indeed solve the issue, and a LifecycleListener is also needed for other exceptions. Can you add a basic implementation like ReaderThreadLifecycleListener please?

whhe avatar Jul 22 '24 13:07 whhe

I think adding a server id can indeed solve the issue, and a LifecycleListener is also needed for other exceptions. Can you add a basic implementation like ReaderThreadLifecycleListener please?

where i can add a basic implementation like ReaderThreadLifecycleListener ?

ThorneANN avatar Jul 23 '24 01:07 ThorneANN

May be we should consider fix the BinaryLogClient with LifecycleListener in a new pr ,because another pr also should be fixed https://github.com/apache/flink-cdc/pull/1915/

ThorneANN avatar Jul 23 '24 01:07 ThorneANN

May be we should consider fix the BinaryLogClient with LifecycleListener in a new pr ,because another pr also should be fixed https://github.com/apache/flink-cdc/pull/1915/

+1, creating a new pr for it makes sense to me.

whhe avatar Jul 23 '24 02:07 whhe

@shiyiky Thanks for this PR. Please add some description about this bug and changes from this PR.

ruanhang1993 avatar Aug 01 '24 06:08 ruanhang1993

@ruanhang1993 Desc:

  1. If the binaryLogClient does not have a registered server-id, the binaryLogClient will use the default 65535 parameter. If there is only one source in a Flink job, it will be normal.
  2. If multiple sources register binaryLogClients using the default 65535 in a Flink job, registration may fail because one of the sources is searching for the timestamp of the binlog file and has not yet released the client registered using 65535 as the server-id.
  3. If registration fails. BlockingQueue will always be empty and waiting without doing nothing.

Changes: When registering the binaryLogClient to search timestamp , use the server-id unique to each source for registration.

ThorneANN avatar Aug 01 '24 07:08 ThorneANN