tiflow The syncing task is running, but in fact binlog event isn't syncing to downstream

What did you do?

runing syncing task, the worker log as below:
after some time, the task is still running, but syncerBinlog pos don't be updated:
if i run resume command, it returns error: and syncerBinlog in subTaskStatus still don't be updated
if i run pause command and resume command, worker returns error and then restart syncer. syncerBinlog pos is updated too.

dm version is v7.5.1 deployed by tidb operator.

What did you expect to see?

No response

What did you see instead?

No response

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

(paste DM version here, and you must ensure versions of dmctl, DM-worker and DM-master are same)

Upstream MySQL/MariaDB server version:

(paste upstream MySQL/MariaDB server version here)

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)

Other interesting information (system version, hardware config, etc):

>
>

current status of DM cluster (execute `query-status <task-name>` in dmctl)

(paste current status of DM cluster here)

Jul 05 '24 20:07 giant-panda666

please provide full log so we know what's happened. Maybe you can also check the binlog event at that position to see its content, and capture the goroutine stack of DM-worker.

Jul 10 '24 02:07 lance6716

All logs as above. This is the goroutine stack: goroutine.txt

Jul 10 '24 07:07 giant-panda666

goroutine.txt

At the moment of dumping goroutine, dm-worker is waiting to read binlog from relay component, and relay is waiting to read binlog from upstream MySQL. Do you see the lag at that moment?

Jul 10 '24 08:07 lance6716

There was no lag at that moment and query-status shows that masterBinlog and relayBinlog was almost the same, but syncerBinlog was stopped at last binlog file. The alert rule is dm_syncer_binlog_file{node="master"} - on(pod, task) dm_syncer_binlog_file{node="syncer"} > 0 evaluting every 1m for 10m，so when i found this problem it has passed at least 10m.

Jul 10 '24 08:07 giant-panda666

There was no lag at that moment and query-status shows that masterBinlog and relayBinlog was almost the same, but syncerBinlog was stopped at last binlog file. The alert rule is dm_syncer_binlog_file{node="master"} - on(pod, task) dm_syncer_binlog_file{node="syncer"} > 0 evaluting every 1m for 10m，so when i found this problem it has passed at least 10m.

Just double check, the goroutine was dumped at the time that there's lag between relayBinlog and syncerBinlog? If so, there's a relay.meta file under the relay log folder of dm-worker's relay folder. Can you show its content and the directory structure of relay folder?

Jul 10 '24 08:07 lance6716

Yes，when i received the alert, i made the goroutine dump and then recovered it by pause-task and resume-task. Maybe i need to wait it reproduced and then i take relay.meta and directory structure down, because it has be recovered.

Jul 10 '24 08:07 giant-panda666

goroutine.txt

One qeustion: why there are 26 connections(grep goroutine.txt by "created by github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher in goroutine 294") created by handleSourceBound？syncer worker-count is 16 and validators worker-count is 4, so what about there are another 6 connections?

Jul 10 '24 08:07 giant-panda666

@lance6716 It's reproduced. dm-worker log: relay.meta: query-status result:

data.012136 was just created after i caputured the dm-worker log, so the relay log works well. Maybe it's related with reboot of dm-worker' host which dm-master thinks that syner task is running but in fact not, it doesn't be reproduced until reboot of host.

Jul 25 '24 16:07 giant-panda666

The syncing task is running, but in fact binlog event isn't syncing to downstream

What did you do?

What did you expect to see?

What did you see instead?

Versions of the cluster

current status of DM cluster (execute query-status <task-name> in dmctl)

current status of DM cluster (execute `query-status <task-name>` in dmctl)