tiflow icon indicating copy to clipboard operation
tiflow copied to clipboard

The syncing task is running, but in fact binlog event isn't syncing to downstream

Open giant-panda666 opened this issue 1 year ago • 8 comments

What did you do?

  1. runing syncing task, the worker log as below: image
  2. after some time, the task is still running, but syncerBinlog pos don't be updated: image
  3. if i run resume command, it returns error: image and syncerBinlog in subTaskStatus still don't be updated image
  4. if i run pause command and resume command, worker returns error and then restart syncer. syncerBinlog pos is updated too. image image

dm version is v7.5.1 deployed by tidb operator.

What did you expect to see?

No response

What did you see instead?

No response

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

(paste DM version here, and you must ensure versions of dmctl, DM-worker and DM-master are same)

Upstream MySQL/MariaDB server version:

(paste upstream MySQL/MariaDB server version here)

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)

Other interesting information (system version, hardware config, etc):

>
>

current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)

giant-panda666 avatar Jul 05 '24 20:07 giant-panda666

please provide full log so we know what's happened. Maybe you can also check the binlog event at that position to see its content, and capture the goroutine stack of DM-worker.

lance6716 avatar Jul 10 '24 02:07 lance6716

All logs as above. This is the goroutine stack: goroutine.txt

giant-panda666 avatar Jul 10 '24 07:07 giant-panda666

goroutine.txt

At the moment of dumping goroutine, dm-worker is waiting to read binlog from relay component, and relay is waiting to read binlog from upstream MySQL. Do you see the lag at that moment?

lance6716 avatar Jul 10 '24 08:07 lance6716

There was no lag at that moment and query-status shows that masterBinlog and relayBinlog was almost the same, but syncerBinlog was stopped at last binlog file. The alert rule is dm_syncer_binlog_file{node="master"} - on(pod, task) dm_syncer_binlog_file{node="syncer"} > 0 evaluting every 1m for 10m,so when i found this problem it has passed at least 10m.

giant-panda666 avatar Jul 10 '24 08:07 giant-panda666

There was no lag at that moment and query-status shows that masterBinlog and relayBinlog was almost the same, but syncerBinlog was stopped at last binlog file. The alert rule is dm_syncer_binlog_file{node="master"} - on(pod, task) dm_syncer_binlog_file{node="syncer"} > 0 evaluting every 1m for 10m,so when i found this problem it has passed at least 10m.

Just double check, the goroutine was dumped at the time that there's lag between relayBinlog and syncerBinlog? If so, there's a relay.meta file under the relay log folder of dm-worker's relay folder. Can you show its content and the directory structure of relay folder?

lance6716 avatar Jul 10 '24 08:07 lance6716

Yes,when i received the alert, i made the goroutine dump and then recovered it by pause-task and resume-task. Maybe i need to wait it reproduced and then i take relay.meta and directory structure down, because it has be recovered.

giant-panda666 avatar Jul 10 '24 08:07 giant-panda666

goroutine.txt

One qeustion: why there are 26 connections(grep goroutine.txt by "created by github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher in goroutine 294") created by handleSourceBound?syncer worker-count is 16 and validators worker-count is 4, so what about there are another 6 connections?

giant-panda666 avatar Jul 10 '24 08:07 giant-panda666

@lance6716 It's reproduced. dm-worker log: image relay.meta: image query-status result: image

data.012136 was just created after i caputured the dm-worker log, so the relay log works well. Maybe it's related with reboot of dm-worker' host which dm-master thinks that syner task is running but in fact not, it doesn't be reproduced until reboot of host.

giant-panda666 avatar Jul 25 '24 16:07 giant-panda666