The syncing task is running, but binlog events are not actually being synced to the downstream
What did you do?
- Running a syncing task; the worker log is as below:
- After some time, the task is still running, but the syncerBinlog position is not updated:
- If I run the resume command, it returns an error:
and syncerBinlog in subTaskStatus is still not updated.
- If I run the pause command followed by the resume command, the worker returns an error and then restarts the syncer; after that, the syncerBinlog position is updated again (rough command sequence below).
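For completeness, the recovery sequence was roughly the following; `<master-addr>` and `<task-name>` are placeholders:

```shell
dmctl --master-addr <master-addr> query-status <task-name>   # syncerBinlog stuck at the last file
dmctl --master-addr <master-addr> resume-task <task-name>    # returns an error, position still stuck
dmctl --master-addr <master-addr> pause-task <task-name>
dmctl --master-addr <master-addr> resume-task <task-name>    # worker restarts the syncer, position advances again
```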
DM version is v7.5.1, deployed by TiDB Operator.
What did you expect to see?
No response
What did you see instead?
No response
Versions of the cluster
DM version (run `dmctl -V` or `dm-worker -V` or `dm-master -V`):
(paste DM version here, and you must ensure versions of dmctl, DM-worker and DM-master are same)
Upstream MySQL/MariaDB server version:
(paste upstream MySQL/MariaDB server version here)
Downstream TiDB cluster version (execute `SELECT tidb_version();` in a MySQL client):
(paste TiDB cluster version here)
How did you deploy DM: tiup or manually?
(leave TiUP or manually here)
Other interesting information (system version, hardware config, etc):
current status of DM cluster (execute `query-status <task-name>` in dmctl)
(paste current status of DM cluster here)
Please provide the full log so we know what happened. Maybe you can also check the binlog event at that position to see its content, and capture the goroutine stack of DM-worker.
All logs as above. This is the goroutine stack: goroutine.txt
At the moment the goroutine dump was taken, dm-worker was waiting to read binlog from the relay component, and relay was waiting to read binlog from upstream MySQL. Did you see the lag at that moment?
There was no lag at that moment, and query-status showed that masterBinlog and relayBinlog were almost the same, but syncerBinlog was stuck at the last binlog file. The alert rule is `dm_syncer_binlog_file{node="master"} - on(pod, task) dm_syncer_binlog_file{node="syncer"} > 0`, evaluated every 1m with a 10m `for` duration, so by the time I found this problem at least 10m had already passed.
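Expressed as a standard Prometheus rule file, that alert looks roughly like this (the group and alert names here are made up):

```yaml
groups:
  - name: dm-binlog-lag                   # hypothetical group name
    interval: 1m                          # evaluate every 1m
    rules:
      - alert: DMSyncerBinlogFileBehind   # hypothetical alert name
        # Fires when the syncer's binlog file number falls behind the
        # master's binlog file number for the same pod/task for 10m.
        expr: dm_syncer_binlog_file{node="master"} - on(pod, task) dm_syncer_binlog_file{node="syncer"} > 0
        for: 10m
```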
Just to double-check: was the goroutine dumped at a time when there was lag between relayBinlog and syncerBinlog? If so, there is a relay.meta file under the relay log folder of dm-worker. Can you show its content and the directory structure of the relay folder?
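For reference, the relay folder usually looks something like the sketch below (the UUID and binlog file names are illustrative), with relay.meta recording the position up to which relay logs have been written:

```
<relay-dir>/
├── server-uuid.index
└── 7e427cc0-091c-11e9-9e45-72b7c59d52d7.000001/
    ├── mysql-bin.000001
    ├── mysql-bin.000002
    └── relay.meta
```

and relay.meta content along the lines of:

```
binlog-name = "mysql-bin.000002"
binlog-pos = 194
binlog-gtid = "7e427cc0-091c-11e9-9e45-72b7c59d52d7:1-12345"
```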
Yes, when I received the alert I made the goroutine dump and then recovered the task by pause-task and resume-task. Maybe I need to wait until it reproduces and then capture relay.meta and the directory structure, because the task has already been recovered.
goroutine.txt
One question: why are there 26 connections (counted by grepping goroutine.txt for `created by github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher in goroutine 294`) created by handleSourceBound? The syncer worker-count is 16 and the validator worker-count is 4, so where do the other 6 connections come from? (A small counting sketch follows.)
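To make that count reproducible, here is a minimal standalone Go sketch (not part of DM) that tallies goroutines in a stack dump such as goroutine.txt by their `created by` frame:

```go
// Tally goroutines in a Go stack dump by the frame that created them.
// Usage: go run main.go goroutine.txt
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1<<20), 1<<20) // allow long stack lines
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "created by ") {
			counts[line]++
		}
	}

	// Print creator frames sorted by descending goroutine count.
	frames := make([]string, 0, len(counts))
	for k := range counts {
		frames = append(frames, k)
	}
	sort.Slice(frames, func(i, j int) bool { return counts[frames[i]] > counts[frames[j]] })
	for _, fr := range frames {
		fmt.Printf("%4d  %s\n", counts[fr], fr)
	}
}
```

Running it against the dump should confirm the 26 `mysql.(*mysqlConn).startWatcher` goroutines and show the counts for the other creator frames alongside them.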
@lance6716 It's reproduced.
dm-worker log:
relay.meta:
query-status result:
`data.012136` was created just after I captured the dm-worker log, so the relay log works well. Maybe it's related to a reboot of the dm-worker's host, where dm-master thinks the syncer task is running but in fact it is not; the problem was not reproduced until the host rebooted.