dm-worker hangs after running dm tasks with fast/full validators
What did you do?
- Run DM tasks with fast and full validators
- Run other DM cases, then run dmctl query-status to get the DM task status
What did you expect to see?
dm-worker should not hang, and dmctl query-status should succeed.
What did you see instead?
dm-worker hangs, and dmctl query-status returns a context deadline exceeded error.
/ # /dm-master -V
Release Version: v6.3.0-alpha
Git Commit Hash: 2c12e2e60d786d61419fc19aa28691dbdf525b55
Git Branch: heads/refs/tags/v6.3.0-alpha
UTC Build Time: 2022-09-27 11:01:23
Go Version: go version go1.19.1 linux/amd64
Failpoint Build: false
/ # time /dmctl --master-addr=127.0.0.1:8261 query-status
{
"result": true,
"msg": "",
"sources": [
{
"result": true,
"msg": "no sub task started",
"sourceStatus": {
"source": "source-0",
"worker": "dm-dm-worker-2",
"result": null,
"relayStatus": null
},
"subTaskStatus": [
]
},
{
"result": false,
"msg": "[code=38008:class=dm-master:scope=internal:level=high], Message: grpc request error, RawCause: rpc error: code = DeadlineExceeded desc = context deadline exceeded",
"sourceStatus": {
"source": "source-1",
"worker": "",
"result": null,
"relayStatus": null
},
"subTaskStatus": [
]
}
]
}
real 0m 30.15s
user 0m 0.05s
sys 0m 0.02s
/ # grep WARN /var/lib/dm-master/log/dm-master.log | tail
[2022/09/27 17:48:12.060 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:47:41.958 +00:00] ["time spent"=30.101792535s] [remote=127.0.0.1:40152] ["response type"=/pb.Master/QueryStatus] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:49:28.632 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:44:18.790 +00:00] ["time spent"=5m9.842540218s] [remote=127.0.0.1:38842] ["response type"=/pb.Master/OperateTask] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:49:29.223 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:48:59.121 +00:00] ["time spent"=30.102463024s] [remote=127.0.0.1:40665] ["response type"=/pb.Master/QueryStatus] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:50:45.723 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:45:36.547 +00:00] ["time spent"=5m9.175809748s] [remote=127.0.0.1:39335] ["response type"=/pb.Master/OperateSource] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:52:05.587 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:46:56.189 +00:00] ["time spent"=5m9.397952776s] [remote=127.0.0.1:39842] ["response type"=/pb.Master/OperateSource] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:53:21.104 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:48:11.972 +00:00] ["time spent"=5m9.132926971s] [remote=127.0.0.1:40346] ["response type"=/pb.Master/OperateSource] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/27 17:54:38.529 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/27 17:49:29.125 +00:00] ["time spent"=5m9.404252609s] [remote=127.0.0.1:40868] ["response type"=/pb.Master/OperateSource] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/28 01:25:27.517 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/28 01:24:57.413 +00:00] ["time spent"=30.103021161s] [remote=127.0.0.1:59562] ["response type"=/pb.Master/QueryStatus] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/28 01:27:45.295 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/28 01:27:15.185 +00:00] ["time spent"=30.109207704s] [remote=127.0.0.1:60382] ["response type"=/pb.Master/QueryStatus] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
[2022/09/28 01:29:11.210 +00:00] [WARN] [interceptor.go:197] ["request stats"] [component="embed etcd"] ["start time"=2022/09/28 01:28:41.107 +00:00] ["time spent"=30.10232034s] [remote=127.0.0.1:60906] ["response type"=/pb.Master/QueryStatus] ["request count"=-1] ["request size"=-1] ["response count"=-1] ["response size"=-1] ["request content"=]
Versions of the cluster
master
current status of DM cluster (execute query-status <task-name> in dmctl)
/ # /dm-master -V
Release Version: v6.3.0-alpha
Git Commit Hash: 2c12e2e60d786d61419fc19aa28691dbdf525b55
Git Branch: heads/refs/tags/v6.3.0-alpha
UTC Build Time: 2022-09-27 11:01:23
Go Version: go version go1.19.1 linux/amd64
Failpoint Build: false
dm-worker logs: work1.dm-worker.log
When DataValidator.stopInner is entered, either by a user action (stop-task, validator stop, etc.) or by a data validation error, stopInner holds the lock and waits for the other validator goroutines to exit. A validator goroutine may send an error at https://github.com/pingcap/tiflow/blob/1961b63f2172ffcdc3609867996c0bf7f17e08f2/dm/syncer/validate_worker.go#L167-L171 or elsewhere, but since the lock is held by DataValidator.stopInner, reading from the error channel is blocked at https://github.com/pingcap/tiflow/blob/1961b63f2172ffcdc3609867996c0bf7f17e08f2/dm/syncer/data_validator.go#L456-L459. Once the error channel is full (validate worker count > 10), the two sides deadlock.
When this deadlock is triggered by a user action, it further blocks the DM-worker, so query-status cannot be answered.
Also, Uber's Go style guide recommends a channel size of 0 or 1; if we followed it, this bug would be much easier to expose.
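For reference, here is a minimal, hypothetical Go sketch of the pattern described above: one goroutine holds a lock while waiting for workers that are blocked sending on a bounded error channel. All identifiers (validator, errChan, workerCount, etc.) are illustrative only and do not match the actual tiflow code.

```go
// deadlock_sketch.go: a hypothetical model of the reported hang. The goroutine
// that drains the error channel is intentionally omitted because, in the
// reported scenario, it cannot run while stopInner holds the lock.
package main

import (
	"fmt"
	"sync"
)

const (
	errChanCap  = 10 // buffered error channel, as in the report
	workerCount = 12 // more pending errors than the buffer can hold
)

type validator struct {
	mu      sync.Mutex
	errChan chan error
	wg      sync.WaitGroup
}

// worker stands in for a validate worker that hits a validation error and
// reports it. Once errChan is full, this send blocks forever, since the drain
// loop is stuck behind the lock held by stopInner.
func (v *validator) worker(id int) {
	defer v.wg.Done()
	v.errChan <- fmt.Errorf("worker %d: validation error", id)
}

// stopInner stands in for DataValidator.stopInner: it takes the lock, then
// waits for every worker to exit without draining errChan.
func (v *validator) stopInner() {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.wg.Wait() // never returns: some workers are stuck sending on errChan
}

func main() {
	v := &validator{errChan: make(chan error, errChanCap)}
	v.wg.Add(workerCount)
	for i := 0; i < workerCount; i++ {
		go v.worker(i)
	}
	// The Go runtime aborts with "all goroutines are asleep - deadlock!".
	// Shrinking errChanCap to 0 or 1 reproduces the hang with only one or two
	// workers, which is why a small buffer would have exposed the bug sooner.
	v.stopInner()
}
```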
/severity Major
/found automation