Pausing a changefeed whose resolvedTS is already stuck may block the etcdWorker thread of the entire cluster
What did you do?
(internal reference GTOC-7966)
- Create a cluster running several changefeeds A, B, and C.
- Force changefeed A to become stuck, e.g. through #12162. (The stuck region contained 318 tables.)
- Pause changefeed A and then resume it (a minimal sketch of this step follows the list).
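For step 3, here is a minimal Go sketch using the TiCDC OpenAPI v2 pause/resume endpoints. The server address (default port 8300), the endpoint paths, and the changefeed ID "A" are assumptions for illustration and should be adjusted to the actual deployment; the `cdc cli changefeed pause/resume` commands work equally well.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Assumed: a TiCDC server on the default advertise address and a
	// changefeed literally named "A"; adjust both for a real deployment.
	const base = "http://127.0.0.1:8300/api/v2/changefeeds/A"

	// Pause changefeed A.
	resp, err := http.Post(base+"/pause", "application/json", nil)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Println("pause:", resp.Status)

	// Give the owner/processor a moment to start closing the changefeed.
	time.Sleep(5 * time.Second)

	// Resume changefeed A.
	resp, err = http.Post(base+"/resume", "application/json", nil)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Println("resume:", resp.Status)
}
```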
What did you expect to see?
Only changefeed A is affected.
What did you see instead?
The resolvedTS of all changefeeds A, B, and C is stuck. Logs across the entire cluster show warnings like:
[WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=10.709235922s] [role=processor]
Versions of the cluster
v7.5.6
In the logs we can see that this is caused by the SourceManager never being closed:
[INFO] [processor.go:896] ["processor closing ..."] [namespace=default] [changefeed=A]
[INFO] [processor.go:1051] ["processor sub-component is in stopping"] [namespace=default] [changefeed=A] [name=SinkManager]
[INFO] [manager.go:211] ["Sink manager exists"] [namespace=default] [changefeed=A] [error="context canceled"]
[INFO] [manager.go:1063] ["Closing sink manager"] [namespace=default] [changefeed=A]
[INFO] [manager.go:1075] ["Closed sink manager"] [namespace=default] [changefeed=A] [cost=174.869µs]
[INFO] [processor.go:1051] ["processor sub-component is in stopping"] [namespace=default] [changefeed=A] [name=SourceManager]
# no ["Closing source manager"] or ["Closed source manager"] afterwards
From the stack trace reaching `(*SourceManager).Close()`,
github.com/pingcap/tiflow/cdc/processor/sourcemanager.(*SourceManager).Close
github.com/pingcap/tiflow/cdc/processor/sourcemanager/manager.go:281
github.com/pingcap/tiflow/cdc/processor.(*component[...]).stop
github.com/pingcap/tiflow/cdc/processor/processor.go:1057
github.com/pingcap/tiflow/cdc/processor.(*processor).Close
github.com/pingcap/tiflow/cdc/processor/processor.go:906
github.com/pingcap/tiflow/cdc/processor.(*managerImpl).closeProcessor
github.com/pingcap/tiflow/cdc/processor/manager.go:284
github.com/pingcap/tiflow/cdc/processor.(*managerImpl).Tick
github.com/pingcap/tiflow/cdc/processor/manager.go:116
github.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run
github.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:290
github.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker
github.com/pingcap/tiflow/cdc/capture/capture.go:596
github.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func4
github.com/pingcap/tiflow/cdc/capture/capture.go:400
golang.org/x/sync/errgroup.(*Group).Go.func1
golang.org/x/[email protected]/errgroup/errgroup.go:78
we determined that the runEtcdWorker thread is stuck at this line:
https://github.com/pingcap/tiflow/blob/15afae82c9f095252dfd1824d59b38b05f87f8fc/cdc/processor/processor.go#L1056
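For context on why this stalls every changefeed on the capture and not just A: per the stack trace, the processor is closed synchronously inside `(*managerImpl).Tick`, which runs on the same single `EtcdWorker.Run` goroutine that ticks all other changefeeds. A minimal, illustrative sketch of that shape (made-up names, not the real tiflow code):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

type processor struct{ id string }

// Close stands in for (*processor).Close; for the stuck changefeed it never
// returns, mimicking the hang inside (*SourceManager).Close from this issue.
func (p *processor) Close() {
	if p.id == "A" {
		select {} // blocks forever
	}
}

func (p *processor) Tick() { fmt.Println("tick", p.id) }

func main() {
	// Watchdog playing the role of the "etcd client outCh blocking too long"
	// warning: after a few seconds it reports that the loop is stuck.
	go func() {
		time.Sleep(3 * time.Second)
		fmt.Println("tick loop stuck in Close(A); B and C stopped ticking")
		os.Exit(0)
	}()

	procs := []*processor{{"A"}, {"B"}, {"C"}}

	// Single etcdWorker-style loop: the manager's Tick closes paused
	// processors synchronously and then ticks the rest. Once Close("A")
	// blocks, B and C are never ticked again, so their resolvedTS also
	// stops advancing.
	for i := 0; ; i++ {
		for _, p := range procs {
			if p.id == "A" && i >= 2 { // changefeed A gets paused on iteration 2
				p.Close() // blocks forever on the shared goroutine
				continue
			}
			p.Tick()
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```

Running this prints a couple of ticks for A, B, and C and then nothing until the watchdog fires, matching the cluster-wide symptom above.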
`c.wg` is supposed to be released (`.Done()`) after `(*SourceManager).Run()` completes:
https://github.com/pingcap/tiflow/blob/15afae82c9f095252dfd1824d59b38b05f87f8fc/cdc/processor/processor.go#L1021-L1035
which means `(*SourceManager).Run()` is not being canceled by `c.cancel()`. We are not sure in which branch of the multiplexing puller it is stuck.
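To make the blocking pattern concrete, here is a minimal sketch of the stop shape described above (cancel the context, then wait on the WaitGroup that is released when `Run()` returns). The types and names are illustrative, not the real processor code; the point is that a `Run()` that never observes cancellation blocks `stop()` exactly the way processor.go:1056 blocks here:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// component is an illustrative stand-in for the processor sub-component
// wrapper; it is not the real tiflow type.
type component struct {
	wg     sync.WaitGroup
	cancel context.CancelFunc
}

// start launches run() and arranges for wg.Done() to be called when it
// returns, mirroring "c.wg is released after (*SourceManager).Run() completes".
func (c *component) start(ctx context.Context, run func(ctx context.Context) error) {
	ctx, c.cancel = context.WithCancel(ctx)
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		_ = run(ctx)
	}()
}

// stop mirrors the shape described above: cancel, then wait. If run()
// ignores the cancellation, Wait() never returns and the caller (here: the
// etcdWorker goroutine) hangs.
func (c *component) stop() {
	c.cancel()
	c.wg.Wait()
}

func main() {
	var c component
	// A Run() that never observes ctx.Done(), e.g. blocked somewhere inside
	// the multiplexing puller.
	c.start(context.Background(), func(ctx context.Context) error {
		select {}
	})

	done := make(chan struct{})
	go func() { c.stop(); close(done) }()

	select {
	case <-done:
		fmt.Println("stopped cleanly")
	case <-time.After(2 * time.Second):
		fmt.Println("stop() still blocked after 2s, like the etcdWorker thread in this issue")
	}
}
```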