
Pausing a changefeed whose resolvedTS is already stuck may block the etcdWorker thread of the entire cluster

Open · kennytm opened this issue 2 months ago · 1 comment

What did you do?

(internal reference GTOC-7966)

  1. Set up a cluster running several changefeeds A, B, and C.
  2. Force changefeed A to become stuck, e.g. via #12162. (The stuck region contained 318 tables.)
  3. Pause changefeed A, then resume it.

What did you expect to see?

Only changefeed A is affected.

What did you see instead?

All changefeeds A, B, and C become resolvedTS-stuck. The log shows warnings like

[WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=10.709235922s] [role=processor]

in the entire cluster.
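For context, this warning presumably comes from the sending side of the channel (outCh) through which the etcd client hands watch events to the etcdWorker: if the receiver does not drain the channel in time, each send logs the warning, which is exactly what a permanently stuck etcdWorker would cause. The sketch below only illustrates that pattern; it is not the actual tiflow client.go code.

```go
// Illustrative sketch only (not the actual tiflow client.go): a sender that
// forwards events to a consumer through a channel and logs a warning whenever
// the consumer fails to drain the channel in time. A permanently stuck
// consumer (the etcdWorker in this issue) makes the warning fire on every send.
package main

import (
	"log"
	"time"
)

func sendWithWarning(outCh chan<- string, event string, warnAfter time.Duration) {
	start := time.Now()
	ticker := time.NewTicker(warnAfter)
	defer ticker.Stop()
	for {
		select {
		case outCh <- event:
			return
		case <-ticker.C:
			log.Printf("outCh blocking too long, the consumer may be stuck, duration=%s",
				time.Since(start))
		}
	}
}

func main() {
	outCh := make(chan string) // unbuffered: a stuck consumer blocks the sender
	go sendWithWarning(outCh, "watch-response", time.Second)
	time.Sleep(3 * time.Second) // simulate a consumer that is stuck for a while
	log.Println("consumer received:", <-outCh)
}
```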

Versions of the cluster

v7.5.6

kennytm · Oct 16 '25 07:10

In the log we see this is caused by the SourceManager not being closed:

[INFO] [processor.go:896] ["processor closing ..."] [namespace=default] [changefeed=A]
[INFO] [processor.go:1051] ["processor sub-component is in stopping"] [namespace=default] [changefeed=A] [name=SinkManager]
[INFO] [manager.go:211] ["Sink manager exists"] [namespace=default] [changefeed=A] [error="context canceled"]
[INFO] [manager.go:1063] ["Closing sink manager"] [namespace=default] [changefeed=A]
[INFO] [manager.go:1075] ["Closed sink manager"] [namespace=default] [changefeed=A] [cost=174.869µs]
[INFO] [processor.go:1051] ["processor sub-component is in stopping"] [namespace=default] [changefeed=A] [name=SourceManager]
# no ["Closing source manager"] and ["Closed source manager"] afterwards
Referring to the stack trace that reaches `(*SourceManager).Close()`:
github.com/pingcap/tiflow/cdc/processor/sourcemanager.(*SourceManager).Close
	github.com/pingcap/tiflow/cdc/processor/sourcemanager/manager.go:281
github.com/pingcap/tiflow/cdc/processor.(*component[...]).stop
	github.com/pingcap/tiflow/cdc/processor/processor.go:1057
github.com/pingcap/tiflow/cdc/processor.(*processor).Close
	github.com/pingcap/tiflow/cdc/processor/processor.go:906
github.com/pingcap/tiflow/cdc/processor.(*managerImpl).closeProcessor
	github.com/pingcap/tiflow/cdc/processor/manager.go:284
github.com/pingcap/tiflow/cdc/processor.(*managerImpl).Tick
	github.com/pingcap/tiflow/cdc/processor/manager.go:116
github.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run
	github.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:290
github.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker
	github.com/pingcap/tiflow/cdc/capture/capture.go:596
github.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func4
	github.com/pingcap/tiflow/cdc/capture/capture.go:400
golang.org/x/sync/errgroup.(*Group).Go.func1
	golang.org/x/[email protected]/errgroup/errgroup.go:78

Note that this entire stack runs on the capture's single EtcdWorker goroutine, which drives every changefeed on the node, so blocking it stalls all of them. From this stack we determined that the runEtcdWorker thread is stuck at this line:

https://github.com/pingcap/tiflow/blob/15afae82c9f095252dfd1824d59b38b05f87f8fc/cdc/processor/processor.go#L1056

The c.wg counter is supposed to be released (via .Done()) after (*SourceManager).Run() completes:

https://github.com/pingcap/tiflow/blob/15afae82c9f095252dfd1824d59b38b05f87f8fc/cdc/processor/processor.go#L1021-L1035

meaning (*SourceManager).Run() is not being canceled by c.cancel(). We are not sure which branch of the multiplexing puller it is stuck in.
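To make the failure mode concrete, here is a minimal Go sketch of the pattern described above; the `component` type and its methods are illustrative stand-ins, not the real processor code. stop() cancels the context and then waits on the WaitGroup, so a Run goroutine that never reacts to the cancellation leaves stop(), and with it the caller, blocked forever.

```go
// Minimal sketch of the suspected hang (types and names are illustrative,
// not the real processor code): stop() cancels the context and then waits on
// the WaitGroup. If run() never observes ctx.Done(), wg.Wait() blocks forever,
// and since the real stop() runs on the capture's single etcdWorker goroutine,
// every changefeed on that capture stalls with it.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type component struct {
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func (c *component) start(run func(ctx context.Context)) {
	ctx, cancel := context.WithCancel(context.Background())
	c.cancel = cancel
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		run(ctx)
	}()
}

func (c *component) stop() {
	c.cancel()
	c.wg.Wait() // hangs if run() never returns after cancellation
}

func main() {
	c := &component{}
	c.start(func(ctx context.Context) {
		// Simulates a SourceManager.Run that is blocked on something other
		// than the context, e.g. an internal channel of the puller.
		select {} // never returns; ctx.Done() is ignored
	})

	done := make(chan struct{})
	go func() { c.stop(); close(done) }()

	select {
	case <-done:
		fmt.Println("stopped cleanly")
	case <-time.After(2 * time.Second):
		fmt.Println("stop() is stuck: wg.Wait() never returned")
	}
}
```

Running this prints the "stop() is stuck" branch after the two-second timeout; in the real cluster there is no such timeout, so the etcdWorker goroutine simply never makes progress again.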

kennytm · Oct 16 '25 07:10