Data replication is stuck when TiKV (replication worker) closes the gRPC stream

Open pingyu opened this issue 2 months ago • 0 comments

What did you do?

Start the upstream and downstream TiDB clusters (cloud engine architecture, with replication worker) along with TiCDC.
Start the replication.
Stop the replication worker in the upstream cluster.
Restart the replication worker.

What did you expect to see?

Data replication resumes after the replication worker restarts.

What did you see instead?

The data replication did not resume.

cdc-1.log.zip

TiCDC found the connection was closed, but did not reconnect.

[2025/10/12 08:25:02.517 +00:00] [DEBUG] [shared_stream.go:233] ["event feed receive from grpc stream failed"] [namespace=default] [changefeed=rep-task] [streamID=6] [addr=127.0.0.1:5999] [code=Unknown] [error=EOF]
[2025/10/12 08:25:02.517 +00:00] [DEBUG] [shared_stream.go:233] ["event feed receive from grpc stream failed"] [namespace=default] [changefeed=rep-task_owner_ddl_puller] [streamID=4] [addr=127.0.0.1:5999] [code=Unknown] [error=EOF]
[2025/10/12 08:25:02.517 +00:00] [DEBUG] [shared_stream.go:233] ["event feed receive from grpc stream failed"] [namespace=default] [changefeed=rep-task_processor_ddl_puller] [streamID=5] [addr=127.0.0.1:5999] [code=Unknown] [error=EOF]

The cause seems to be that when StatusIsEOF holds, requestedStream returns nil. Since g, the errgroup.Group, is still waiting for the other goroutines, TiCDC gets stuck.

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

tidb-cse v7.5

Upstream TiKV version (execute tikv-server --version):

cse + replication worker

TiCDC version (execute cdc version):

e9c3dcb90f829836f4c334f8b31ebecc6248a8d3 (release-8.5)

Oct 12 '25 10:10 pingyu