tiflow
Data replication is stuck when TiKV (replication worker) closes the gRPC stream
What did you do?
- Start up the upstream and downstream TiDB clusters (cloud engine architecture, with replication worker) with TiCDC.
- Start the replication.
- Stop the replication worker in the upstream cluster.
- Restart the replication worker.
What did you expect to see?
Data replication resumes after the replication worker restarts.
What did you see instead?
The data replication did not resume.
TiCDC found the connection was closed, but did not reconnect.
[2025/10/12 08:25:02.517 +00:00] [DEBUG] [shared_stream.go:233] ["event feed receive from grpc stream failed"] [namespace=default] [changefeed=rep-task] [streamID=6] [addr=127.0.0.1:5999] [code=Unknown] [error=EOF]
[2025/10/12 08:25:02.517 +00:00] [DEBUG] [shared_stream.go:233] ["event feed receive from grpc stream failed"] [namespace=default] [changefeed=rep-task_owner_ddl_puller] [streamID=4] [addr=127.0.0.1:5999] [code=Unknown] [error=EOF]
[2025/10/12 08:25:02.517 +00:00] [DEBUG] [shared_stream.go:233] ["event feed receive from grpc stream failed"] [namespace=default] [changefeed=rep-task_processor_ddl_puller] [streamID=5] [addr=127.0.0.1:5999] [code=Unknown] [error=EOF]
The cause seems to be that when StatusIsEOF is true, requestedStream returns nil. A nil return does not cancel the errgroup.Group g, so g keeps waiting for its other goroutines and TiCDC gets stuck.
Versions of the cluster
Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):
tidb-cse v7.5
Upstream TiKV version (execute tikv-server --version):
cse + replication worker
TiCDC version (execute cdc version):
e9c3dcb90f829836f4c334f8b31ebecc6248a8d3 (release-8.5)