tiflow: checkpoint stuck for over 1.5h, resulting in changefeed failure

Open fubinzh opened this issue 8 months ago • 1 comment

What did you do?

Five changefeeds (Kafka sink, simple protocol) were running, each replicating a subset of the tables.
At first, workloads were running for all changefeeds, and the cdc lag was not stable.
Later, workloads were running only for the changefeeds 1k-odd and 1k-even.
Waited and checked the lag status.
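
For anyone reproducing the last step, here is a minimal sketch of how checkpoint lag can be derived from a changefeed's checkpoint TSO. It relies on the TiDB TSO layout (physical milliseconds in the high bits, an 18-bit logical counter in the low bits). The function names and the one-hour threshold are hypothetical, chosen only for illustration, and how the checkpoint TSO is obtained is left out.

```python
from datetime import datetime, timedelta, timezone

# TiDB TSO layout: physical time in milliseconds in the high bits,
# followed by an 18-bit logical counter in the low bits.
TSO_LOGICAL_BITS = 18
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tso_to_datetime(tso: int) -> datetime:
    """Convert a TSO to the UTC wall-clock time of its physical part."""
    physical_ms = tso >> TSO_LOGICAL_BITS
    return EPOCH + timedelta(milliseconds=physical_ms)


def checkpoint_lag_seconds(checkpoint_tso: int, now: datetime) -> float:
    """Seconds between `now` and the checkpoint's physical time."""
    return (now - tso_to_datetime(checkpoint_tso)).total_seconds()


def looks_stuck(checkpoint_tso: int, now: datetime,
                threshold: timedelta = timedelta(hours=1)) -> bool:
    """Hypothetical helper: flag a checkpoint that lags more than `threshold`."""
    return checkpoint_lag_seconds(checkpoint_tso, now) > threshold.total_seconds()
```

Sampling the checkpoint TSO twice, a few minutes apart, and seeing the same value both times is a second signal that the checkpoint is not advancing rather than merely lagging.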

What did you expect to see?

Lag for all changefeeds should be normal. At a minimum, for changefeeds whose workload is still running, the lag should return to normal after the incremental scan finishes.

What did you see instead?

Changefeed xxx5k did not return to normal. Its checkpoint was stuck for more than 1.5 hours, and the changefeed finally entered the failed state.

Versions of the cluster

Release Version: v8.2.0-alpha
Git Commit Hash: 90da67d1af284ff7408140969b7d5fb2ecc43bea
Git Branch: heads/refs/tags/v8.2.0-alpha
UTC Build Time: 2024-06-14 11:38:03
Go Version: go version go1.21.10 linux/amd64
Failpoint Build: false

Jun 21 '24 09:06 fubinzh
