Checkpoint stuck for over 1.5h, resulting in changefeed failure

Open fubinzh opened this issue 8 months ago • 1 comments

What did you do?

  1. Five changefeeds (Kafka sink, simple protocol) were running, each replicating a subset of the tables.
  2. At first, workloads were running for all changefeeds, and CDC lag was not stable.
  3. Later, workloads were running only for the changefeeds 1k-odd and 1k-even.
  4. Waited and checked the lag status.

What did you expect to see?

  1. Lag for all changefeeds should be normal. At the very least, for changefeeds whose workload is still running, lag should return to normal after the incremental scan finishes.

What did you see instead?

Changefeed xxx5k did not return to normal. Its checkpoint was stuck for more than 1.5h, and the changefeed finally entered the failed state.

Versions of the cluster

Release Version: v8.2.0-alpha
Git Commit Hash: 90da67d1af284ff7408140969b7d5fb2ecc43bea
Git Branch: heads/refs/tags/v8.2.0-alpha
UTC Build Time: 2024-06-14 11:38:03
Go Version: go version go1.21.10 linux/amd64
Failpoint Build: false

fubinzh, Jun 21 '24 09:06