tiflow

Changefeeds get stuck when PD is restarted with 100 changefeeds running

Open fubinzh opened this issue 11 months ago • 4 comments

What did you do?

  1. Deploy a TiDB cluster in a GCP GKE environment with 24 TiKV nodes (16c64g) and 9 CDC nodes (16c64g); cluster size is ~40 TB. Three workloads are running, with row widths of ~1 MB, 9 KB, and 1.7 KB.
  2. Create 100 changefeeds, each covering 40 tables.
  3. Update the PD configuration to trigger a rolling restart.

What did you expect to see?

CDC lag should be less than 10s

What did you see instead?

The CDC changefeeds get stuck.

image image

Versions of the cluster

cdc version: Release Version: v8.0.0
Git Commit Hash: 130403f3b9b8ad8a28ceada642277986e317ebc2
Git Branch: heads/refs/tags/v8.0.0
UTC Build Time: 2024-03-15 13:58:58
Go Version: go version go1.21.6 linux/amd64
Failpoint Build: false

fubinzh avatar Mar 17 '24 11:03 fubinzh

/severity major

fubinzh avatar Mar 17 '24 11:03 fubinzh

The test environment is not stable. I retested this case with 50 changefeeds, and the maximum changefeed lag was less than 5s.

sdojjy avatar Mar 19 '24 08:03 sdojjy

After the PD rolling restart at 3/17 11:44, we can see that the workload was not balanced: cdc-8 had 4k tables, its CPU usage was almost saturated, and its disk usage kept increasing until the disk was full.
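
For scale, a rough back-of-the-envelope check, assuming the 100 changefeeds × 40 tables from the reproduction steps and an ideal even spread over the 9 CDC nodes. The variable names are illustrative only, and the 4k figure is the approximate number reported for cdc-8:

```python
# Figures taken from the reproduction steps in this issue.
changefeeds = 100
tables_per_changefeed = 40
cdc_nodes = 9

# Total tables replicated, and the per-node share if they were spread evenly.
total_tables = changefeeds * tables_per_changefeed
even_share = total_tables / cdc_nodes

# Approximate table count reported on cdc-8 after the PD rolling restart.
observed_on_cdc8 = 4000

print(f"total tables: {total_tables}")
print(f"even share per node: {even_share:.0f}")
print(f"cdc-8 load vs even share: {observed_on_cdc8 / even_share:.1f}x")
```

Under these assumptions, cdc-8 was carrying roughly the table count of the whole cluster rather than about one ninth of it.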

image

image

fubinzh avatar Mar 20 '24 02:03 fubinzh

PD side issue: https://github.com/tikv/pd/issues/7973

fubinzh avatar Mar 25 '24 09:03 fubinzh

Closing this issue; the PD issue will be tracked separately.

fubinzh avatar May 06 '24 07:05 fubinzh