tiflow
tiflow copied to clipboard
changefeed stucks when there are 100 changefeed and restarting PD
What did you do?
- TiDB cluster deployed in GCP GKE env, with 24 TiKV (16c64g) and 9 CDC node (16c64g), cluster size ~40TB. 3 workload running, one workload with row width ~1mb, one 9kb, one 1.7kb.
- 100 changefeed created, each changefeed cover 40 tables.
- update PD configuration to trigger rolling restart
What did you expect to see?
CDC lag should be less than 10s
What did you see instead?
CDC changefeed stucks
Versions of the cluster
cdc version:
Release Version: v8.0.0
Git Commit Hash: 130403f3b9b8ad8a28ceada642277986e317ebc2
Git Branch: heads/refs/tags/v8.0.0
UTC Build Time: 2024-03-15 13:58:58
Go Version: go version go1.21.6 linux/amd64
Failpoint Build: false
/severity major
the test env is not stable, I retested this case with 50 changefeeds, the max changefeed LAG is less than 5s.
After PD rolling restart at 3/17 11:44, we can see that workload not balanced, cdc-8 has 4k tables, and CPU usage almost full, and disk size keep increasing and finally full.
PD side issue: https://github.com/tikv/pd/issues/7973
Close this issue, PD issue will be tracked seperately.