tiflow: changefeeds get stuck when there are 100 changefeeds and PD is restarted


Open fubinzh opened this issue 11 months ago • 4 comments

What did you do?

A TiDB cluster was deployed in a GCP GKE environment with 24 TiKV nodes (16c64g) and 9 CDC nodes (16c64g); cluster size ~40 TB. Three workloads were running, with row widths of ~1 MB, 9 KB, and 1.7 KB.
100 changefeeds were created, each covering 40 tables.
The PD configuration was updated to trigger a rolling restart.

What did you expect to see?

CDC lag should be less than 10s

What did you see instead?

CDC changefeeds get stuck.

Versions of the cluster

cdc version: Release Version: v8.0.0
Git Commit Hash: 130403f3b9b8ad8a28ceada642277986e317ebc2 Git Branch: heads/refs/tags/v8.0.0
UTC Build Time: 2024-03-15 13:58:58
Go Version: go version go1.21.6 linux/amd64
Failpoint Build: false

Mar 17 '24 11:03 fubinzh

/severity major

Mar 17 '24 11:03 fubinzh

The test environment is not stable. I retested this case with 50 changefeeds, and the maximum changefeed lag was less than 5s.

Mar 19 '24 08:03 sdojjy

After the PD rolling restart at 3/17 11:44, we can see that the workload is not balanced: cdc-8 has 4k tables, its CPU usage is almost full, and its disk usage keeps increasing until the disk is finally full.
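
To put the 4k figure in context, here is a rough back-of-the-envelope sketch using only the numbers reported in this issue (100 changefeeds, 40 tables each, 9 CDC nodes). It assumes the changefeeds cover disjoint tables and treats "4k" as approximately 4000; the function names are illustrative and not part of tiflow.

```python
# Figures taken from the issue report.
CHANGEFEEDS = 100
TABLES_PER_CHANGEFEED = 40
CDC_NODES = 9


def expected_tables_per_node(changefeeds: int, tables_per_changefeed: int, nodes: int) -> float:
    """Tables each CDC node would hold under a perfectly even spread."""
    return changefeeds * tables_per_changefeed / nodes


def skew_ratio(observed_tables: int, even_share: float) -> float:
    """How many times heavier a node's load is than the even share."""
    return observed_tables / even_share


total_tables = CHANGEFEEDS * TABLES_PER_CHANGEFEED
even_share = expected_tables_per_node(CHANGEFEEDS, TABLES_PER_CHANGEFEED, CDC_NODES)

# "4k tables" on cdc-8 is read here as roughly 4000 (assumption).
observed_on_cdc8 = 4000

print(f"total replicated tables: {total_tables}")
print(f"even share per node: {even_share:.0f}")
print(f"cdc-8 load vs even share: {skew_ratio(observed_on_cdc8, even_share):.1f}x")
```

Under these assumptions, cdc-8 was carrying roughly every replicated table in the test, about nine times its even share, which would be consistent with the CPU saturation and disk growth seen on that node.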

Mar 20 '24 02:03 fubinzh

PD side issue: https://github.com/tikv/pd/issues/7973

Mar 25 '24 09:03 fubinzh

Closing this issue; the PD issue will be tracked separately.

May 06 '24 07:05 fubinzh
