QPS drops for more than 2 minutes, and PITR and CDC lag are also affected, when injecting a PD leader I/O delay of 500ms or 1s or an I/O hang; this is caused by a circuit breaker mechanism that is by design

Open Lily2025 opened this issue 5 months ago • 1 comments

Bug Report

What did you do?

1. Run TPC-C.
2. Inject a 500ms I/O delay on the PD leader.

What did you expect to see?

QPS recovers within 2 minutes.

What did you see instead?

The QPS drop lasts 4 minutes when a 500ms I/O delay is injected on the PD leader.

clinic: https://clinic.pingcap.com.cn/portal/#/orgs/31/clusters/7370231614967615066?from=1716078044&to=1716079583


2024-05-19 08:31:01 {"container":"pd","level":"INFO","namespace":"endless-ha-test-oltp-pitr-tps-7539921-1-525","pod":"tc-pd-0","log":"[server.go:1816] ["no longer a leader because lease has expired, pd leader will step down"]"}

PD-0 lost PD leadership at 08:31:01.

2024-05-19 08:31:13 {"container":"pd","level":"INFO","namespace":"endless-ha-test-oltp-pitr-tps-7539921-1-525","pod":"tc-pd-0","log":"[server.go:1733] ["campaign PD leader ok"] [campaign-leader-name=tc-pd-0]"}

At 08:31:13, since PD-0 was still the etcd leader, it was re-elected as the PD leader

2024-05-19 08:31:28 {"container":"pd","level":"INFO","namespace":"endless-ha-test-oltp-pitr-tps-7539921-1-525","pod":"tc-pd-0","log":"[server.go:1816] ["no longer a leader because lease has expired, pd leader will step down"]"}

However, because the I/O chaos continued, PD-0 lost PD leadership again at 08:31:28. After this had happened three times, the mechanism that evicts the etcd leader was triggered:

2024-05-19 08:33:22 {"container":"pd","level":"ERROR","namespace":"endless-ha-test-oltp-pitr-tps-7539921-1-525","pod":"tc-pd-0","log":"[server.go:1713] ["campaign PD leader meets error due to etcd error"] [campaign-leader-name=tc-pd-0] [error="[PD:server:ErrLeaderFrequentlyChange]leader tc-pd-0 frequently changed, leader-key is [/pd/7370231614967615066/leader]"]"}

2024-05-19 08:33:20 {"namespace":"endless-ha-test-oltp-pitr-tps-7539921-1-525","pod":"tc-pd-1","log":"[server.go:1733] ["campaign PD leader ok"] [campaign-leader-name=tc-pd-1]","level":"INFO","container":"pd"}

At 08:33:22, PD actively evicted the etcd leader, and PD-1 was elected as both the etcd leader and the PD leader.
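
As a rough check on how long leadership stayed unstable, the gap between the first step-down and PD-1's successful campaign can be computed from the timestamps in the log lines quoted in this report. This is only an illustrative calculation; the variable names are made up for the example.

```python
from datetime import datetime

# Timestamps copied from the log lines quoted in this report.
FMT = "%Y-%m-%d %H:%M:%S"
first_step_down = datetime.strptime("2024-05-19 08:31:01", FMT)  # tc-pd-0 lease expired
pd1_elected = datetime.strptime("2024-05-19 08:33:20", FMT)      # tc-pd-1 "campaign PD leader ok"

# Time during which PD leadership kept flapping on tc-pd-0.
gap_seconds = (pd1_elected - first_step_down).total_seconds()
print(gap_seconds)  # 139.0
```

That is 2 minutes 19 seconds before a healthy PD leader took over, which already exceeds the expected 2-minute recovery.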

If the etcd leader does not step down on its own, PD can only switch the etcd leader passively, after three consecutive PD leader election failures.
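
The behavior described here resembles a circuit breaker that counts repeated PD leader campaigns by the same member. The following is a minimal sketch of that idea, not PD's actual implementation: the class name `LeaderChangeBreaker`, the method `record_campaign`, and the window length are hypothetical, and only the threshold of three comes from this report.

```python
from collections import deque


class LeaderChangeBreaker:
    """Hypothetical sketch: trips when one member campaigns too often in a time window."""

    def __init__(self, max_campaigns=3, window_seconds=300.0):
        # max_campaigns=3 mirrors the "three times" in this report;
        # window_seconds is an assumption, not a PD default.
        self.max_campaigns = max_campaigns
        self.window_seconds = window_seconds
        self._campaigns = deque()

    def record_campaign(self, now):
        """Record a campaign at time `now` (seconds); return True if the breaker trips."""
        self._campaigns.append(now)
        # Forget campaigns that have fallen out of the sliding window.
        while self._campaigns and now - self._campaigns[0] > self.window_seconds:
            self._campaigns.popleft()
        # When this returns True, the caller would give up etcd leadership
        # so that a healthy member can be elected.
        return len(self._campaigns) >= self.max_campaigns


breaker = LeaderChangeBreaker()
for t in (0.0, 30.0, 60.0):  # illustrative campaign times, not taken from the logs
    tripped = breaker.record_campaign(t)
print(tripped)  # True
```

With a design like this, recovery time is bounded below by how long it takes to accumulate the threshold number of campaigns, which is consistent with the drop lasting longer than 2 minutes.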

What version of PD are you using (pd-server -V)?

v8.1.0 githash: fca469ca33eb5d8b5e0891b507c87709a00b0e81

Lily2025 · Sep 04 '24 07:09