
QPS keeps falling to zero until the fault recovers when a 500ms IO delay is injected on the PD leader for 10 minutes; we need some tuning parameters for this issue

Open Lily2025 opened this issue 1 year ago • 3 comments

Bug Report

What did you do?

1. Run TPC-C.
2. Inject a 500ms IO delay on the PD leader, lasting 10 minutes.

What did you expect to see?

QPS recovers within 5 minutes.

What did you see instead?

QPS kept falling to zero for the full 10 minutes, until the fault was recovered. [Image]

What version of PD are you using (pd-server -V)?

./pd-server -V
Release Version: v8.5.0-alpha-32-g90cc61b4
Edition: Community
Git Commit Hash: 90cc61b432feb3744cc71aa82bd55785e4646396
Git Branch: HEAD
UTC Build Time: 2024-11-21 08:41:25
2024-11-22T03:48:28.199+0800

Lily2025, Nov 26 '24 02:11

/type enhancement /remove-type bug

Lily2025, Nov 26 '24 02:11

/assign JmPotato

Lily2025, Nov 26 '24 02:11

After investigating the logs and metrics: in this case, QPS kept dropping to zero because pd-1, the node injected with IO latency, did not lose its etcd leader status. As a result, the PD leader was repeatedly elected on this faulty node. In addition, the 3 re-elections happened to span more than 5 minutes, so they did not trigger the condition of "being evicted as etcd leader if elected 3 consecutive times within 5 minutes". Ultimately, the unavailability persisted until the IO injection ended.

Perhaps we should provide a configurable option for the election circuit breaker threshold, e.g., 3 times within 10 minutes rather than the fixed 5 minutes.

JmPotato, Nov 26 '24 02:11
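To illustrate why the window length matters for the rule quoted above, here is a minimal sketch of a sliding-window election counter. It is not PD's implementation: the class `ElectionCircuitBreaker`, its parameters `max_elections` and `window_seconds`, and the election timestamps are all hypothetical and chosen only to mirror the "3 times within 5 minutes" versus "3 times within 10 minutes" comparison.

```python
from collections import deque


class ElectionCircuitBreaker:
    """Hypothetical sliding-window counter; not PD's actual code."""

    def __init__(self, max_elections=3, window_seconds=300.0):
        # Both values are assumptions standing in for a configurable option.
        self.max_elections = max_elections
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_election(self, now):
        """Record an election at time `now` (in seconds).

        Returns True when the number of elections inside the window
        reaches the threshold, i.e. the node should give up etcd leadership.
        """
        self._timestamps.append(now)
        # Drop elections that have fallen out of the window.
        while now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.max_elections


# Assumed example: three re-elections spread over 320 seconds.
election_times = (0, 160, 320)

five_minutes = ElectionCircuitBreaker(max_elections=3, window_seconds=300)
results_5m = [five_minutes.record_election(t) for t in election_times]

ten_minutes = ElectionCircuitBreaker(max_elections=3, window_seconds=600)
results_10m = [ten_minutes.record_election(t) for t in election_times]
```

With these assumed timestamps, the 5-minute window never holds more than two elections, so the breaker does not trip, while the 10-minute window sees all three and trips on the third. That is the behavior a configurable window would make tunable.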