
Scheduling service takes more than 5 minutes to recover after a network partition is injected on the scheduling primary

Open Lily2025 opened this issue 1 year ago • 2 comments

Enhancement

What did you do?

1. Run a workload.
2. Inject a network partition between the scheduling primary and all other pods.

What did you expect to see?

The scheduling service should recover in less than 5 minutes after a network partition is injected on the scheduling primary.

What did you see instead?

The scheduling service took more than 5 minutes to recover after a network partition was injected on the scheduling primary.

What version of PD are you using (pd-server -V)?

./pd-server -V
Release Version: v8.0.0-alpha
Edition: Community
Git Commit Hash: e199866f59e22e3759a8e9459ef33d57f784890d
Git Branch: heads/refs/tags/v8.0.0-alpha
UTC Build Time: 2024-02-26 11:38:17
2024-02-28T11:55:27.776+0800

Lily2025, Feb 28 '24 05:02

/assign rleungx

Lily2025, Feb 28 '24 05:02

The recovery time depends on the hibernate region tick interval: currently, switching the scheduling primary does not wake all regions, so the prepare checker cannot receive heartbeats from all regions in time.

rleungx, Feb 28 '24 07:02
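
To make rleungx's explanation concrete, here is a toy model of why recovery time follows the hibernate tick. It is not PD's or TiKV's code: the function names, the 60 s and 600 s intervals, and the 90% readiness threshold are all assumptions picked for illustration. The model assumes the new primary starts with no region information and counts itself prepared once enough regions have sent a heartbeat.

```python
from typing import List


def first_heartbeats(total: int, hibernated_percent: int,
                     active_interval: float, hibernate_tick: float) -> List[float]:
    """Seconds after the primary switch at which each region first reports.

    Hypothetical model: awake regions report within active_interval,
    hibernated regions only within hibernate_tick. Reports are spread
    evenly over ten slots of the relevant interval.
    """
    hibernated_count = total * hibernated_percent // 100
    arrivals = []
    for i in range(total):
        interval = hibernate_tick if i < hibernated_count else active_interval
        arrivals.append(interval * (i % 10 + 1) / 10)
    return arrivals


def time_until_prepared(arrivals: List[float], ready_percent: int) -> float:
    """Time at which ready_percent of all regions have reported (assumed rule)."""
    needed = (len(arrivals) * ready_percent + 99) // 100  # integer ceiling
    if needed == 0:
        return 0.0
    return sorted(arrivals)[needed - 1]


# 90% of regions hibernated and not woken by the switch (assumed numbers).
not_woken = time_until_prepared(first_heartbeats(100, 90, 60, 600), 90)
# Same cluster if the switch woke every region.
all_woken = time_until_prepared(first_heartbeats(100, 0, 60, 600), 90)
print(not_woken, all_woken)  # 540.0 54.0
```

Under these assumed numbers the checker waits most of a hibernate tick when regions stay asleep, and less than one normal heartbeat interval when all regions are woken on the switch.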