pd: scheduling service takes more than 5 minutes to recover when a network partition is injected on the scheduling primary

Open Lily2025 opened this issue 1 year ago • 2 comments

Enhancement

What did you do?

1. Run a workload.
2. Inject a network partition between the scheduling primary and all other pods.

What did you expect to see?

The scheduling service should recover in less than 5 minutes after a network partition is injected on the scheduling primary.

What did you see instead?

The scheduling service takes more than 5 minutes to recover after a network partition is injected on the scheduling primary.

What version of PD are you using (`pd-server -V`)?

./pd-server -V
Release Version: v8.0.0-alpha
Edition: Community
Git Commit Hash: e199866f59e22e3759a8e9459ef33d57f784890d
Git Branch: heads/refs/tags/v8.0.0-alpha
UTC Build Time: 2024-02-26 11:38:17
2024-02-28T11:55:27.776+0800

Feb 28 '24 05:02 Lily2025

/assign rleungx

Feb 28 '24 05:02 Lily2025

The recovery time depends on the hibernate region tick interval: currently, switching the scheduling primary does not wake up all regions, so the prepare checker cannot receive heartbeats from all regions in time.

Feb 28 '24 07:02 rleungx
