When upgrading or reloading a cluster, add retries when accessing PD

Open wsiqiang6 opened this issue 11 months ago • 0 comments

Bug Report

What did you do? tiup cluster upgrade <cluster_name>

During the TiKV evict-leader phase, the upgrade failed with: error requesting pd api, response: no leader

What did you expect to see?

After investigation, I found that because of the leader priority setting in PD, a PD leader switch occurred during the PD stage of the cluster upgrade. PD then checks the leader priority every minute, and one of these checks triggered a PD leader transfer that took 0.5 seconds.

By coincidence, within that 0.5-second window the upgrade had already reached the TiKV stage and was performing the "set leader evict scheduler" operation. The PD request returned a "no leader" error, and TiUP exited.

I think a retry mechanism should be added when calling the PD API, so that TiUP upgrade or reload operations are not interrupted by such short-lived PD leader changes.
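
To make the request concrete, here is a rough sketch of the kind of retry I have in mind. It is written in Python only for illustration and is not TiUP code; `PDNoLeaderError`, `call_with_retry`, `make_flaky_pd`, and the default attempt count and delays are all hypothetical names and values I made up. The fake PD call fails twice with "no leader" to model the short leader transfer, then succeeds.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class PDNoLeaderError(Exception):
    """Hypothetical error type standing in for PD's "no leader" response."""


def call_with_retry(
    request: Callable[[], T],
    attempts: int = 5,
    delay: float = 0.2,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying on PDNoLeaderError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except PDNoLeaderError:
            if attempt == attempts:
                raise  # Out of retries: surface the original error.
            sleep(delay)
            delay *= backoff  # Wait longer before the next attempt.


def make_flaky_pd(failures: int) -> Callable[[], str]:
    """Simulate a PD endpoint that reports "no leader" for the first calls."""
    state = {"calls": 0}

    def add_evict_leader_scheduler() -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise PDNoLeaderError("error requesting pd api, response: no leader")
        return "ok"

    return add_evict_leader_scheduler


if __name__ == "__main__":
    # Two failed calls model the brief leader transfer; the third succeeds.
    print(call_with_retry(make_flaky_pd(failures=2)))
```

With these assumed defaults, the two retries wait 0.2 s and 0.4 s, which is already longer than the 0.5-second transfer I observed. Only the "no leader" case is retried here; other errors still fail immediately, which seems like the safer behavior for an upgrade.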

What did you see instead? TiUP exits with an error.
What version of TiUP are you using (tiup --version)? v1.14.0

Jan 17 '25 09:01 wsiqiang6