Operator cannot upgrade version when some Pods crash
Bug Report
What version of Kubernetes are you using?
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.2", GitCommit:"f66044f4361b9f1f96f0053dd46cb7dce5e990a8", GitTreeState:"clean", BuildDate:"2022-06-15T14:22:29Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.4-12.8d683d9", GitCommit:"8d683d982b20a8f28a62ad502db0f352e50f621c", GitTreeState:"clean", BuildDate:"2019-12-30T09:24:27Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.24) and server (1.16) exceeds the supported minor version skew of +/-1
What version of TiDB Operator are you using?
TiDB Operator Version: version.Info{GitVersion:"v1.4.7-1+1200b7c2d69962", GitCommit:"1200b7c2d69962a28079109a2999aa17eb7f6ec6", GitTreeState:"clean", BuildDate:"2025-02-12T08:50:37Z", GoVersion:"go1.23.5", Compiler:"gc", Platform:"linux/amd64"}
What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?
What's the status of the TiDB cluster pods?
CrashLoopBackOff
What did you do?
- Upgrade TiKV from v4.0.8 to v5.4.0 (rough kubectl commands for these steps are sketched after the list)
- Before all nodes had restarted, scale in TiKV replicas from 33 to 32 (some pods were still on v4.0.8)
- The upgrade was blocked by the scale-in
- Kill tikv-25 (tikv-25 was already on v5.4.0 at this point)
- After tikv-25 restarted, it came back as v4.0.8 and crashed
- Edit the tc, changing TiKV's storage from 1500 GB to 2000 GB
- After the scale-in completed, the upgrade resumed, starting from tikv-31
- When evicting leaders from tikv-25, the operator could not determine how many leaders tikv-25 held because the pod was crashing, so it looped forever trying to fetch the leader count for tikv-25
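For reference, a minimal sketch of the tc edits above as kubectl commands, assuming a TidbCluster named `basic` in namespace `tidb` (both placeholders); the field paths follow the TidbCluster CR (`spec.version`, `spec.tikv.replicas`, `spec.tikv.requests.storage`):

```bash
# Start the upgrade by bumping the cluster version
kubectl -n tidb patch tc basic --type merge -p '{"spec":{"version":"v5.4.0"}}'

# Scale in TiKV from 33 to 32 while some pods are still on v4.0.8
kubectl -n tidb patch tc basic --type merge -p '{"spec":{"tikv":{"replicas":32}}}'

# Delete the already-upgraded pod (it then comes back as v4.0.8 and crashes)
kubectl -n tidb delete pod basic-tikv-25

# Enlarge the TiKV storage request from 1500Gi to 2000Gi
kubectl -n tidb patch tc basic --type merge -p '{"spec":{"tikv":{"requests":{"storage":"2000Gi"}}}}'
```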
What did you expect to see?
The upgrade would continue.
What did you see instead?
The upgrade was blocked.
This is by design: we don't want to shut down more replicas when some of them are already down (crashed).
The leader count is currently obtained from TiKV. Could it be obtained from PD instead? If the leader count PD reports for that TiKV store is 0, could the pod be upgraded directly?
no, ref #3801
I still think this is a bug. You can reproduce it by killing any already-upgraded pod during the upgrade from 4.0 to 5.x; the upgrade will not continue. I suggest adding logic that obtains the leader count from PD when it cannot be obtained from TiKV. If the count obtained from PD is also 0, check PD's start time: if PD has only just started, as described in #3801, wait for the next reconcile cycle.
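To illustrate, here is a rough sketch of getting a store's leader count from PD rather than from the TiKV pod itself, using PD's HTTP API (`GET /pd/api/v1/stores` and `GET /pd/api/v1/store/{store_id}`). The pod name, namespace, and store ID are placeholders, and the store ID is the one PD assigned, not the pod ordinal:

```bash
# List every store with its address and leader count, to find the store ID
# that belongs to tikv-25
kubectl -n tidb exec basic-pd-0 -- curl -s http://127.0.0.1:2379/pd/api/v1/stores \
  | jq '.stores[] | {address: .store.address, leader_count: .status.leader_count}'

# Query a single store directly (store ID 1045 is a placeholder)
kubectl -n tidb exec basic-pd-0 -- curl -s http://127.0.0.1:2379/pd/api/v1/store/1045 \
  | jq '.status.leader_count'
```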
I worked around the scenario above like this (rough commands are sketched after the list):
- Add the annotation runmode=debug to tikv-25. The next time tikv-25 starts, the container runs `tail -f /dev/null` instead of tikv-server and enters debug mode.
- Then copy the v5.4.0 tikv-server binary into the pod
- Then modify tikv_start_up_script.sh so that it starts TiKV with the v5.4.0 tikv-server binary
- tikv-25 then correctly reports a leader count of 0 and can be upgraded again
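Roughly, assuming the same placeholder names as above (namespace `tidb`, pod `basic-tikv-25`, container `tikv`) and a locally available v5.4.0 tikv-server binary, the workaround looks like this; treat it as a sketch rather than exact commands:

```bash
# Enter tidb-operator's debug mode: after the next container restart, the
# startup script only runs `tail -f /dev/null`, so nothing crashes
kubectl -n tidb annotate pod basic-tikv-25 runmode=debug

# Force the crashing tikv container to restart into debug mode
kubectl -n tidb exec basic-tikv-25 -c tikv -- kill -s TERM 1

# Copy a v5.4.0 tikv-server binary into the pod
kubectl cp ./tikv-server tidb/basic-tikv-25:/tikv-server-v5.4.0 -c tikv

# Open a shell in the pod, point tikv_start_up_script.sh at the new binary,
# and start TiKV manually so it can report leader_count = 0
kubectl -n tidb exec -it basic-tikv-25 -c tikv -- sh
```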
I think this fix is a bit involved. I suggest that you try to avoid getting the operator into this state in the first place.