TiDB Operator hangs when `status.status-port` is updated in `spec.tidb.config`
Bug Report
What version of Kubernetes are you using? Client Version: v1.31.1 Kustomize Version: v5.4.2
What version of TiDB Operator are you using? v1.6.0
What did you do?
We deployed a TiDB cluster with 3 replicas each of PD, TiKV, and TiDB. After the cluster was fully up and healthy, we changed `spec.tidb.config` and set `status.status-port` to 10079.
The last Pod terminated and restarted with the updated configuration. However, TiDB Operator could not connect to the restarted Pod because the status port in the StatefulSet was still set to the old value (10080), which is inconsistent with TiDB's configuration. TiDB Operator uses the port from the StatefulSet to query the health of the TiDB Pods, so the health check fails. It mistakes the TiDB Pods for unhealthy and waits indefinitely.
TiDB Operator gets the status of the Pods in https://github.com/pingcap/tidb-operator/blob/1867f39610c990467df784268d8a9241667b7083/pkg/manager/member/tidb_member_manager.go#L1051, which constructs the URL with `v1alpha1.DefaultTiDBStatusPort` at https://github.com/pingcap/tidb-operator/blob/24fa2832c4d1938e180b5baa6fde0450c38a8132/pkg/controller/tidb_control.go#L152
The status port on the TiDB container is also hardcoded: https://github.com/pingcap/tidb-operator/blob/1867f39610c990467df784268d8a9241667b7083/pkg/manager/member/tidb_member_manager.go#L935
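The effect of the hardcoded default can be sketched in a few lines of Go. This is an illustrative reconstruction, not the operator's actual code: the constant is a local stand-in for `v1alpha1.DefaultTiDBStatusPort`, and the host format of the URL is assumed.

```go
package main

import "fmt"

// Local stand-in for v1alpha1.DefaultTiDBStatusPort (illustrative).
const defaultTiDBStatusPort = 10080

// healthURL sketches how the health-check URL is built: the port is the
// compile-time default, regardless of what spec.tidb.config says. The host
// template below is an assumption for illustration, not the exact format
// used in tidb_control.go.
func healthURL(podName, clusterName, namespace string) string {
	return fmt.Sprintf("http://%s.%s-tidb-peer.%s:%d/status",
		podName, clusterName, namespace, defaultTiDBStatusPort)
}

func main() {
	// Even after status-port is changed to 10079 in spec.tidb.config,
	// the probe still targets 10080.
	fmt.Println(healthURL("test-cluster-tidb-0", "test-cluster", "default"))
}
```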
How to reproduce
- Deploy a TiDB cluster, for example:
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: |
      [performance]
      tcp-keep-alive = true
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: |
      log-level = "info"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```
- Add `status.status-port` to the `spec.tidb.config`:
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: |
      [performance]
      tcp-keep-alive = true
      [status]
      status-port = 10079
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: |
      log-level = "info"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```
What did you expect to see? We expected the TiDB Pods to restart and the new configuration to take effect.
What did you see instead?
The last Pod terminated and restarted. However, the operator could not connect to it because the status port was still set to 10080 in the StatefulSet. The operator therefore assumed the last Pod still needed time to become ready and hung indefinitely.
Root Cause
The operator uses 10080 as the default value for the status port. When the user specifies `status-port` in `spec.tidb.config`, the operator still creates the StatefulSet using the default value, so it connects to the Pod on the wrong port.
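A possible direction for a fix is to resolve the status port from the user's config before building the StatefulSet and the health-check URL, falling back to the default only when the user did not set one. The sketch below is a hypothetical helper, not operator code; the parsed TOML config is modeled as a nested map for simplicity.

```go
package main

import "fmt"

// Local stand-in for v1alpha1.DefaultTiDBStatusPort (illustrative).
const defaultTiDBStatusPort = 10080

// effectiveStatusPort sketches the fix idea: prefer status-port from the
// user's [status] section in spec.tidb.config, falling back to the default.
// The config is modeled as a nested map here; the operator would read it
// from its parsed TOML representation instead.
func effectiveStatusPort(config map[string]map[string]int64) int32 {
	if status, ok := config["status"]; ok {
		if port, ok := status["status-port"]; ok {
			return int32(port)
		}
	}
	return defaultTiDBStatusPort
}

func main() {
	userConfig := map[string]map[string]int64{
		"status": {"status-port": 10079},
	}
	fmt.Println(effectiveStatusPort(userConfig)) // user override wins: 10079
	fmt.Println(effectiveStatusPort(nil))        // no override: falls back to 10080
}
```

Using this value both for the container port in the StatefulSet and for the health-check URL would keep the two consistent after a rolling update.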