TiDB Operator hangs after setting TiDB's `status.report-status` configuration to `false`
Bug Report
What version of Kubernetes are you using?
Client Version: v1.31.1
Kustomize Version: v5.4.2
What version of TiDB Operator are you using? v1.6.0
What did you do?
We deployed a TiDB cluster with 3 replicas each of PD, TiKV, and TiDB. After the cluster was initialized, we set `status.report-status` to `false` in `spec.tidb.config` and applied the change.
After TiDB Operator successfully reconfigures the TiDB cluster, it loses connectivity to the cluster, mistakenly concludes that the cluster is unhealthy, and constantly tries to run failover. The failover spawns new pods, but the operator cannot contact them either, since the new pods run with the same configuration.
The health check fails at https://github.com/pingcap/tidb-operator/blob/24fa2832c4d1938e180b5baa6fde0450c38a8132/pkg/manager/member/tidb_member_manager.go#L303, which constantly triggers the Failover function.
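For context, the operator's TiDB health check depends on the HTTP status server that `report-status = false` disables, so the check can never recover on its own. Below is a minimal stand-alone probe that approximates that check (it is only a sketch, not the operator's code; the pod address is hypothetical and the default status port 10080 is assumed unchanged):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probeTiDBStatus approximates the operator's TiDB health check: an HTTP GET
// against the status API that TiDB serves on its status port (10080 by default).
// With report-status = false, TiDB never starts this HTTP server, so the request
// always fails and the member is treated as unhealthy, triggering failover.
func probeTiDBStatus(podAddr string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:10080/status", podAddr))
	if err != nil {
		return fmt.Errorf("status API unreachable: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status code %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Assumed in-cluster pod address; adjust cluster name and namespace.
	if err := probeTiDBStatus("test-cluster-tidb-0.test-cluster-tidb-peer.default"); err != nil {
		fmt.Println("health check failed:", err)
	}
}
```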
How to reproduce
- Deploy a TiDB cluster, for example:
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: |
      [performance]
      tcp-keep-alive = true
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: |
      log-level = "info"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```
- Add `status.report-status = false` to `spec.tidb.config` and apply it (a small probe sketch to verify the effect follows the manifest):
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: |
      [performance]
      tcp-keep-alive = true
      [status]
      report-status = false
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: |
      log-level = "info"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```
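To verify the effect of the change, the following minimal sketch checks whether the TiDB status port (10080 by default) still accepts connections after the rolling update. The pod address is an assumption based on the headless `-tidb-peer` service naming and the `default` namespace; adjust it to your environment and run it from inside the cluster.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Assumed address of one TiDB pod's status port, reachable from inside the
	// Kubernetes cluster; adjust cluster name and namespace as needed.
	addr := "test-cluster-tidb-0.test-cluster-tidb-peer.default:10080"

	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		// Expected once report-status = false takes effect: TiDB no longer
		// listens on its status port, which is exactly what the operator sees.
		fmt.Println("status port unreachable:", err)
		return
	}
	conn.Close()
	fmt.Println("status port still open")
}
```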
What did you expect to see?
We expected either that the TiDB pods would restart and the new configuration would take effect, or that TiDB Operator would reject the change, since the operator's own operations depend on TiDB's HTTP status API.
What did you see instead?
The last TiDB pod terminated and restarted. After that, TiDB Operator could not connect to the cluster and hung.
We're planning to add a better webhook back in the upcoming TiDB Operator v2, and this verification may be implemented in that webhook.
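As a rough illustration of what such a check could look like (this is only a sketch, not the planned v2 webhook; the function name and the plain line scanning are assumptions, and a real implementation would parse the TOML properly), a validating webhook could reject a TidbCluster whose TiDB config disables the status server:

```go
package main

import (
	"fmt"
	"strings"
)

// rejectDisabledStatusPort is a hypothetical validation helper: given the TOML
// text from spec.tidb.config, it returns an error when the [status] section
// sets report-status = false, since the operator relies on TiDB's status API.
func rejectDisabledStatusPort(tidbConfigTOML string) error {
	inStatusSection := false
	for _, raw := range strings.Split(tidbConfigTOML, "\n") {
		line := strings.TrimSpace(raw)
		switch {
		case strings.HasPrefix(line, "["):
			inStatusSection = line == "[status]"
		case inStatusSection && strings.HasPrefix(line, "report-status"):
			if strings.HasSuffix(strings.ReplaceAll(line, " ", ""), "=false") {
				return fmt.Errorf("spec.tidb.config sets [status] report-status = false, " +
					"which breaks TiDB Operator's health checks; rejecting the change")
			}
		}
	}
	return nil
}

func main() {
	cfg := `
[performance]
tcp-keep-alive = true
[status]
report-status = false
`
	if err := rejectDisabledStatusPort(cfg); err != nil {
		fmt.Println("admission denied:", err)
	}
}
```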