TiDB Operator hangs after setting TiDB's `status.report-status` configuration to `false`
Bug Report
What version of Kubernetes are you using?
Client Version: v1.31.1
Kustomize Version: v5.4.2
What version of TiDB Operator are you using? v1.6.0
What did you do?
We deployed a TiDB cluster with 3 replicas each of PD, TiKV, and TiDB. After the cluster was initialized, we set `status.report-status` to `false` in `spec.tidb.config` and applied the change.
After TiDB Operator successfully reconfigures the TiDB cluster, it loses connectivity to the cluster, mistakenly concludes that the cluster is unhealthy, and constantly tries to run failover. The failover spawns new pods, but the operator cannot contact them either, since the new pods run with the same configuration.
The health check fails at https://github.com/pingcap/tidb-operator/blob/24fa2832c4d1938e180b5baa6fde0450c38a8132/pkg/manager/member/tidb_member_manager.go#L303, which constantly triggers the Failover function.
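For context, the operator's TiDB health check depends on the HTTP status server that `report-status = false` disables, so the check can never recover on its own. Below is a minimal stand-alone probe that approximates that check (it is only a sketch, not the operator's code; the pod address is hypothetical and the default status port 10080 is assumed unchanged):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probeTiDBStatus approximates the operator's TiDB health check: an HTTP GET
// against the status API that TiDB serves on its status port (10080 by default).
// With report-status = false, TiDB never starts this HTTP server, so the request
// always fails and the member is treated as unhealthy, triggering failover.
func probeTiDBStatus(podAddr string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:10080/status", podAddr))
	if err != nil {
		return fmt.Errorf("status API unreachable: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status code %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Assumed in-cluster pod address; adjust cluster name and namespace.
	if err := probeTiDBStatus("test-cluster-tidb-0.test-cluster-tidb-peer.default"); err != nil {
		fmt.Println("health check failed:", err)
	}
}
```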
How to reproduce
- Deploy a TiDB cluster, for example:
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: |
      [performance]
      tcp-keep-alive = true
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: |
      log-level = "info"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```
- Add `status.report-status = false` to `spec.tidb.config` and apply it (a small probe sketch to verify the effect follows the manifest):
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: |
      [performance]
      tcp-keep-alive = true
      [status]
      report-status = false
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: |
      log-level = "info"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```
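To verify the effect of the change, the following minimal sketch checks whether the TiDB status port (10080 by default) still accepts connections after the rolling update. The pod address is an assumption based on the headless `-tidb-peer` service naming and the `default` namespace; adjust it to your environment and run it from inside the cluster.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Assumed address of one TiDB pod's status port, reachable from inside the
	// Kubernetes cluster; adjust cluster name and namespace as needed.
	addr := "test-cluster-tidb-0.test-cluster-tidb-peer.default:10080"

	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		// Expected once report-status = false takes effect: TiDB no longer
		// listens on its status port, which is exactly what the operator sees.
		fmt.Println("status port unreachable:", err)
		return
	}
	conn.Close()
	fmt.Println("status port still open")
}
```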
What did you expect to see?
We expected either that the TiDB pods would restart and the new configuration would take effect, or that TiDB Operator would reject the change, since the operator's own operations depend on TiDB's HTTP status API.
What did you see instead?
The last TiDB pod terminated and restarted. After that, TiDB Operator could not connect to the cluster and hung.
We're planning to add a better webhook back in the upcoming TiDB Operator v2, and this verification may be implemented in that webhook.
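As a rough illustration of what such a check could look like (this is only a sketch, not the planned v2 webhook; the function name and the plain line scanning are assumptions, and a real implementation would parse the TOML properly), a validating webhook could reject a TidbCluster whose TiDB config disables the status server:

```go
package main

import (
	"fmt"
	"strings"
)

// rejectDisabledStatusPort is a hypothetical validation helper: given the TOML
// text from spec.tidb.config, it returns an error when the [status] section
// sets report-status = false, since the operator relies on TiDB's status API.
func rejectDisabledStatusPort(tidbConfigTOML string) error {
	inStatusSection := false
	for _, raw := range strings.Split(tidbConfigTOML, "\n") {
		line := strings.TrimSpace(raw)
		switch {
		case strings.HasPrefix(line, "["):
			inStatusSection = line == "[status]"
		case inStatusSection && strings.HasPrefix(line, "report-status"):
			if strings.HasSuffix(strings.ReplaceAll(line, " ", ""), "=false") {
				return fmt.Errorf("spec.tidb.config sets [status] report-status = false, " +
					"which breaks TiDB Operator's health checks; rejecting the change")
			}
		}
	}
	return nil
}

func main() {
	cfg := `
[performance]
tcp-keep-alive = true
[status]
report-status = false
`
	if err := rejectDisabledStatusPort(cfg); err != nil {
		fmt.Println("admission denied:", err)
	}
}
```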