TiDB Operator hangs when `status.status-port` is updated in `spec.tidb.config`
Bug Report
What version of Kubernetes are you using? Client Version: v1.31.1 Kustomize Version: v5.4.2
What version of TiDB Operator are you using? v1.6.0
What did you do?
We deployed a TiDB cluster with 3 replicas each of PD, TiKV, and TiDB. After the cluster was fully up and healthy, we changed `spec.tidb.config` and set `status.status-port` to 10079.
The last Pod terminated and restarted with the updated configuration. However, TiDB Operator could not connect to the restarted Pod because the status port in the StatefulSet was still set to the old value (10080), which is inconsistent with TiDB's configuration. TiDB Operator uses the port from the StatefulSet to query the health of the TiDB Pods, so the health check fails. It mistakes the TiDB Pods for unhealthy and waits indefinitely.
TiDB Operator gets the status of the Pods in https://github.com/pingcap/tidb-operator/blob/1867f39610c990467df784268d8a9241667b7083/pkg/manager/member/tidb_member_manager.go#L1051, which constructs the URL with `v1alpha1.DefaultTiDBStatusPort` at https://github.com/pingcap/tidb-operator/blob/24fa2832c4d1938e180b5baa6fde0450c38a8132/pkg/controller/tidb_control.go#L152
The status port on the TiDB container is also hardcoded: https://github.com/pingcap/tidb-operator/blob/1867f39610c990467df784268d8a9241667b7083/pkg/manager/member/tidb_member_manager.go#L935
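The effect of the hardcoded default can be sketched in a few lines of Go. This is an illustrative reconstruction, not the operator's actual code: the constant is a local stand-in for `v1alpha1.DefaultTiDBStatusPort`, and the host format of the URL is assumed.

```go
package main

import "fmt"

// Local stand-in for v1alpha1.DefaultTiDBStatusPort (illustrative).
const defaultTiDBStatusPort = 10080

// healthURL sketches how the health-check URL is built: the port is the
// compile-time default, regardless of what spec.tidb.config says. The host
// template below is an assumption for illustration, not the exact format
// used in tidb_control.go.
func healthURL(podName, clusterName, namespace string) string {
	return fmt.Sprintf("http://%s.%s-tidb-peer.%s:%d/status",
		podName, clusterName, namespace, defaultTiDBStatusPort)
}

func main() {
	// Even after status-port is changed to 10079 in spec.tidb.config,
	// the probe still targets 10080.
	fmt.Println(healthURL("test-cluster-tidb-0", "test-cluster", "default"))
}
```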
How to reproduce
- Deploy a TiDB cluster, for example:
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: |
      [performance]
      tcp-keep-alive = true
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: |
      log-level = "info"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```
- Add `status.status-port` to the `spec.tidb.config`:
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  tidb:
    baseImage: pingcap/tidb
    config: |
      [performance]
      tcp-keep-alive = true
      [status]
      status-port = 10079
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tikv:
    baseImage: pingcap/tikv
    config: |
      log-level = "info"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```
What did you expect to see? We expected the TiDB Pods to restart and the new configuration to take effect.
What did you see instead?
The last Pod terminated and restarted. However, the operator could not connect to it because the status port was still set to 10080 in the StatefulSet. The operator therefore assumed the last Pod still needed time to become ready and hung indefinitely.
Root Cause
The operator uses 10080 as the default value for the status port. When the user specifies `status-port` in `spec.tidb.config`, the operator still creates the StatefulSet using the default value, so it connects to the Pod on the wrong port.
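A possible direction for a fix is to resolve the status port from the user's config before building the StatefulSet and the health-check URL, falling back to the default only when the user did not set one. The sketch below is a hypothetical helper, not operator code; the parsed TOML config is modeled as a nested map for simplicity.

```go
package main

import "fmt"

// Local stand-in for v1alpha1.DefaultTiDBStatusPort (illustrative).
const defaultTiDBStatusPort = 10080

// effectiveStatusPort sketches the fix idea: prefer status-port from the
// user's [status] section in spec.tidb.config, falling back to the default.
// The config is modeled as a nested map here; the operator would read it
// from its parsed TOML representation instead.
func effectiveStatusPort(config map[string]map[string]int64) int32 {
	if status, ok := config["status"]; ok {
		if port, ok := status["status-port"]; ok {
			return int32(port)
		}
	}
	return defaultTiDBStatusPort
}

func main() {
	userConfig := map[string]map[string]int64{
		"status": {"status-port": 10079},
	}
	fmt.Println(effectiveStatusPort(userConfig)) // user override wins: 10079
	fmt.Println(effectiveStatusPort(nil))        // no override: falls back to 10080
}
```

Using this value both for the container port in the StatefulSet and for the health-check URL would keep the two consistent after a rolling update.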