TiDB transiently unavailable during a rolling update of the TiKV cluster

Open kos-team opened this issue 9 months ago • 0 comments

Bug Report

What version of Kubernetes are you using? 1.31

What version of TiDB Operator are you using? v1.6.0

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods? standard

What's the status of the TiDB cluster pods?

NAME                                       READY   STATUS    RESTARTS      AGE
test-cluster-discovery-59d967d9f-nbdkf     1/1     Running   0             56m
test-cluster-pd-0                          1/1     Running   0             29m
test-cluster-pd-1                          1/1     Running   0             6m48s
test-cluster-pd-2                          1/1     Running   0             6m48s
test-cluster-ticdc-0                       1/1     Running   0             22m
test-cluster-ticdc-1                       1/1     Running   0             22m
test-cluster-ticdc-2                       1/1     Running   0             22m
test-cluster-tidb-0                        2/2     Running   0             22m
test-cluster-tidb-1                        2/2     Running   0             23m
test-cluster-tidb-2                        2/2     Running   0             24m
test-cluster-tiflash-0                     4/4     Running   0             26m
test-cluster-tiflash-1                     4/4     Running   0             27m
test-cluster-tiflash-2                     4/4     Running   0             28m
test-cluster-tikv-0                        1/1     Running   0             13m
test-cluster-tikv-1                        1/1     Running   0             8m33s
test-cluster-tikv-2                        1/1     Running   0             9m51s
tidb-controller-manager-59c5d6499f-55qwl   1/1     Running   0             57m

What did you do?

Install the cluster by applying this CR:

apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  ticdc:
    baseImage: pingcap/ticdc
    replicas: 3
  tidb:
    baseImage: pingcap/tidb
    config: "[performance]\n  tcp-keep-alive = true\n"
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tiflash:
    baseImage: pingcap/tiflash
    replicas: 3
    storageClaims:
    - resources:
        requests:
          storage: 10Gi
  tikv:
    baseImage: pingcap/tikv
    config: 'log-level = "info"

      '
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0

Make a change that triggers a rolling upgrade of the TiKV StatefulSet, for example changing enableDynamicConfiguration from true to false.

What did you expect to see? The cluster remains highly available during upgrade operations.

We also performed the upgrade manually: for each TiKV pod, evict the leaders from its store, restart the pod, then remove the leader eviction scheduler. With this procedure we maintained 100% availability.
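
For reference, the manual procedure for one pod can be sketched roughly as below. This is a minimal illustration, not the operator's code and not the exact script we ran. It assumes a `pd-ctl` binary and `kubectl` on the PATH, a reachable PD address (for example through a port-forward), and that the TiKV store ID for the pod is already known. The helper names (`run`, `leader_count`, `pod_ready`, `wait_until`, `restart_tikv_pod`) and the timeout values are hypothetical. A Ready pod is used here as a stand-in for "the store is back up", which is a simplification.

```python
import json
import subprocess
import time


def run(cmd):
    """Run a command, raise on non-zero exit, and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def leader_count(pd_addr, store_id, run=run):
    """Read the number of Region leaders on a store from `pd-ctl store <id>`."""
    info = json.loads(run(["pd-ctl", "-u", pd_addr, "store", str(store_id)]))
    # Treat a missing field as zero leaders (assumption about the JSON output).
    return info.get("status", {}).get("leader_count", 0)


def pod_ready(pod, namespace, run=run):
    """Return True once the pod reports the Ready condition."""
    query = '{.status.conditions[?(@.type=="Ready")].status}'
    try:
        out = run(["kubectl", "get", "pod", pod, "-n", namespace,
                   "-o", "jsonpath=" + query])
    except subprocess.CalledProcessError:
        return False  # the StatefulSet has not recreated the pod yet
    return out.strip() == "True"


def wait_until(predicate, timeout, interval, sleep=time.sleep):
    """Poll predicate until it is true; raise TimeoutError after `timeout` seconds."""
    waited = 0
    while not predicate():
        if waited >= timeout:
            raise TimeoutError("condition not met within %d seconds" % timeout)
        sleep(interval)
        waited += interval


def restart_tikv_pod(pod, store_id, pd_addr, namespace="default",
                     timeout=600, interval=5, run=run, sleep=time.sleep):
    """Evict leaders, wait for eviction to finish, restart the pod, clean up."""
    pd = ["pd-ctl", "-u", pd_addr]
    run(pd + ["scheduler", "add", "evict-leader-scheduler", str(store_id)])
    try:
        # Gate the restart on the eviction being complete, not just started.
        wait_until(lambda: leader_count(pd_addr, store_id, run) == 0,
                   timeout, interval, sleep)
        run(["kubectl", "delete", "pod", pod, "-n", namespace])
        wait_until(lambda: pod_ready(pod, namespace, run),
                   timeout, interval, sleep)
    finally:
        # Always remove the scheduler so leaders can move back to the store.
        run(pd + ["scheduler", "remove", "evict-leader-scheduler-%s" % store_id])
```

A call would look like `restart_tikv_pod("test-cluster-tikv-2", 5, "http://127.0.0.1:2379")`, where the store ID `5` and the PD address are placeholders. The key point is the first `wait_until`: the pod is deleted only after the store reports zero leaders.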

What did you see instead? The cluster loses availability for one minute during the operation. The root cause is incomplete leader eviction. The operator does try to evict the leaders from each TiKV pod before restarting it. However, when restarting the last TiKV pod, it does not wait for the eviction to complete.

Mar 31 '25 01:03 kos-team