
TiDB Operator does not delete the original ConfigMap after the user changes the config in the CR, causing a resource leak


Bug Report

What version of Kubernetes are you using? Client Version: v1.31.0, Kustomize Version: v5.4.2, Server Version: v1.29.1

What version of TiDB Operator are you using? v1.6.0

What's the status of the TiDB cluster pods? All pods are in Running state

What did you do? We updated the spec.tikv.config field to a different non-empty value.

How to reproduce

  1. Deploy a TiDB cluster, for example:
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  ticdc:
    baseImage: pingcap/ticdc
    replicas: 3
  tidb:
    baseImage: pingcap/tidb
    config: "[performance]\n  tcp-keep-alive = true\ngraceful-wait-before-shutdown\
      \ = 30\n"
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tiflash:
    baseImage: pingcap/tiflash
    replicas: 3
    storageClaims:
    - resources:
        requests:
          storage: 10Gi
  tikv:
    baseImage: pingcap/tikv
    config: |
      [raftdb]
        max-open-files = 256
      [rocksdb]
        max-open-files = 256
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```
  2. Change the spec.tikv.config to another non-empty value (see the naming sketch after these steps), e.g.:
```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  ticdc:
    baseImage: pingcap/ticdc
    replicas: 3
  tidb:
    baseImage: pingcap/tidb
    config: "[performance]\n  tcp-keep-alive = true\ngraceful-wait-before-shutdown\
      \ = 30\n"
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tiflash:
    baseImage: pingcap/tiflash
    replicas: 3
    storageClaims:
    - resources:
        requests:
          storage: 10Gi
  tikv:
    baseImage: pingcap/tikv
    config: |
      [raftdb]
        max-open-files = 256
      [rocksdb]
        max-open-files = 128
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```
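For context: under configUpdateStrategy: RollingUpdate, the operator creates a fresh ConfigMap whose name changes with the config content, so step 2 produces a second ConfigMap rather than an in-place update. A minimal Go sketch of such content-hash naming (illustrative only; the operator's actual digest function is not reproduced here):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// configMapName derives a name from the config content, so any edit to the
// config yields a new, differently named ConfigMap.
func configMapName(base, config string) string {
	sum := sha256.Sum256([]byte(config))
	return fmt.Sprintf("%s-%x", base, sum[:4])
}

func main() {
	before := "[rocksdb]\n  max-open-files = 256\n"
	after := "[rocksdb]\n  max-open-files = 128\n"
	fmt.Println(configMapName("test-cluster-tikv", before)) // old ConfigMap name
	fmt.Println(configMapName("test-cluster-tikv", after))  // new name; the old object is left behind
}
```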

What did you expect to see? We expected the unused ConfigMaps to be garbage-collected by the TiDB Operator. This would keep the operator from continually generating new ConfigMaps and accumulating objects in etcd.

What did you see instead? The operator created a new ConfigMap for TiKV but left the old ConfigMap in place. We observed the same behavior when updating spec.tiflash.config, so all TiDB components are likely affected by this issue.

kos-team · Sep 23 '24 18:09
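To see the accumulation, one can list the ConfigMaps belonging to the TiKV component. A minimal client-go sketch, assuming the operator's common labels (app.kubernetes.io/instance and app.kubernetes.io/component — verify against your cluster) and the default namespace:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// The label selector is an assumption based on the operator's common
	// labels; adjust it to whatever labels your ConfigMaps actually carry.
	cms, err := client.CoreV1().ConfigMaps("default").List(context.Background(), metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/instance=test-cluster,app.kubernetes.io/component=tikv",
	})
	if err != nil {
		panic(err)
	}
	// After step 2 above, more than one TiKV ConfigMap remains.
	for _, cm := range cms.Items {
		fmt.Printf("%s\t%s\n", cm.Name, cm.CreationTimestamp)
	}
}
```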

Currently, we generate a new ConfigMap for the RollingUpdate ConfigUpdateStrategy. It may be better to keep only a few recent ConfigMaps and delete the older ones.

csuzhangxc · Sep 24 '24 03:09
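A sketch of the retention policy suggested above: keep the N most recent ConfigMaps matching a component's labels and delete the rest, never touching the one currently in use. The selector and the inUse parameter are illustrative assumptions, not the operator's actual API:

```go
package gc

import (
	"context"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pruneConfigMaps deletes all ConfigMaps matching selector in ns except the
// `keep` most recent ones and the ConfigMap currently in use. This is a
// sketch of the "keep some recent, delete older" idea, not tidb-operator code.
func pruneConfigMaps(ctx context.Context, c kubernetes.Interface, ns, selector, inUse string, keep int) error {
	list, err := c.CoreV1().ConfigMaps(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}
	items := list.Items
	// Sort newest first so the first `keep` entries are the ones to retain.
	sort.Slice(items, func(i, j int) bool {
		return items[i].CreationTimestamp.After(items[j].CreationTimestamp.Time)
	})
	for i := range items {
		if i < keep || items[i].Name == inUse {
			continue // retain recent ConfigMaps and the one still mounted
		}
		if err := c.CoreV1().ConfigMaps(ns).Delete(ctx, items[i].Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```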