
Deleting the operator while clusters are running can cause them to be deleted without warning

hodgesrm opened this issue 3 years ago · 9 comments

Deleting the operator and then reinstalling it may cause currently running clusters to be deleted without warning. Here's a reproduction of the problem from current HEAD repo contents.

  1. Install operator according to instructions at docs.altinity.com: kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
  2. Start a cluster in default namespace by executing kubectl apply -f simple-01-a.yaml on attached file. Wait for cluster to install.
  3. Delete the operator (you have to ^C out of the command as it hangs): kubectl delete -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
  4. Install the operator again: kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
  5. At this point two things happen:
     5.1. The simple-01-a.yaml cluster is deleted without warning.
     5.2. The new operator installation is broken; trying to install a new cluster fails as shown below.

error: unable to recognize "simple-01-a.yaml": no matches for kind "ClickHouseInstallation" in version "clickhouse.altinity.com/v1"

It looks as if 5.2 is explained by the following:

kubectl api-resources|grep click
clickhouseinstallationtemplates    chit         clickhouse.altinity.com/v1             true         ClickHouseInstallationTemplate
clickhouseoperatorconfigurations   chopconf     clickhouse.altinity.com/v1             true         ClickHouseOperatorConfiguration

The chi resource is now missing after the cluster was deleted.
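As a quick sanity check (the CRD name below follows the operator's usual naming convention and is an assumption on my part), you can query the CRD directly to confirm it was removed:

```shell
# Check whether the ClickHouseInstallation CRD still exists.
# After the delete step above, this is expected to return NotFound.
kubectl get crd clickhouseinstallations.clickhouse.altinity.com
```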

hodgesrm avatar Dec 07 '21 23:12 hodgesrm

Here's the simple-01-a.yaml file to create a cluster, since GitHub won't let me attach it.

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "simple-01"
spec:
  configuration:
    clusters:
      - name: "cl"
        layout:
          shardsCount: 1
          replicasCount: 1
        templates:
          podTemplate: clickhouse-stable
          volumeClaimTemplate: storage-vc-template
  templates:
    podTemplates:
      - name: clickhouse-stable
        spec:
          containers:
          - name: clickhouse
            image: altinity/clickhouse-server:21.8.10.1.altinitystable
    volumeClaimTemplates:
      - name: storage-vc-template
        spec:
          storageClassName: standard
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
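For reference, a minimal sketch of applying the file and checking on the cluster (the chi short name and the pod label below come from the operator and are assumptions here):

```shell
# Create the cluster, then watch for its resources to come up.
kubectl apply -f simple-01-a.yaml
kubectl get chi simple-01
kubectl get pods -l clickhouse.altinity.com/chi=simple-01
```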

hodgesrm avatar Dec 07 '21 23:12 hodgesrm

For anyone reading this, the root cause is that deleting the bundle .yaml also deletes the ClickHouseInstallation custom resource definition, at which point Kubernetes automatically garbage-collects every resource of that kind. This is standard Kubernetes behavior, though the corrupted api-resources output seems to be a bonus feature. We are considering how to fix this. In the meantime, never run the following command while there are live ClickHouse clusters:

# Don't be this person. 
kubectl delete -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
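If the goal is just to restart or remove the operator itself, a less destructive sketch (assuming the default install bundle, which places the operator Deployment in kube-system) is to delete only the Deployment and leave the CRDs, and therefore the clusters, in place:

```shell
# Remove only the operator Deployment; CRDs and running CHI clusters survive.
# Namespace and Deployment name assume the default install bundle.
kubectl -n kube-system delete deployment clickhouse-operator
```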

hodgesrm avatar Dec 08 '21 00:12 hodgesrm

This issue was addressed in the 0.18.0 release

sunsingerus avatar Feb 03 '22 13:02 sunsingerus

I am also facing the same issue. I mistakenly deleted my operator using kubectl delete -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml.

ankito4s avatar Mar 22 '22 06:03 ankito4s

@sunsingerus is there any lingering concern about this specific bug (or some generalization of it) after the fixes in 0.18.0? I realize this issue is fairly old, but I've chanced across it and am now considering switching to the Retain reclaim policy to defend against this class of accidental data loss (and others).

zcross avatar Nov 09 '22 21:11 zcross

To solve this problem if you have already gone ahead and deleted the CRD (thanks @hodgesrm!):

There's a problem with stuck finalizers that can cause old CHI installations to hang. The sequence of operations looks like this.

  1. You delete the existing ClickHouse operator using kubectl delete -f operator-installation.yaml with running CHI clusters.
  2. You then drop the namespace where the CHI clusters are running, e.g., kubectl delete ns my-namespace
  3. This hangs. You run kubectl get ns my-namespace -o yaml and you'll see a message like the following:

'message': Some content in the namespace has finalizers remaining: finalizer.clickhouseinstallation.altinity.com

That means the CHI can't be deleted: its finalizer is still set, but the operator that would have processed it is gone.

The fix is to find the CHI name, which should still be visible, and edit the object to remove the finalizer reference.

  1. kubectl -n my-namespace get chi
  2. kubectl -n my-namespace edit clickhouseinstallations.clickhouse.altinity.com my-clickhouse-cluster

Remove the finalizer entry from the metadata, save, and everything will delete properly.
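The steps above can be sketched end to end (my-namespace and my-clickhouse-cluster are placeholders):

```shell
# Find the stuck CHI and confirm the leftover finalizer before editing it out.
kubectl -n my-namespace get chi
kubectl -n my-namespace get chi my-clickhouse-cluster \
  -o jsonpath='{.metadata.finalizers}'
kubectl -n my-namespace edit clickhouseinstallations.clickhouse.altinity.com my-clickhouse-cluster
```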

lesandie avatar Jan 10 '23 13:01 lesandie

My operator got stuck in a non-runnable error state: nothing could be added, updated, or removed. I tried restarting the operator, but it always failed, which meant no deployments could be deleted.

I gave up and removed the operator.

Is there another way that does not require deleting the namespace?

acmeguy avatar Dec 02 '23 17:12 acmeguy

created by github.com/altinity/clickhouse-operator/pkg/controller/chi.(*Controller).Run
2023-12-02T17:58:31.302885271Z 	/clickhouse-operator/pkg/controller/chi/controller.go:509 +0x79d
2023-12-02T17:58:31.305923949Z panic: runtime error: index out of range [1] with length 1 [recovered]
2023-12-02T17:58:31.305930208Z 	panic: runtime error: index out of range [1] with length 1

acmeguy avatar Dec 02 '23 17:12 acmeguy

removing the finalizer information allows the deletion of the deployment:

kubectl patch ClickHouseInstallation :name -p '{"metadata":{"finalizers":null}}' --type=merge

(not sure if this helps)
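For what it's worth, the same patch can be looped over every CHI in a namespace (my-namespace is a placeholder):

```shell
# Clear finalizers on all ClickHouseInstallations in one namespace so that
# stuck deletions can complete. Use with care: this bypasses operator cleanup.
for chi in $(kubectl -n my-namespace get chi -o name); do
  kubectl -n my-namespace patch "$chi" \
    -p '{"metadata":{"finalizers":null}}' --type=merge
done
```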

acmeguy avatar Dec 02 '23 18:12 acmeguy