clickhouse-operator
Deleting the operator while clusters are running can cause them to be deleted without warning
Deleting the operator and then reinstalling it may cause currently running clusters to be deleted without warning. Here's a reproduction of the problem from current HEAD repo contents.
1. Install the operator according to the instructions at docs.altinity.com:

   ```
   kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
   ```

2. Start a cluster in the default namespace by executing `kubectl apply -f simple-01-a.yaml` on the attached file. Wait for the cluster to install.

3. Delete the operator (you have to ^C out of the command as it hangs):

   ```
   kubectl delete -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
   ```

4. Install the operator again:

   ```
   kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
   ```

5. At this point two things happen.
   5.1 The simple-01-a.yaml cluster is deleted without warning.
   5.2 The new operator installation is broken. Trying to install a new cluster fails as shown below.

   ```
   error: unable to recognize "simple-01-a.yaml": no matches for kind "ClickHouseInstallation" in version "clickhouse.altinity.com/v1"
   ```
It looks as if 5.2 is explained by the following:

```
kubectl api-resources | grep click
clickhouseinstallationtemplates    chit       clickhouse.altinity.com/v1   true   ClickHouseInstallationTemplate
clickhouseoperatorconfigurations   chopconf   clickhouse.altinity.com/v1   true   ClickHouseOperatorConfiguration
```
The chi resource (clickhouseinstallations) is now missing after the cluster was deleted.
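A quick way to check for this state is to grep the api-resources output for the missing kind. The sketch below simulates that against a saved copy of the output from above (the file path `/tmp/api-resources.txt` is just for illustration; on a live cluster you would pipe `kubectl api-resources` straight into grep):

```shell
# Simulate the broken state with the saved `kubectl api-resources` output
# shown above; only the template and config CRDs survived.
cat > /tmp/api-resources.txt <<'EOF'
clickhouseinstallationtemplates   chit       clickhouse.altinity.com/v1   true   ClickHouseInstallationTemplate
clickhouseoperatorconfigurations  chopconf   clickhouse.altinity.com/v1   true   ClickHouseOperatorConfiguration
EOF

# The chi resource is registered iff `clickhouseinstallations` appears as a word.
if grep -qw clickhouseinstallations /tmp/api-resources.txt; then
  echo "chi resource present"
else
  echo "chi resource missing - ClickHouseInstallation kinds will not be recognized"
fi
```

Note that plain substring matching is not enough here, since `clickhouseinstallationtemplates` is a different resource; matching the whole word avoids a false positive.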
Here's the simple-01-a.yaml file to create a cluster, since Github won't let me attach it.
```yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "simple-01"
spec:
  configuration:
    clusters:
      - name: "cl"
        layout:
          shardsCount: 1
          replicasCount: 1
        templates:
          podTemplate: clickhouse-stable
          volumeClaimTemplate: storage-vc-template
  templates:
    podTemplates:
      - name: clickhouse-stable
        spec:
          containers:
            - name: clickhouse
              image: altinity/clickhouse-server:21.8.10.1.altinitystable
    volumeClaimTemplates:
      - name: storage-vc-template
        spec:
          storageClassName: standard
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
```
For anyone reading this, the root cause is that deleting the bundle .yaml file deletes the custom resource definitions, and Kubernetes automatically garbage-collects every object of those kinds. This is standard Kubernetes behavior, though the corrupted api-resources output seems to be a bonus feature. We are considering how to fix this. In the meantime, never run the following command while there are live ClickHouse clusters:

```
# Don't be this person.
kubectl delete -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
```
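One defensive pattern (my own sketch, not official guidance; all file names here are illustrative) is to strip the CustomResourceDefinition documents out of a downloaded copy of the bundle before running `kubectl delete -f` on it, so the delete can never cascade through the CRDs into running CHIs. The simulation below uses a tiny two-document stand-in for the real bundle:

```shell
# Sketch: filter CustomResourceDefinition documents out of a multi-document
# manifest so deleting the result cannot remove the CRDs (and, with them,
# every ClickHouseInstallation). Stand-in bundle with one CRD + one Deployment:
mkdir -p /tmp/bundle-split && cd /tmp/bundle-split
cat > bundle.yaml <<'EOF'
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: clickhouseinstallations.clickhouse.altinity.com
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clickhouse-operator
EOF

# Split on the `---` document separators into files xx00, xx01, ...
csplit -s -z bundle.yaml '/^---$/' '{*}'

# Keep only the documents that are not CRDs.
for f in xx*; do
  grep -q '^kind: CustomResourceDefinition' "$f" || cat "$f"
done > bundle-no-crds.yaml
```

On the real bundle you would point `csplit` at the downloaded clickhouse-operator-install-bundle.yaml; deleting `bundle-no-crds.yaml` then removes the operator Deployment and its supporting objects while leaving the CRDs, and therefore the CHIs, untouched.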
This issue was addressed in the 0.18.0 release.
I am also facing the same issue. I mistakenly deleted my operator using `kubectl delete -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml`.
@sunsingerus is there any lingering concern about this specific bug (or some generalization of it) after the fixes in 0.18.0? I realize this issue is fairly old, but I've chanced across it and am now considering switching to the Retain reclaim policy to defend against this class of accidental data loss (and others).
Here is how to solve this problem if you have already deleted the CRD (thanks, @hodgesrm!).
There's a problem with stuck finalizers that can cause old CHI installations to hang. The sequence of operations looks like this:

1. You delete the existing ClickHouse operator using `kubectl delete -f operator-installation.yaml` while CHI clusters are running.
2. You then drop the namespace where the CHI clusters are running, e.g., `kubectl delete ns my-namespace`.
3. This hangs. Run `kubectl get ns my-namespace -o yaml` and you'll see a message like the following:

   ```
   'message': Some content in the namespace has finalizers remaining: finalizer.clickhouseinstallation.altinity.com
   ```

That means the CHI can't be deleted, because the operator that would service its finalizer was deleted out from under it.

The fix is to find the CHI name, which should still be visible, and edit the object to remove the finalizer reference:

```
kubectl -n my-namespace get chi
kubectl -n my-namespace edit clickhouseinstallations.clickhouse.altinity.com my-clickhouse-cluster
```
Remove the finalizer from the object's metadata, save it, and everything will delete properly.
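To make the edit concrete, here's a local simulation of what `kubectl edit` needs to change (the manifest and file path are made up for illustration): the `finalizers:` list under `metadata` is what has to go.

```shell
# Saved copy of a stuck CHI, as `kubectl get chi ... -o yaml` might show it
# (illustrative content and path; on a live cluster use kubectl edit/patch).
cat > /tmp/chi.yaml <<'EOF'
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: simple-01
  finalizers:
    - finalizer.clickhouseinstallation.altinity.com
spec: {}
EOF

# Delete the finalizers: key and its single list entry.
sed -i '/finalizers:/,/finalizer.clickhouseinstallation.altinity.com/d' /tmp/chi.yaml

grep finalizer /tmp/chi.yaml || echo "finalizers removed"
```

With the `finalizers` list empty (or gone), Kubernetes considers the finalization complete and lets the pending delete of the object, and the namespace, proceed.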
My operator got stuck in a non-runnable/error state, and nothing could be added, updated, or removed. I tried restarting the operator, but it always failed. This means that no deployments can be deleted.
I gave up and removed the operator.
Is there another way that does not require that the namespace is deleted?
```
created by github.com/altinity/clickhouse-operator/pkg/controller/chi.(*Controller).Run
2023-12-02T17:58:31.302885271Z 	/clickhouse-operator/pkg/controller/chi/controller.go:509 +0x79d
2023-12-02T17:58:31.305923949Z panic: runtime error: index out of range [1] with length 1 [recovered]
2023-12-02T17:58:31.305930208Z 	panic: runtime error: index out of range [1] with length 1
```
Removing the finalizer information allows the deletion of the deployment:

```
kubectl patch ClickHouseInstallation :name -p '{"metadata":{"finalizers":null}}' --type=merge
```

(Not sure if this helps.)