clickhouse-operator
Cluster service deleted on upgrade due to reconcile failure
While upgrading via Helm from 0.23.2 to 0.23.6, I ran into a problem where the cluster service disappeared. The same change also included a minor upgrade of the altinitystable image, but I don't think that is related.
The important bits in my CHI resource:
spec:
  defaults:
    templates:
      podTemplate: default-clickhouse-pod
      dataVolumeClaimTemplate: default-data-volume
      logVolumeClaimTemplate: default-log-volume
      clusterServiceTemplate: default-service-template
  configuration:
    settings:
      logger/level: information
    clusters:
      - name: events
        layout:
          shardsCount: 1
          replicasCount: 3
        secret:
          auto: "true"
  templates:
    serviceTemplates:
      - name: default-service-template
        generateName: clickhouse-{chi}
        metadata:
          annotations:
            cloud.google.com/load-balancer-type: "Internal"
            service.beta.kubernetes.io/aws-load-balancer-internal: "true"
            service.beta.kubernetes.io/azure-load-balancer-internal: "true"
            service.beta.kubernetes.io/openstack-internal-load-balancer: "true"
            service.beta.kubernetes.io/cce-load-balancer-internal-vpc: "true"
        spec:
          ports:
            - name: http
              port: 8123
            - name: tcp
              port: 9000
          type: LoadBalancer
When the operator upgraded, it appeared to get stuck attempting to convert clickhouse-events from a LoadBalancer to a ClusterIP. I believe this is somehow related to this commit that changes the default from LoadBalancer to ClusterIP. However, this CHI has always explicitly set the template to use LoadBalancer.
On startup, I saw this in the logs:
I0710 05:05:48.675757 1 service.go:86] CreateServiceCluster():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:foo/clickhouse-events
I0710 05:05:48.676889 1 worker-chi-reconciler.go:907] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Service: foo/clickhouse-events not found. err: service "clickhouse-events" not found
I0710 05:05:48.840035 1 deleter.go:322] deleteServiceIfExists():foo/clickhouse-events:Not Found Service: foo/clickhouse-events err: services "clickhouse-events" not found
I0710 05:05:49.062109 1 worker.go:1480] createService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:OK Create Service: foo/clickhouse-events
I0710 05:05:49.883043 1 worker-chi-reconciler.go:922] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Service reconcile successful: foo/clickhouse-events
...
I0710 05:06:25.213119 1 worker-chi-reconciler.go:900] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Service found: foo/clickhouse-events. Will try to update
E0710 05:06:25.213168 1 worker-chi-reconciler.go:914] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:Update Service: foo/clickhouse-events failed with error: just recreate the service in case of service type change 'LoadBalancer'=>'ClusterIP'
I0710 05:06:26.384478 1 deleter.go:329] deleteServiceIfExists():foo/clickhouse-events:OK delete Service: foo/clickhouse-events
E0710 05:06:26.584816 1 worker.go:1486] createService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:FAILED Create Service: foo/clickhouse-events err: object is being deleted: services "clickhouse-events" already exists
E0710 05:06:27.422151 1 worker-chi-reconciler.go:928] reconcileService():foo/events/c857c1dd-66bf-4182-96eb-3e45f61664ee:FAILED to reconcile Service: foo/clickhouse-events CHI: events
The Service is now recreated whenever I force a restart of the operator, but about a minute later it is deleted again. It is not recreated until the operator restarts again.
I0710 05:16:25.276854 1 service.go:86] CreateServiceCluster():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:foo/clickhouse-events
I0710 05:16:25.278246 1 worker-chi-reconciler.go:907] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Service: foo/clickhouse-events not found. err: service "clickhouse-events" not found
I0710 05:16:25.435511 1 deleter.go:322] deleteServiceIfExists():foo/clickhouse-events:Not Found Service: foo/clickhouse-events err: services "clickhouse-events" not found
I0710 05:16:25.805221 1 worker.go:1480] createService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:OK Create Service: foo/clickhouse-events
I0710 05:16:26.468825 1 worker-chi-reconciler.go:922] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Service reconcile successful: foo/clickhouse-events
...
I0710 05:17:26.904518 1 worker-chi-reconciler.go:900] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Service found: foo/clickhouse-events. Will try to update
E0710 05:17:26.904648 1 worker-chi-reconciler.go:914] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:Update Service: foo/clickhouse-events failed with error: just recreate the service in case of service type change 'LoadBalancer'=>'ClusterIP'
I0710 05:17:28.073703 1 deleter.go:329] deleteServiceIfExists():foo/clickhouse-events:OK delete Service: foo/clickhouse-events
E0710 05:17:28.274057 1 worker.go:1486] createService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:FAILED Create Service: foo/clickhouse-events err: object is being deleted: services "clickhouse-events" already exists
E0710 05:17:29.119358 1 worker-chi-reconciler.go:928] reconcileService():foo/events/a84245ce-80bc-4fb9-ad58-c69a77019a70:FAILED to reconcile Service: foo/clickhouse-events CHI: events
Note: When the Service is created, it is created correctly as a LoadBalancer, but then the second reconciliation attempts to change it to a ClusterIP again.
Did you upgrade the CRDs separately, as described in https://github.com/Altinity/clickhouse-operator/blob/master/deploy/helm/clickhouse-operator/README.md?
@Slach I did not update the CRD. I have done so now, and it still is happening. Do I need to manually set a status.hostsUnchanged value in the CHI status?
% kubectl -n clickhouse get deploy chop-altinity-clickhouse-operator -o yaml | grep "image:"
image: altinity/clickhouse-operator:0.23.6
image: altinity/metrics-exporter:0.23.6
% kubectl get crd clickhouseinstallations.clickhouse.altinity.com -o yaml | grep "clickhouse.altinity.com/chop"
clickhouse.altinity.com/chop: 0.23.6
I0710 17:16:03.682328 1 service.go:86] CreateServiceCluster():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:foo/clickhouse-events
I0710 17:16:03.683552 1 worker-chi-reconciler.go:907] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Service: foo/clickhouse-events not found. err: service "clickhouse-events" not found
I0710 17:16:03.850464 1 deleter.go:322] deleteServiceIfExists():foo/clickhouse-events:Not Found Service: foo/clickhouse-events err: services "clickhouse-events" not found
I0710 17:16:04.074642 1 worker.go:1480] createService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:OK Create Service: foo/clickhouse-events
I0710 17:16:04.882621 1 worker-chi-reconciler.go:922] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Service reconcile successful: foo/clickhouse-events
I0710 17:16:17.088227 1 worker-chi-reconciler.go:900] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Service found: foo/clickhouse-events. Will try to update
E0710 17:16:17.088280 1 worker-chi-reconciler.go:914] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:Update Service: foo/clickhouse-events failed with error: just recreate the service in case of service type change 'LoadBalancer'=>'ClusterIP'
I0710 17:16:18.254178 1 deleter.go:329] deleteServiceIfExists():foo/clickhouse-events:OK delete Service: foo/clickhouse-events
E0710 17:16:18.453646 1 worker.go:1486] createService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:FAILED Create Service: foo/clickhouse-events err: object is being deleted: services "clickhouse-events" already exists
E0710 17:16:19.295985 1 worker-chi-reconciler.go:928] reconcileService():foo/events/e425f007-f928-4649-92ed-3fb1598f1915:FAILED to reconcile Service: foo/clickhouse-events CHI: events
I tried adding a value to status.hostsUnchanged (that field was the only change compared to the old CRD that was installed), and it made no difference. The CHOP is still constantly deleting the cluster service.
--- deploy/operatorhub/0.23.2/clickhouseinstallations.clickhouse.altinity.com.crd.yaml 2024-07-10 14:26:54
+++ deploy/operatorhub/0.23.6/clickhouseinstallations.clickhouse.altinity.com.crd.yaml 2024-07-10 14:26:54
@@ -4,14 +4,14 @@
 # SINGULAR=clickhouseinstallation
 # PLURAL=clickhouseinstallations
 # SHORT=chi
-# OPERATOR_VERSION=0.23.2
+# OPERATOR_VERSION=0.23.6
 #
 apiVersion: apiextensions.k8s.io/v1
 kind: CustomResourceDefinition
 metadata:
   name: clickhouseinstallations.clickhouse.altinity.com
   labels:
-    clickhouse.altinity.com/chop: 0.23.2
+    clickhouse.altinity.com/chop: 0.23.6
 spec:
   group: clickhouse.altinity.com
   scope: Namespaced
@@ -53,6 +53,11 @@
           type: string
           description: CHI status
           jsonPath: .status.status
+        - name: hosts-unchanged
+          type: integer
+          description: Unchanged hosts count
+          priority: 1 # show in wide view
+          jsonPath: .status.hostsUnchanged
         - name: hosts-updated
           type: integer
           description: Updated hosts count
@@ -172,6 +177,10 @@
                   nullable: true
                   items:
                     type: string
+                hostsUnchanged:
+                  type: integer
+                  minimum: 0
+                  description: "Unchanged Hosts count"
                 hostsUpdated:
                   type: integer
                   minimum: 0