clickhouse-operator
Re-Creating node from scratch does not copy tables for the Postgres and Kafka engines
We use your operator to manage a ClickHouse cluster. Thank you.
After a hardware failure we reset the PVC (and the ZooKeeper namespace) to re-create one ClickHouse node.
Most of the metadata, like views, materialized views, and tables with most engines (MergeTree, ReplicatedMergeTree, etc.), was successfully re-created on the node and replication started.
Meanwhile, none of the Postgres- and Kafka-engine tables were recreated. Is it a bug, or do we need to use some commands or hacks to sync all metadata across the cluster?
@Hubbitus, have you used the latest 0.23.6 or an earlier release?
@alex-zaitsev, thank you for the response.
That was in an older version. Now we have updated the operator. What is the correct way to re-init a node? Is it enough to just delete the PVC of the failed node and delete the pod?
@Hubbitus, if you want to re-init the existing node, delete the STS, PVC, and PV, and start a reconcile. Do you have multiple replicas?
@alex-zaitsev, thank you for the reply.
I understand how to delete the objects. But what do you mean by "start reconcile"?
I have two replicas, chi-gid-gid-0-0-0 and chi-gid-gid-0-1-0, and now chi-gid-gid-0-0-0 is malfunctioning. I want to re-init it from the data in chi-gid-gid-0-1-0. That should include syncing all of:
- metadata (all types of objects: MergeTree tables, Postgres and Kafka engines, materialized views, etc.)
- data (populated from replica 1)
- users and all permissions on the objects
@Hubbitus, we have released 0.23.7, which is more aggressive in re-creating the schema. So you may try to delete the PVC/PV completely and let it re-create the objects.
@alex-zaitsev, thank you very much! Eventually I got it updated for our cluster:
kub_dev get pods --all-namespaces -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].image}" -l app=clickhouse-operator
altinity/clickhouse-operator:0.23.7 altinity/metrics-exporter:0.23.7
And did the following in ArgoCD:
- Deleted PVC default-volume-claim-chi-gid-gid-0-0-0
- Deleted pod chi-gid-gid-0-0-0
Then the PVC was re-created.
I see the pod is up and running.
- But there are a lot of errors like:
2024.09.04 23:50:34.382651 [ 712 ] {} <Error> Access(user directories): from: 10.42.9.104, user: data_quality: Authentication failed: Code: 192. DB::Exception: There is no user data_quality in local_directory. (UNKNOWN_USER)....
So, users are not copied.
- Tables look like they are also not synced:
SELECT hostname() as node, COUNT(*)
FROM clusterAllReplicas('{cluster}', system.tables)
WHERE database NOT IN ('INFORMATION_SCHEMA', 'information_schema', 'system')
GROUP BY node
| node | count() |
|---|---|
| chi-gid-gid-0-1-0 | 620 |
There is also an error in the log like: 2024.09.04 23:52:49.039132 [ 714 ] {bb628508-db8e-4cf9-8307-a13133a185c9} <Error> PredefinedQueryHandler: Code: 60. DB::Exception: Table system.operator_compatible_metrics does not exist. (UNKNOWN_TABLE) - so even in the system database some tables are missing...
So, for the first node I see only the tables in information_schema.
Notes:
- Users are not replicated by the operator since it cannot access sensitive data (like passwords). Use CHI/XML user management or a replicated user directory.
<clickhouse>
<user_directories replace="replace">
<users_xml>
<path>/etc/clickhouse-server/users.xml</path>
</users_xml>
<replicated>
<zookeeper_path>/clickhouse/access/</zookeeper_path>
</replicated>
<local_directory>
<path>/var/lib/clickhouse/access/</path>
</local_directory>
</user_directories>
</clickhouse>
Note, the order is important. local_directory may be skipped if you are not using it, but keep it if there are already users defined with CREATE USER, otherwise they will disappear entirely.
- Tables in the system database are not replicated either, since it is assumed there are no user tables there.
Other tables should be re-created, so the operator log is needed to check what went wrong.
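If you manage the cluster through a CHI manifest (as with ArgoCD here), one way to deliver the user_directories snippet above is through the CHI itself. A sketch, assuming spec.configuration.files accepts a config.d/-prefixed file name (the file name user_directories.xml is an arbitrary choice) and reusing the XML shown above:
spec:
  configuration:
    files:
      config.d/user_directories.xml: |
        <clickhouse>
          <user_directories replace="replace">
            ...
          </user_directories>
        </clickhouse>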
The correct PVC recovery sequence is:
- Delete PVC (or PVC and STS)
- Run reconcile by adding a taskID to the CHI, for instance:
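A minimal sketch of such a CHI fragment (the taskID value here is just an example; any new string triggers a fresh reconcile):
spec:
  taskID: "reconcile-1"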
It looks like, since you deleted only the PVC and Pod, the recovery was handled by Kubernetes (STS), and the operator did not even know that the PVC had been recreated. So make sure you delete the STS as well. Also consider using operator-managed persistence:
spec:
defaults:
storageManagement:
provisioner: Operator
@alex-zaitsev, thank you very much for the answer. First I would like to recover my tables, then I will try to deal with the users.
Today I eventually received the rights to see the operator pod in the kube-system namespace. And just after the deletion of the PVC and pod I see errors in the clickhouse-operator pod:
I0921 22:13:23.555553 1 worker.go:275] processReconcilePod():gidplatform-dev/chi-gid-gid-0-0-0:Delete Pod. gidplatform-dev/chi-gid-gid-0-0-0
I0921 22:13:23.686901 1 worker.go:266] processReconcilePod():gidplatform-dev/chi-gid-gid-0-0-0:Add Pod. gidplatform-dev/chi-gid-gid-0-0-0
I0921 22:13:32.391425 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:32.391446 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
E0921 22:13:32.394908 1 connection.go:194] Exec():FAILED Exec(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) doRequest: transport failed to send a request to ClickHouse: dial tcp 10.42.9.84:8123: connect: connection refused for
SQL: SYSTEM DROP DNS CACHE
W0921 22:13:32.394938 1 retry.go:52] exec():chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:FAILED single try. No retries will be made for Applying sqls
I0921 22:13:32.414341 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:32.414363 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:32.415447 1 worker.go:387] gidplatform-dev/gid/b22b39fe-b7d8-40e3-a510-e169d1ffab18:updating endpoints for CHI-1 gid
I0921 22:13:32.450485 1 worker.go:389] gidplatform-dev/gid/b22b39fe-b7d8-40e3-a510-e169d1ffab18:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.84 10.42.5.92]
I0921 22:13:32.464127 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:32.464172 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:32.466517 1 worker.go:393] gidplatform-dev/gid/f2584b3a-a25a-4f22-8dfd-72f2a5166984:Update users IPS-1
I0921 22:13:32.481724 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/f2584b3a-a25a-4f22-8dfd-72f2a5166984:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0921 22:13:42.168333 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:42.168355 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:42.190633 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:42.190651 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:42.191751 1 worker.go:387] gidplatform-dev/gid/ef8a0da7-09d3-4890-9a59-c760233aedb5:updating endpoints for CHI-1 gid
I0921 22:13:42.215106 1 worker.go:389] gidplatform-dev/gid/ef8a0da7-09d3-4890-9a59-c760233aedb5:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.84 10.42.5.92]
I0921 22:13:42.224452 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:42.224470 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:42.225507 1 worker.go:393] gidplatform-dev/gid/d9105257-3cfe-4596-b3bf-0f6cd6935843:Update users IPS-1
I0921 22:13:42.235027 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/d9105257-3cfe-4596-b3bf-0f6cd6935843:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
Meanwhile, I have tried to reconcile the cluster by providing:
spec:
taskID: "click-reconcile-1"
Indeed, that looks like it triggered a reconcile. Logs of the operator pod:
kubectl -n kube-system logs --selector=app=clickhouse-operator --container=clickhouse-operator --tail=1000
I0929 11:54:59.076600 1 worker.go:574] ActionPlan start---------------------------------------------:
Diff start -------------------------
modified spec items num: 1
diff item [0]:'.TaskID' = '"click-reconcile-1"'
Diff end -------------------------
ActionPlan end---------------------------------------------
I0929 11:54:59.076655 1 worker-chi-reconciler.go:89] reconcileCHI():gidplatform-dev/gid/click-reconcile-1:ActionPlan has actions - continue reconcile
I0929 11:54:59.125555 1 worker.go:663] markReconcileStart():gidplatform-dev/gid/click-reconcile-1:reconcile started, task id: click-reconcile-1
I0929 11:54:59.681288 1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:0|host:0-0
I0929 11:54:59.681436 1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:1|host:0-1
I0929 11:54:59.681607 1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:54:59.859367 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:55:00.648852 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:55:01.284151 1 service.go:86] CreateServiceCluster():gidplatform-dev/gid/click-reconcile-1:gidplatform-dev/cluster-gid-gid
I0929 11:55:01.294688 1 worker-chi-reconciler.go:819] PDB updated: gidplatform-dev/gid-gid
I0929 11:55:01.294746 1 worker-chi-reconciler.go:554] not found ReconcileShardsAndHostsOptionsCtxKey, use empty opts
I0929 11:55:01.294769 1 worker-chi-reconciler.go:568] starting first shard separately
I0929 11:55:01.294967 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:01.305993 1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:01.306072 1 worker-chi-reconciler.go:684] reconcileHost():Reconcile Host start. Host: 0-0 ClickHouse version running: 24.2.1.2248
I0929 11:55:01.897135 1 worker.go:1565] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-0:cur and new objects are equal based on object version label. Update of the object is not required. Object: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:01.897345 1 worker.go:1001] worker.go:1001:excludeHost():start:exclude host start
I0929 11:55:02.047624 1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-0
I0929 11:55:02.047656 1 worker.go:1170] shouldExcludeHost():Host is the same, would not be updated, no need to exclude. Host/shard/cluster: 0/0/gid
I0929 11:55:02.047669 1 worker.go:1005] worker.go:1002:excludeHost():end:exclude host end
I0929 11:55:02.047693 1 worker.go:1020] worker.go:1020:completeQueries():start:complete queries start
I0929 11:55:02.047730 1 worker.go:1220] shouldWaitQueries():Will wait for queries to complete according to CHOp config 'reconcile.host.wait.queries' setting. Host is not yet in the cluster. Host/shard/cluster: 0/0/gid
I0929 11:55:02.047779 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:02.087023 1 poller.go:138] Poll():gidplatform-dev/0-0:OK gidplatform-dev/0-0
I0929 11:55:02.087048 1 worker.go:1024] worker.go:1021:completeQueries():end:complete queries end
I0929 11:55:02.248789 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-deploy-confd-gid-0-0
I0929 11:55:02.884163 1 worker-chi-reconciler.go:716] reconcileHost():Reconcile PVCs and check possible data loss for host: 0-0
I0929 11:55:03.458635 1 worker-chi-reconciler.go:406] worker-chi-reconciler.go:406:reconcileHostStatefulSet():start:reconcile StatefulSet start
I0929 11:55:03.458764 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:03.465752 1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:03.472628 1 worker-chi-reconciler.go:412] reconcileHostStatefulSet():Reconcile host: 0-0. ClickHouse version: 24.2.1.2248
I0929 11:55:03.651853 1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-0
I0929 11:55:03.651943 1 worker-chi-reconciler.go:425] reconcileHostStatefulSet():Reconcile host: 0-0. Reconcile StatefulSet
I0929 11:55:03.655273 1 worker.go:1565] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-0:cur and new objects are equal based on object version label. Update of the object is not required. Object: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:04.097497 1 worker-chi-reconciler.go:445] worker-chi-reconciler.go:407:reconcileHostStatefulSet():end:reconcile StatefulSet end
I0929 11:55:04.654273 1 worker-chi-reconciler.go:900] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service found: gidplatform-dev/chi-gid-gid-0-0. Will try to update
I0929 11:55:04.853666 1 worker.go:1459] updateService():gidplatform-dev/gid/click-reconcile-1:Update Service success: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:05.487521 1 worker-chi-reconciler.go:922] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service reconcile successful: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:05.487592 1 worker-chi-reconciler.go:461] reconcileHostService():DONE Reconcile service of the host: 0-0
I0929 11:55:05.487682 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:05.495665 1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:05.495739 1 poller.go:138] Poll():gidplatform-dev/0-0:OK gidplatform-dev/0-0
I0929 11:55:05.495824 1 worker-chi-reconciler.go:753] reconcileHost():Check host for ClickHouse availability before migrating tables. Host: 0-0 ClickHouse version running: 24.2.1.2248
I0929 11:55:05.495957 1 worker.go:908] migrateTables():No need to add tables on host 0 to shard 0 in cluster gid
I0929 11:55:05.496005 1 worker.go:1057] includeHost():Include into cluster host 0 shard 0 cluster gid
I0929 11:55:05.496048 1 worker.go:1124] includeHostIntoClickHouseCluster():going to include host 0 shard 0 cluster gid
I0929 11:55:05.496070 1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:55:05.648655 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:55:06.449496 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:06.463606 1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:06.463648 1 poller.go:138] Poll():gidplatform-dev/0-0:OK gidplatform-dev/0-0
I0929 11:55:06.463703 1 worker-chi-reconciler.go:776] reconcileHost():Reconcile Host completed. Host: 0-0 ClickHouse version running: 24.2.1.2248
I0929 11:55:07.086061 1 worker-chi-reconciler.go:797] reconcileHost():[now: 2024-09-29 11:55:07.085979541 +0000 UTC m=+530555.182385088] ProgressHostsCompleted: 1 of 2
I0929 11:55:08.084486 1 worker-chi-reconciler.go:900] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service found: gidplatform-dev/clickhouse-gid. Will try to update
I0929 11:55:08.253098 1 worker.go:1459] updateService():gidplatform-dev/gid/click-reconcile-1:Update Service success: gidplatform-dev/clickhouse-gid
I0929 11:55:08.883102 1 worker-chi-reconciler.go:922] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service reconcile successful: gidplatform-dev/clickhouse-gid
I0929 11:55:08.883295 1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:55:08.889935 1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:55:08.890015 1 worker-chi-reconciler.go:684] reconcileHost():Reconcile Host start. Host: 0-1 ClickHouse version running: 24.2.1.2248
I0929 11:55:09.524136 1 worker.go:1572] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-1:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: gidplatform-dev/chi-gid-gid-0-1
I0929 11:55:09.524219 1 worker.go:1001] worker.go:1001:excludeHost():start:exclude host start
I0929 11:55:09.647870 1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-1
I0929 11:55:09.647935 1 worker.go:1177] shouldExcludeHost():Host should be excluded. Host/shard/cluster: 1/0/gid
I0929 11:55:09.647982 1 worker.go:1010] excludeHost():Exclude from cluster host 1 shard 0 cluster gid
I0929 11:55:10.090456 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:55:10.090524 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:55:10.132801 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:55:10.132824 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:55:10.134283 1 worker.go:387] gidplatform-dev/gid/click-reconcile-1:updating endpoints for CHI-1 gid
I0929 11:55:10.256392 1 worker.go:1099] excludeHostFromClickHouseCluster():going to exclude host 1 shard 0 cluster gid
I0929 11:55:10.256420 1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:55:10.651725 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:55:10.847886 1 worker.go:389] gidplatform-dev/gid/click-reconcile-1:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.86 10.42.5.48]
I0929 11:55:10.859857 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:55:10.859903 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:55:10.862438 1 worker.go:393] gidplatform-dev/gid/click-reconcile-1:Update users IPS-1
I0929 11:55:11.249384 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:55:11.887237 1 worker.go:1203] shouldWaitExcludeHost():wait to exclude host fallback to operator's settings. host 1 shard 0 cluster gid
I0929 11:55:11.896425 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:16.902829 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:21.913913 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:26.921150 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:31.928701 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:36.936718 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:41.945459 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:46.954333 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:51.962841 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:56.971440 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:01.978083 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:06.984911 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:11.996098 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:11.996147 1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:17.002241 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:17.002279 1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:22.008717 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:22.008762 1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:27.015747 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:27.015810 1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:32.024632 1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:32.024713 1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:37.037036 1 schemer.go:137] IsHostInCluster():The host 0-1 is outside of the cluster
I0929 11:56:37.037107 1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:37.037132 1 worker.go:1015] worker.go:1002:excludeHost():end:exclude host end
I0929 11:56:37.037189 1 worker.go:1020] worker.go:1020:completeQueries():start:complete queries start
I0929 11:56:37.037281 1 worker.go:1220] shouldWaitQueries():Will wait for queries to complete according to CHOp config 'reconcile.host.wait.queries' setting. Host is not yet in the cluster. Host/shard/cluster: 1/0/gid
I0929 11:56:37.037353 1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:37.041809 1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:37.041827 1 worker.go:1024] worker.go:1021:completeQueries():end:complete queries end
I0929 11:56:37.048773 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-deploy-confd-gid-0-1
I0929 11:56:37.098510 1 worker-chi-reconciler.go:716] reconcileHost():Reconcile PVCs and check possible data loss for host: 0-1
I0929 11:56:37.119348 1 worker-chi-reconciler.go:406] worker-chi-reconciler.go:406:reconcileHostStatefulSet():start:reconcile StatefulSet start
I0929 11:56:37.119427 1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:37.123489 1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:56:37.127378 1 worker-chi-reconciler.go:412] reconcileHostStatefulSet():Reconcile host: 0-1. ClickHouse version: 24.2.1.2248
I0929 11:56:37.131620 1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-1
I0929 11:56:37.131650 1 worker-chi-reconciler.go:425] reconcileHostStatefulSet():Reconcile host: 0-1. Reconcile StatefulSet
I0929 11:56:37.133351 1 worker.go:1565] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-1:cur and new objects are equal based on object version label. Update of the object is not required. Object: gidplatform-dev/chi-gid-gid-0-1
I0929 11:56:37.168247 1 worker-chi-reconciler.go:445] worker-chi-reconciler.go:407:reconcileHostStatefulSet():end:reconcile StatefulSet end
I0929 11:56:37.653395 1 worker-chi-reconciler.go:900] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service found: gidplatform-dev/chi-gid-gid-0-1. Will try to update
I0929 11:56:37.849923 1 worker.go:1459] updateService():gidplatform-dev/gid/click-reconcile-1:Update Service success: gidplatform-dev/chi-gid-gid-0-1
I0929 11:56:38.491295 1 worker-chi-reconciler.go:922] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service reconcile successful: gidplatform-dev/chi-gid-gid-0-1
I0929 11:56:38.491349 1 worker-chi-reconciler.go:461] reconcileHostService():DONE Reconcile service of the host: 0-1
I0929 11:56:38.491418 1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:38.495556 1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:56:38.495593 1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:38.495629 1 worker-chi-reconciler.go:753] reconcileHost():Check host for ClickHouse availability before migrating tables. Host: 0-1 ClickHouse version running: 24.2.1.2248
I0929 11:56:38.495686 1 worker.go:908] migrateTables():No need to add tables on host 1 to shard 0 in cluster gid
I0929 11:56:38.495706 1 worker.go:1057] includeHost():Include into cluster host 1 shard 0 cluster gid
I0929 11:56:38.495726 1 worker.go:1124] includeHostIntoClickHouseCluster():going to include host 1 shard 0 cluster gid
I0929 11:56:38.495737 1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:56:38.654056 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:56:39.689499 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:39.689543 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:39.711932 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:39.711952 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:39.713061 1 worker.go:387] gidplatform-dev/gid/click-reconcile-1:updating endpoints for CHI-1 gid
I0929 11:56:39.851639 1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:39.853763 1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:56:39.853841 1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:39.853942 1 worker-chi-reconciler.go:776] reconcileHost():Reconcile Host completed. Host: 0-1 ClickHouse version running: 24.2.1.2248
I0929 11:56:40.449305 1 worker.go:389] gidplatform-dev/gid/click-reconcile-1:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.86 10.42.5.48]
I0929 11:56:40.460088 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:40.460129 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:40.462470 1 worker.go:393] gidplatform-dev/gid/click-reconcile-1:Update users IPS-1
I0929 11:56:40.849312 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:56:41.078096 1 worker-chi-reconciler.go:797] reconcileHost():[now: 2024-09-29 11:56:41.078003076 +0000 UTC m=+530649.174408624] ProgressHostsCompleted: 2 of 2
I0929 11:56:43.083018 1 worker-chi-reconciler.go:581] Starting rest of shards on workers: 1
I0929 11:56:43.249032 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:56:43.885956 1 worker-deleter.go:43] clean():gidplatform-dev/gid/click-reconcile-1:remove items scheduled for deletion
I0929 11:56:44.481307 1 worker-deleter.go:46] clean():gidplatform-dev/gid/click-reconcile-1:List of objects which have failed to reconcile:
I0929 11:56:44.481378 1 worker-deleter.go:47] clean():gidplatform-dev/gid/click-reconcile-1:List of successfully reconciled objects:
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-0-0
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-1-0
StatefulSet: gidplatform-dev/chi-gid-gid-0-1
StatefulSet: gidplatform-dev/chi-gid-gid-0-0
Service: gidplatform-dev/chi-gid-gid-0-0
Service: gidplatform-dev/clickhouse-gid
Service: gidplatform-dev/chi-gid-gid-0-1
ConfigMap: gidplatform-dev/chi-gid-common-configd
ConfigMap: gidplatform-dev/chi-gid-common-usersd
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-0
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-1
PDB: gidplatform-dev/gid-gid
I0929 11:56:45.252969 1 worker-deleter.go:50] clean():gidplatform-dev/gid/click-reconcile-1:Existing objects:
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-0-0
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-1-0
PDB: gidplatform-dev/gid-gid
StatefulSet: gidplatform-dev/chi-gid-gid-0-0
StatefulSet: gidplatform-dev/chi-gid-gid-0-1
ConfigMap: gidplatform-dev/chi-gid-common-configd
ConfigMap: gidplatform-dev/chi-gid-common-usersd
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-0
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-1
Service: gidplatform-dev/chi-gid-gid-0-0
Service: gidplatform-dev/chi-gid-gid-0-1
Service: gidplatform-dev/clickhouse-gid
I0929 11:56:45.253123 1 worker-deleter.go:52] clean():gidplatform-dev/gid/click-reconcile-1:Non-reconciled objects:
I0929 11:56:45.253195 1 worker-deleter.go:68] worker-deleter.go:68:dropReplicas():start:gidplatform-dev/gid/click-reconcile-1:drop replicas based on AP
I0929 11:56:45.253260 1 worker-deleter.go:80] worker-deleter.go:80:dropReplicas():end:gidplatform-dev/gid/click-reconcile-1:processed replicas: 0
I0929 11:56:45.253308 1 worker.go:640] addCHIToMonitoring():gidplatform-dev/gid/click-reconcile-1:add CHI to monitoring
I0929 11:56:45.885652 1 worker.go:595] worker.go:595:waitForIPAddresses():start:gidplatform-dev/gid/click-reconcile-1:wait for IP addresses to be assigned to all pods
I0929 11:56:45.893820 1 worker.go:600] gidplatform-dev/gid/click-reconcile-1:all IP addresses are in place
I0929 11:56:45.893858 1 worker.go:673] worker.go:673:finalizeReconcileAndMarkCompleted():start:gidplatform-dev/gid/click-reconcile-1:finalize reconcile
I0929 11:56:45.904253 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:45.904335 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:45.904391 1 controller.go:617] OK update watch (gidplatform-dev/gid): {"namespace":"gidplatform-dev","name":"gid","labels":{"argocd.argoproj.io/instance":"bi-clickhouse-dev","k8slens-edit-resource-version":"v1"},"annotations":{},"clusters":[{"name":"gid","hosts":[{"name":"0-0","hostname":"chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local","tcpPort":9000,"httpPort":8123},{"name":"0-1","hostname":"chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local","tcpPort":9000,"httpPort":8123}]}]}
I0929 11:56:45.906676 1 worker.go:677] gidplatform-dev/gid/click-reconcile-1:updating endpoints for CHI-2 gid
I0929 11:56:46.249853 1 worker.go:679] gidplatform-dev/gid/click-reconcile-1:IPs of the CHI-2 finalize reconcile gidplatform-dev/gid: len: 2 [10.42.9.86 10.42.5.48]
I0929 11:56:46.261380 1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:46.261442 1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:46.263792 1 worker.go:683] gidplatform-dev/gid/click-reconcile-1:Update users IPS-2
I0929 11:56:46.449574 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:56:47.495545 1 worker.go:707] finalizeReconcileAndMarkCompleted():gidplatform-dev/gid/click-reconcile-1:reconcile completed successfully, task id: click-reconcile-1
I0929 11:56:48.077981 1 worker-chi-reconciler.go:134] worker-chi-reconciler.go:60:reconcileCHI():end:gidplatform-dev/gid/click-reconcile-1
I0929 11:56:48.078036 1 worker.go:469] worker.go:432:updateCHI():end:gidplatform-dev/gid/click-reconcile-1
Not sure what is going wrong, but on host chi-gid-gid-0-0-0 not even the databases were copied. Only the single default database is still present.
@alex-zaitsev, could you please look on it?
I0929 11:55:05.495957 [worker.go:908] migrateTables():No need to add tables on host 0 to shard 0 in cluster gid
I0929 11:56:38.495686 [worker.go:908] migrateTables():No need to add tables on host 1 to shard 0 in cluster gid
@Hubbitus does your cluster have 2 shards with only 1 replica inside each shard?
Could you share:
kubectl get chi -n gidplatform-dev gid -o yaml
without sensitive credentials?
@Slach, thanks for the response. We do not use sharding yet.
Output of kubectl get chi -n gidplatform-dev gid -o yaml:
chi.yaml.gz
@Hubbitus
Could you share the result of the following clickhouse-client query?
SELECT database, table, engine_full, count() c FROM cluster('all-sharded',system.tables) WHERE database NOT IN ('system','INFORMATION_SCHEMA','information_schema') GROUP BY ALL HAVING c<2
Sure (limited to 10 rows, 269 total):
| database | table | engine_full | c |
|---|---|---|---|
| datamart | appmarket__public__widget | PostgreSQL(appmarket_db, table = 'widget', schema = 'public') | 1 |
| datamart | bonus__public__promotion | PostgreSQL(bonus_db, table = 'promotion', schema = 'public') | 1 |
| sandbox | gid_mt_sessions | ReplicatedMergeTree('/clickhouse/tables/ad0a75c4-1aa7-4386-a542-c16c19f2b2c6/{shard}', '{replica}') ORDER BY tsEvent SETTINGS index_granularity = 8192 | 1 |
| _source | scs__public__story__foreign | PostgreSQL(scs_db, table = 'story', schema = 'public') | 1 |
| datamart | loyalty__public__level | PostgreSQL(loyalty_db, table = 'level', schema = 'public') | 1 |
| _source | feed__public__reaction__foreign | PostgreSQL(feed_db, table = 'reaction', schema = 'public') | 1 |
| _loopback | nomail_account_register | ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') PARTITION BY tuple() ORDER BY time SETTINGS index_granularity = 8192 | 1 |
| _source | questionnaires__public__anketa_access_group__foreign | PostgreSQL(questionnaire_db, table = 'anketa_access_group', schema = 'public') | 1 |
| sandbox | gid_mt_activities | ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY dtEvent SETTINGS index_granularity = 8192 | 1 |
| _source | lms__public__lms_user_courses_progress__foreign | PostgreSQL(lms_db, table = 'lms_user_courses_progress', schema = 'public') | 1 |
You shared logs for 29 Sep 2024 starting at 11:55 UTC. Did your node lose the PVC data before or after this date?
Hello. The last shared logs were from 15 October, and that was after one more attempt to recover by deleting the PVC and STS.
@Hubbitus https://github.com/Altinity/clickhouse-operator/issues/1455#issuecomment-2381334294 contains logs only for 29 Sep 2024.
I don't see logs from 15 Oct 2024.
I need to ensure you tried to reconcile. After dropping the PVC and STS, did you change spec.taskID in the CHI manually to trigger a reconcile?
did you change spec.taskID in the CHI manually to trigger a reconcile after deleting the PVC and STS?
Yes. At the suggestion of @alex-zaitsev I introduced the taskID parameter there and increase the number on each clean-up attempt.
Please share the clickhouse-operator logs for 15 Oct related to your changes.
Hello.
I do not have logs that old.
But I have switched to a branch where taskID: "click-reconcile-3" is set. It looks like the reconcile started automatically.
Relevant part of output (slightly obfuscated)
kubectl -n kube-system logs --selector=app=clickhouse-operator --container=clickhouse-operator
operator.2024-11-02T17:47:14+03:00.obfuscated.log
Output of
SELECT database, table, engine_full, count() c, hostname()
FROM
cluster('{cluster}',system.tables)
WHERE
database NOT IN ('system','INFORMATION_SCHEMA','information_schema')
GROUP BY ALL
HAVING c<2
It contains 515 rows. The head of it:
| database | table | engine_full | c | hostname() |
|---|---|---|---|---|
| datamart | v_subs__public__channel_requests | | 1 | chi-gid-gid-0-1-0 |
| cdc | api__public__reaction | ReplicatedReplacingMergeTree('/clickhouse/{cluster}/cdc/tables/api__public__reaction/{shard}', '{replica}') PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192 | 1 | chi-gid-gid-0-1-0 |
| _source | bonus_to_gid__user_mappings | Kafka(kafka_integration, kafka_topic_list = 'dev__bonus_to_gid__user_mappings', kafka_group_name = 'dev__bonus_to_gid__user_mappings') SETTINGS format_avro_schema_registry_url = 'http://gid-integration-partner-kafka.gid.team:8081' | 1 | chi-gid-gid-0-1-0 |
| datamart | v_calendar__public__event_type | | 1 | chi-gid-gid-0-1-0 |
| _raw | api__public__questionnaire_result__dbt_materialized | ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS replicated_deduplication_window = 0, index_granularity = 8192 | 1 | chi-gid-gid-0-1-0 |
| datamart | v_jiradatabase__public__ao_54307e_slaauditlog | | 1 | chi-gid-gid-0-1-0 |
| datamart | v_appmarket__public__widget_notification | | 1 | chi-gid-gid-0-1-0 |
| _raw | feed__public__feed_comment__dbt_materialized | ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS replicated_deduplication_window = 0, index_granularity = 8192 | 1 | chi-gid-gid-0-1-0 |
| _raw | lms__public__lms_courses_chapters__dbt_materialized | ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS replicated_deduplication_window = 0, index_granularity = 8192 | 1 | chi-gid-gid-0-1-0 |
| datamart | tmp_gazprombonus_user_bonus_to_gid_mapping_inner | ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS index_granularity = 1024 | 1 | chi-gid-gid-0-1-0 |
| datamart | api__poll_vote | | 1 | chi-gid-gid-0-1-0 |
| _source | loyalty__public__achievement__foreign | PostgreSQL(loyalty_db, table = 'achievement', schema = 'public') | 1 | chi-gid-gid-0-1-0 |
| datamart | v_calendar__public__like | | 1 | chi-gid-gid-0-1-0 |
| _source | calendar__public__event_type__foreign | PostgreSQL(calendar_db, table = 'event_type', schema = 'public') | 1 | chi-gid-gid-0-1-0 |
According to the logs you just triggered a reconcile for 0-0-0 while the STS was not deleted.
Try:
kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
kubectl delete pvc -n gidplatform-dev -l clickhouse.altinity.com/cluster=gid,clickhouse.altinity.com/shard=0,clickhouse.altinity.com/replica=0
kubectl edit chi -n gidplatform-dev gid
edit spec.taskID to manual-4
Watch the reconcile process again; when the STS and PVC are not found during reconcile, the operator shall propagate the schema.
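To watch it, the log command used earlier in this thread can be reused with -f to follow the output (a suggestion, not a quote from the thread):
kubectl -n kube-system logs -f --selector=app=clickhouse-operator --container=clickhouse-operator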
Ok, thank you. Doing it again:
- Deleted STS and PVC:
$ kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
statefulset.apps "chi-gid-gid-0-0" deleted
$ kubectl delete pvc -n gidplatform-dev -l clickhouse.altinity.com/cluster=gid,clickhouse.altinity.com/shard=0,clickhouse.altinity.com/replica=0
persistentvolumeclaim "default-volume-claim-chi-gid-gid-0-0-0" deleted
- Pushed a commit with taskID: "click-reconcile-4". Ran sync in ArgoCD with prune.
I think the relevant logs are:
I1102 16:33:32.889891 1 worker-chi-reconciler.go:89] reconcileCHI():gidplatform-dev/gid/click-reconcile-4:ActionPlan has actions - continue reconcile
I1102 16:33:32.934904 1 worker.go:663] markReconcileStart():gidplatform-dev/gid/click-reconcile-4:reconcile started, task id: click-reconcile-4
I1102 16:33:33.446722 1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:0|host:0-0
I1102 16:33:33.446764 1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:1|host:0-1
I1102 16:33:33.446914 1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I1102 16:33:33.642314 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-4:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I1102 16:33:34.443758 1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-4:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I1102 16:33:35.086985 1 service.go:86] CreateServiceCluster():gidplatform-dev/gid/click-reconcile-4:gidplatform-dev/cluster-gid-gid
I1102 16:33:35.104209 1 worker-chi-reconciler.go:819] PDB updated: gidplatform-dev/gid-gid
I1102 16:33:35.104304 1 worker-chi-reconciler.go:554] not found ReconcileShardsAndHostsOptionsCtxKey, use empty opts
I1102 16:33:35.104344 1 worker-chi-reconciler.go:568] starting first shard separately
I1102 16:33:35.104638 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
E1102 16:33:35.112714 1 connection.go:145] QueryContext():FAILED Query(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) doRequest: transport failed to send a request to ClickHouse: dial tcp: lookup chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local on 10.43.0.10:53: no such host for SQL: SELECT version()
W1102 16:33:35.112777 1 cluster.go:91] QueryAny():FAILED to run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local] skip to next. err: doRequest: transport failed to send a request to ClickHouse: dial tcp: lookup chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local on 10.43.0.10:53: no such host
E1102 16:33:35.112846 1 cluster.go:95] QueryAny():FAILED to run query on all hosts [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
W1102 16:33:35.112926 1 worker-chi-reconciler.go:345] getHostClickHouseVersion():Failed to get ClickHouse version on host: 0-0
W1102 16:33:35.112980 1 worker-chi-reconciler.go:690] reconcileHost():Reconcile Host start. Host: 0-0 Failed to get ClickHouse version: failed to query
W1102 16:33:35.692945 1 worker.go:1537] gidplatform-dev/chi-gid-gid-0-0:No cur StatefulSet available but host has an ancestor. Found deleted StatefulSet. for gidplatform-dev/chi-gid-gid-0-0
So the operator can't resolve the hostname of the node:
doRequest: transport failed to send a request to ClickHouse: dial tcp: lookup chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local on 10.43.0.10:53: no such host for SQL: SELECT version()
Indeed, the hostname chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local looks strange; ClickHouse knows it under another name:
SELECT cluster, host_name
FROM system.clusters
WHERE cluster = 'gid'
| cluster | host_name |
|---|---|
| gid | chi-gid-gid-0-0 |
| gid | chi-gid-gid-0-1 |
You did not share the full logs, just the first error message you found. That error message is expected: because you deleted the STS, the Kubernetes service name will not resolve.
The hostname contains the SERVICE name, not the pod name.
Are the STS chi-gid-gid-0-0 and the PVC re-created?
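One quick way to check whether they were re-created (a sketch reusing the labels from the earlier delete command; assuming the StatefulSet carries the same host labels as the PVC):
kubectl get sts,pvc -n gidplatform-dev -l clickhouse.altinity.com/cluster=gid,clickhouse.altinity.com/shard=0,clickhouse.altinity.com/replica=0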
could you share full operator logs?
Sure. I found many more errors in the log: operator.2024-11-02T19:40:16+03:00.obfuscated.log
@sunsingerus, according to the shared logs,
the first reconcile was applied at 2024-11-02 14:44:41 while the STS + PVC still existed:
I1102 14:44:44.267254 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I1102 14:44:44.276002 1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I1102 14:44:44.276073 1 worker-chi-reconciler.go:684] reconcileHost():Reconcile Host start. Host: 0-0 ClickHouse version running: 24.2.1.2248
Second try, after the STS + PVC deletion:
I1102 16:30:39.627295 1 worker.go:275] processReconcilePod():gidplatform-dev/chi-gid-gid-0-0-0:Delete Pod. gidplatform-dev/chi-gid-gid-0-0-0
I1102 16:33:32.819212 1 controller.go:572] ENQUEUE new ReconcileCHI cmd=update for gidplatform-dev/gid
I1102 16:33:32.934904 1 worker.go:663] markReconcileStart():gidplatform-dev/gid/click-reconcile-4:reconcile started, task id: click-reconcile-4
The STS and PVC were deleted:
W1102 16:33:35.692945 1 worker.go:1537] gidplatform-dev/chi-gid-gid-0-0:No cur StatefulSet available but host has an ancestor. Found deleted StatefulSet. for gidplatform-dev/chi-gid-gid-0-0
I1102 16:33:35.839840 1 worker.go:1177] shouldExcludeHost():Host should be excluded. Host/shard/cluster: 0/0/gid
I1102 16:33:35.839914 1 worker.go:1010] excludeHost():Exclude from cluster host 0 shard 0 cluster gid
I1102 16:33:37.880392 1 worker-chi-reconciler.go:716] reconcileHost():Reconcile PVCs and check possible data loss for host: 0-0
PVC recreated
I1102 16:33:38.042697 1 worker-chi-reconciler.go:1251] PVC (gidplatform-dev/0-0/default-volume-claim/default-volume-claim-chi-gid-gid-0-0-0) not found and model will not be provided by the operator
W1102 16:33:38.042849 1 worker-chi-reconciler.go:1162] PVC is either newly added to the host or was lost earlier (gidplatform-dev/0-0/default-volume-claim/pvc-name-unknown-pvc-not-exist)
Migration is forced to be applied; start creating the StatefulSet:
I1102 16:33:38.043010 1 worker-chi-reconciler.go:730] reconcileHost():Data loss detected for host: 0-0. Will do force migrate
I1102 16:33:38.043073 1 worker-chi-reconciler.go:406] worker-chi-reconciler.go:406:reconcileHostStatefulSet():start:reconcile StatefulSet start
I1102 16:33:38.440090 1 worker.go:1596] createStatefulSet():Create StatefulSet gidplatform-dev/chi-gid-gid-0-0 - started
I1102 16:33:39.086823 1 creator.go:35] createStatefulSet()
I1102 16:33:39.086858 1 creator.go:44] Create StatefulSet gidplatform-dev/chi-gid-gid-0-0
I1102 16:34:09.634311 1 worker.go:1615] createStatefulSet():Create StatefulSet gidplatform-dev/chi-gid-gid-0-0 - completed
Preparing for table migration:
I1102 16:34:11.228817 1 worker-chi-reconciler.go:753] reconcileHost():Check host for ClickHouse availability before migrating tables. Host: 0-0 ClickHouse version running: 24.2.1.2248
Trying to drop data from ZK:
@sunsingerus
0-0-0 is empty and doesn't contain any table definitions,
so I think SYSTEM DROP REPLICA 'chi-gid-gid-0-0-0' will do nothing, and this is the root cause:
I1102 16:34:11.228960 1 schemer.go:56] HostDropReplica():Drop replica: chi-gid-gid-0-0 at 0-0
I1102 16:34:11.236511 1 worker-deleter.go:414] dropReplica():Drop replica host: 0-0 in cluster: gid
Get SQL object definitions
I1102 16:34:12.415941 1 replicated.go:35] shouldCreateReplicatedObjects():SchemaPolicy.Shard says we need replicated objects. Should create replicated objects for the shard: [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.416343 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.437761 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.717044 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.756658 1 distributed.go:39] shouldCreateDistributedObjects():Should create distributed objects in the cluster: [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.756850 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.786098 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:13.018954 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
Trying to restore, which fails because the ZK data is still present:
I1102 16:34:13.035413 1 schemer.go:98] HostCreateTables():Creating replicated objects at 0-0: [_loopback _raw service _source temp ....]
E1102 16:34:13.089822 1 connection.go:194] Exec():FAILED Exec(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) Code: 253, Message: Replica /clickhouse/tables/edf41bd4-46aa-4341-bed7-2e19b838e9e1/0/replicas/chi-gid-gid-0-0 already exists for SQL: CREATE TABLE IF NOT EXISTS _loopback.nomail_account_register UUID 'edf41bd4-46aa-4341-bed7-2e19b838e9e1' ....
I1102 16:34:13.089918 1 cluster.go:160] func1():chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:Replica is already in ZooKeeper. Trying ATTACH TABLE instead
We need to choose 0-1-0 for executing SYSTEM DROP REPLICA...
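A sketch of that, executed on the healthy replica (the replica name 'chi-gid-gid-0-0' is taken from the "already exists" error above; verify it in system.replicas before running, since the statement removes that replica's metadata from ZooKeeper for all tables known to the host it runs on):
kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SYSTEM DROP REPLICA 'chi-gid-gid-0-0'"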
@Hubbitus after reconcile most of the tables shall be restored (via ATTACH), but some tables are not restored because of a strange difference:
`slo_value` Decimal(15, 5) DEFAULT 0
in zookeeper and
`slo_value` Decimal(15, 5)
in local SQL
E1102 16:36:08.836812 1 connection.go:194] Exec():FAILED Exec(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) Code: 122, Message: Table columns structure in ZooKeeper is different from local table structure. Local columns:
columns format version: 1
14 columns:
...
Zookeeper columns:
columns format version: 1
14 columns:
...
for SQL: CREATE TABLE IF NOT EXISTS _raw.victoriametrics__slo__metrics__airflow_hour_agg_old UUID '46a1c218-9274-446d-9300-e644bbd4cc0e' ....
@Hubbitus could you share
kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"
and
kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"
@Slach, sure (column and table comments stripped):
$ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"
Row 1:
──────
statement: CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old
(
`slo_metric` LowCardinality(String),
`slo_service` LowCardinality(String),
`slo_namespace` LowCardinality(String),
`slo_status` LowCardinality(String),
`slo_method` LowCardinality(String),
`slo_uri` LowCardinality(String),
`slo_le` LowCardinality(String),
`slo_event_ts` DateTime64(6, 'UTC'),
`slo_orig_value` UInt64,
`slo_value` UInt32,
`slo_rec_num` UInt32,
`slo_tags` Map(LowCardinality(String), LowCardinality(String)),
`_row_hash_` UInt64 MATERIALIZED cityHash64(slo_metric, slo_service, slo_namespace, slo_status, slo_method, slo_uri, slo_event_ts, slo_value, slo_rec_num, slo_tags),
`__insert_ts` DateTime64(6, 'UTC') DEFAULT now64(6, 'UTC')
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
ORDER BY (slo_metric, slo_service, slo_namespace, slo_event_ts)
SETTINGS index_granularity = 8192
$ kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"
Received exception from server (version 24.2.1):
Code: 390. DB::Exception: Received from localhost:9000. DB::Exception: Table `victoriametrics__slo__metrics__airflow_hour_agg_old` doesn't exist. (CANNOT_GET_CREATE_TABLE_QUERY)
(query: SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical)
command terminated with exit code 134
The error looks reasonable: we got the error on table creation after deleting the STS and PVC, didn't we? Maybe some info was left in ZooKeeper?
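One way to check for leftovers directly from ClickHouse (a sketch; the path assumes the same /clickhouse/tables/{uuid}/{shard}/replicas layout and shard 0 seen in the error above, and system.zookeeper requires an exact path in WHERE):
SELECT name, ctime
FROM system.zookeeper
WHERE path = '/clickhouse/tables/46a1c218-9274-446d-9300-e644bbd4cc0e/0/replicas'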
Add SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1:
kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 FORMAT Vertical"
$ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1"
Row 1:
──────
statement: CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old UUID '46a1c218-9274-446d-9300-e644bbd4cc0e'
(
`slo_metric` LowCardinality(String),
`slo_service` LowCardinality(String),
`slo_namespace` LowCardinality(String),
`slo_status` LowCardinality(String),
`slo_method` LowCardinality(String),
`slo_uri` LowCardinality(String),
`slo_le` LowCardinality(String),
`slo_event_ts` DateTime64(6, 'UTC'),
`slo_orig_value` UInt64,
`slo_value` UInt32,
`slo_rec_num` UInt32,
`slo_tags` Map(LowCardinality(String), LowCardinality(String)),
`_row_hash_` UInt64 MATERIALIZED cityHash64(slo_metric, slo_service, slo_namespace, slo_status, slo_method, slo_uri, slo_event_ts, slo_value, slo_rec_num, slo_tags),
`__insert_ts` DateTime64(6, 'UTC') DEFAULT now64(6, 'UTC')
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
ORDER BY (slo_metric, slo_service, slo_namespace, slo_event_ts)
SETTINGS index_granularity = 8192