
Re-Creating node from scratch does not copy tables for the Postgres and Kafka engines

Hubbitus opened this issue 1 year ago • 56 comments

We use your operator to manage a ClickHouse cluster. Thank you.

After a hardware failure we reset the PVC (and the ZooKeeper namespace) to re-create one ClickHouse node.

Most of the metadata, like views, materialized views, and tables with most engines (MergeTree, ReplicatedMergeTree, etc.), was successfully re-created on the node, and replication started.

Meanwhile, none of the tables with Postgres- and Kafka-based engines were recreated. Is this a bug, or do we need to use some commands or hacks to sync all metadata across the cluster?

Hubbitus avatar Jul 12 '24 12:07 Hubbitus

@Hubbitus, have you used the latest 0.23.6 or an earlier release?

alex-zaitsev avatar Jul 18 '24 10:07 alex-zaitsev

@alex-zaitsev, thank you for the response.

That was on an older version. We have now updated the operator. What is the correct way to re-init a node? Is it enough to just delete the PVC of the failed node and delete the pod?

Hubbitus avatar Jul 24 '24 15:07 Hubbitus

@Hubbitus, if you want to re-init the existing node, delete the STS, PVC, and PV, and start a reconcile. Do you have multiple replicas?
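
For illustration, a minimal sketch of that sequence, assuming the namespace and object names that appear later in this thread:

# delete the StatefulSet and the PVC of the failed replica
kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
kubectl delete pvc -n gidplatform-dev default-volume-claim-chi-gid-gid-0-0-0
# the PV is removed automatically if its reclaim policy is Delete; otherwise delete it explicitly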

alex-zaitsev avatar Jul 30 '24 10:07 alex-zaitsev

@alex-zaitsev, thank you for the reply.

I understand how to delete the objects. But what do you mean by "start reconcile"?

I have two replicas, chi-gid-gid-0-0-0 and chi-gid-gid-0-1-0, and chi-gid-gid-0-0-0 is now malfunctioning. I want to re-init it from the data in chi-gid-gid-0-1-0, and that should include syncing all of:

  • metadata (all types of objects: MergeTree tables, Postgres and Kafka engines, materialized views, etc.)
  • data, populated from replica 1
  • users and all permissions on the objects

Hubbitus avatar Jul 31 '24 12:07 Hubbitus

@Hubbitus, we have released 0.23.7, which is more aggressive about re-creating the schema. So you may try to delete the PVC/PV completely and let it re-create the objects.

alex-zaitsev avatar Aug 15 '24 09:08 alex-zaitsev

@alex-zaitsev, thank you very much! Eventually I got it updated for our cluster:

kub_dev get pods --all-namespaces -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].image}" -l app=clickhouse-operator                                                                                                     
altinity/clickhouse-operator:0.23.7 altinity/metrics-exporter:0.23.7

And then did the following in ArgoCD:

  • Deleted PVC default-volume-claim-chi-gid-gid-0-0-0
  • Deleted pod chi-gid-gid-0-0-0

Then the PVC was re-created.

I see the pod is up and running.

  1. But there are a lot of errors like 2024.09.04 23:50:34.382651 [ 712 ] {} <Error> Access(user directories): from: 10.42.9.104, user: data_quality: Authentication failed: Code: 192. DB::Exception: There is no user data_quality in local_directory. (UNKNOWN_USER).... So, users were not copied.
  2. Tables also do not look synced:
SELECT hostname() as node, COUNT(*)
FROM clusterAllReplicas('{cluster}', system.tables)
WHERE database NOT IN ('INFORMATION_SCHEMA', 'information_schema', 'system')
GROUP BY node
node count()
chi-gid-gid-0-1-0 620

And there is also an error in the log like: 2024.09.04 23:52:49.039132 [ 714 ] {bb628508-db8e-4cf9-8307-a13133a185c9} <Error> PredefinedQueryHandler: Code: 60. DB::Exception: Table system.operator_compatible_metrics does not exist. (UNKNOWN_TABLE), so even in the system database some tables are missing...

So, on the first node I see only the tables in information_schema.
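
To confirm the user mismatch from point 1, a quick sketch (the namespace is taken from later in this thread; SHOW USERS lists the accounts visible on each server):

kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SHOW USERS"
kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW USERS"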

Hubbitus avatar Sep 04 '24 23:09 Hubbitus

Notes:

  1. Users are not replicated by the operator, since it cannot access sensitive data (like passwords). Use CHI/XML user management or a replicated user directory (a sketch follows these notes).
<clickhouse>
  <user_directories replace="replace">
    <users_xml>
      <path>/etc/clickhouse-server/users.xml</path>
    </users_xml>
    <replicated>
      <zookeeper_path>/clickhouse/access/</zookeeper_path>
    </replicated>
    <local_directory>
       <path>/var/lib/clickhouse/access/</path>
    </local_directory>
  </user_directories>
</clickhouse>

Note that the order is important. local_directory may be skipped if you are not using it, but keep it if there are already users defined with CREATE USER, otherwise they will disappear entirely.

  2. Tables in the system database are not replicated either, since it is assumed there are no user tables in there.
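
As an illustrative sketch of the CHI-managed option from note 1: define the user in spec.configuration.users, and the operator renders it into users.xml on every node. The user name data_quality is taken from the error above; the hash value is a placeholder:

# password_sha256_hex keeps the plain-text password out of the CHI manifest
kubectl patch chi gid -n gidplatform-dev --type merge -p '{
  "spec": {"configuration": {"users": {
    "data_quality/password_sha256_hex": "<sha256 hex of the password>",
    "data_quality/networks/ip": "::/0",
    "data_quality/profile": "default"
  }}}
}'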

Others should work, so the operator log is needed to check what went wrong.

The correct PVC recovery sequence is:

  1. Delete PVC (or PVC and STS)
  2. Run reconcile by adding a taskID to the CHI, for instance:
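
A minimal sketch of that trigger, assuming the CHI name and namespace used in this thread; the taskID value itself is arbitrary, it only has to differ from the previous one:

# any new taskID value makes the operator compute an action plan and reconcile
kubectl patch chi gid -n gidplatform-dev --type merge -p '{"spec":{"taskID":"manual-reconcile-1"}}'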

It looks like, since you deleted the PVC and Pod, the recovery was handled by Kubernetes (the STS), and the operator did not even know the PVC had been recreated. So make sure you delete the STS as well. Also consider using operator-managed persistence:

spec:
  defaults:
    storageManagement:
      provisioner: Operator

alex-zaitsev avatar Sep 20 '24 07:09 alex-zaitsev

@alex-zaitsev, thank you very much for the answer. First I would like to recover my tables, then I will deal with the users.

Today I finally received the rights to see the operator pod in the kube-system namespace. Right after deleting the PVC and pod, I see errors in the clickhouse-operator pod:

I0921 22:13:23.555553       1 worker.go:275] processReconcilePod():gidplatform-dev/chi-gid-gid-0-0-0:Delete Pod. gidplatform-dev/chi-gid-gid-0-0-0
I0921 22:13:23.686901       1 worker.go:266] processReconcilePod():gidplatform-dev/chi-gid-gid-0-0-0:Add Pod. gidplatform-dev/chi-gid-gid-0-0-0
I0921 22:13:32.391425       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:32.391446       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
E0921 22:13:32.394908       1 connection.go:194] Exec():FAILED Exec(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) doRequest: transport failed to send a request to ClickHouse: dial tcp 10.42.9.84:8123: connect: connection refused for
SQL: SYSTEM DROP DNS CACHE
W0921 22:13:32.394938       1 retry.go:52] exec():chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:FAILED single try. No retries will be made for Applying sqls
I0921 22:13:32.414341       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:32.414363       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:32.415447       1 worker.go:387] gidplatform-dev/gid/b22b39fe-b7d8-40e3-a510-e169d1ffab18:updating endpoints for CHI-1 gid
I0921 22:13:32.450485       1 worker.go:389] gidplatform-dev/gid/b22b39fe-b7d8-40e3-a510-e169d1ffab18:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.84 10.42.5.92]
I0921 22:13:32.464127       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:32.464172       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:32.466517       1 worker.go:393] gidplatform-dev/gid/f2584b3a-a25a-4f22-8dfd-72f2a5166984:Update users IPS-1
I0921 22:13:32.481724       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/f2584b3a-a25a-4f22-8dfd-72f2a5166984:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0921 22:13:42.168333       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:42.168355       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:42.190633       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:42.190651       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:42.191751       1 worker.go:387] gidplatform-dev/gid/ef8a0da7-09d3-4890-9a59-c760233aedb5:updating endpoints for CHI-1 gid
I0921 22:13:42.215106       1 worker.go:389] gidplatform-dev/gid/ef8a0da7-09d3-4890-9a59-c760233aedb5:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.84 10.42.5.92]
I0921 22:13:42.224452       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid:Found applicable templates num: 0
I0921 22:13:42.224470       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid:Applied templates num: 0
I0921 22:13:42.225507       1 worker.go:393] gidplatform-dev/gid/d9105257-3cfe-4596-b3bf-0f6cd6935843:Update users IPS-1
I0921 22:13:42.235027       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/d9105257-3cfe-4596-b3bf-0f6cd6935843:Update ConfigMap gidplatform-dev/chi-gid-common-usersd

Hubbitus avatar Sep 21 '24 22:09 Hubbitus

In the meantime, I have tried to reconcile the cluster by providing:

spec:
  taskID: "click-reconcile-1"

Indeed, that looks like it triggered a reconcile. Logs of the operator pod:

kubectl -n kube-system logs --selector=app=clickhouse-operator --container=clickhouse-operator --tail=1000
I0929 11:54:59.076600       1 worker.go:574] ActionPlan start---------------------------------------------:
Diff start -------------------------
modified spec items num: 1
diff item [0]:'.TaskID' = '"click-reconcile-1"'
Diff end -------------------------

ActionPlan end---------------------------------------------
I0929 11:54:59.076655       1 worker-chi-reconciler.go:89] reconcileCHI():gidplatform-dev/gid/click-reconcile-1:ActionPlan has actions - continue reconcile
I0929 11:54:59.125555       1 worker.go:663] markReconcileStart():gidplatform-dev/gid/click-reconcile-1:reconcile started, task id: click-reconcile-1
I0929 11:54:59.681288       1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:0|host:0-0
I0929 11:54:59.681436       1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:1|host:0-1
I0929 11:54:59.681607       1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:54:59.859367       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:55:00.648852       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:55:01.284151       1 service.go:86] CreateServiceCluster():gidplatform-dev/gid/click-reconcile-1:gidplatform-dev/cluster-gid-gid
I0929 11:55:01.294688       1 worker-chi-reconciler.go:819] PDB updated: gidplatform-dev/gid-gid
I0929 11:55:01.294746       1 worker-chi-reconciler.go:554] not found ReconcileShardsAndHostsOptionsCtxKey, use empty opts
I0929 11:55:01.294769       1 worker-chi-reconciler.go:568] starting first shard separately
I0929 11:55:01.294967       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:01.305993       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:01.306072       1 worker-chi-reconciler.go:684] reconcileHost():Reconcile Host start. Host: 0-0 ClickHouse version running: 24.2.1.2248
I0929 11:55:01.897135       1 worker.go:1565] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-0:cur and new objects are equal based on object version label. Update of the object is not required. Object: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:01.897345       1 worker.go:1001] worker.go:1001:excludeHost():start:exclude host start
I0929 11:55:02.047624       1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-0
I0929 11:55:02.047656       1 worker.go:1170] shouldExcludeHost():Host is the same, would not be updated, no need to exclude. Host/shard/cluster: 0/0/gid
I0929 11:55:02.047669       1 worker.go:1005] worker.go:1002:excludeHost():end:exclude host end
I0929 11:55:02.047693       1 worker.go:1020] worker.go:1020:completeQueries():start:complete queries start
I0929 11:55:02.047730       1 worker.go:1220] shouldWaitQueries():Will wait for queries to complete according to CHOp config 'reconcile.host.wait.queries' setting. Host is not yet in the cluster. Host/shard/cluster: 0/0/gid
I0929 11:55:02.047779       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:02.087023       1 poller.go:138] Poll():gidplatform-dev/0-0:OK gidplatform-dev/0-0
I0929 11:55:02.087048       1 worker.go:1024] worker.go:1021:completeQueries():end:complete queries end
I0929 11:55:02.248789       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-deploy-confd-gid-0-0
I0929 11:55:02.884163       1 worker-chi-reconciler.go:716] reconcileHost():Reconcile PVCs and check possible data loss for host: 0-0
I0929 11:55:03.458635       1 worker-chi-reconciler.go:406] worker-chi-reconciler.go:406:reconcileHostStatefulSet():start:reconcile StatefulSet start
I0929 11:55:03.458764       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:03.465752       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:03.472628       1 worker-chi-reconciler.go:412] reconcileHostStatefulSet():Reconcile host: 0-0. ClickHouse version: 24.2.1.2248
I0929 11:55:03.651853       1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-0
I0929 11:55:03.651943       1 worker-chi-reconciler.go:425] reconcileHostStatefulSet():Reconcile host: 0-0. Reconcile StatefulSet
I0929 11:55:03.655273       1 worker.go:1565] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-0:cur and new objects are equal based on object version label. Update of the object is not required. Object: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:04.097497       1 worker-chi-reconciler.go:445] worker-chi-reconciler.go:407:reconcileHostStatefulSet():end:reconcile StatefulSet end
I0929 11:55:04.654273       1 worker-chi-reconciler.go:900] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service found: gidplatform-dev/chi-gid-gid-0-0. Will try to update
I0929 11:55:04.853666       1 worker.go:1459] updateService():gidplatform-dev/gid/click-reconcile-1:Update Service success: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:05.487521       1 worker-chi-reconciler.go:922] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service reconcile successful: gidplatform-dev/chi-gid-gid-0-0
I0929 11:55:05.487592       1 worker-chi-reconciler.go:461] reconcileHostService():DONE Reconcile service of the host: 0-0
I0929 11:55:05.487682       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:05.495665       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:05.495739       1 poller.go:138] Poll():gidplatform-dev/0-0:OK gidplatform-dev/0-0
I0929 11:55:05.495824       1 worker-chi-reconciler.go:753] reconcileHost():Check host for ClickHouse availability before migrating tables. Host: 0-0 ClickHouse version running: 24.2.1.2248
I0929 11:55:05.495957       1 worker.go:908] migrateTables():No need to add tables on host 0 to shard 0 in cluster gid
I0929 11:55:05.496005       1 worker.go:1057] includeHost():Include into cluster host 0 shard 0 cluster gid
I0929 11:55:05.496048       1 worker.go:1124] includeHostIntoClickHouseCluster():going to include host 0 shard 0 cluster gid
I0929 11:55:05.496070       1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:55:05.648655       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:55:06.449496       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I0929 11:55:06.463606       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I0929 11:55:06.463648       1 poller.go:138] Poll():gidplatform-dev/0-0:OK gidplatform-dev/0-0
I0929 11:55:06.463703       1 worker-chi-reconciler.go:776] reconcileHost():Reconcile Host completed. Host: 0-0 ClickHouse version running: 24.2.1.2248
I0929 11:55:07.086061       1 worker-chi-reconciler.go:797] reconcileHost():[now: 2024-09-29 11:55:07.085979541 +0000 UTC m=+530555.182385088] ProgressHostsCompleted: 1 of 2
I0929 11:55:08.084486       1 worker-chi-reconciler.go:900] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service found: gidplatform-dev/clickhouse-gid. Will try to update
I0929 11:55:08.253098       1 worker.go:1459] updateService():gidplatform-dev/gid/click-reconcile-1:Update Service success: gidplatform-dev/clickhouse-gid
I0929 11:55:08.883102       1 worker-chi-reconciler.go:922] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service reconcile successful: gidplatform-dev/clickhouse-gid
I0929 11:55:08.883295       1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:55:08.889935       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:55:08.890015       1 worker-chi-reconciler.go:684] reconcileHost():Reconcile Host start. Host: 0-1 ClickHouse version running: 24.2.1.2248
I0929 11:55:09.524136       1 worker.go:1572] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-1:cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: gidplatform-dev/chi-gid-gid-0-1
I0929 11:55:09.524219       1 worker.go:1001] worker.go:1001:excludeHost():start:exclude host start
I0929 11:55:09.647870       1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-1
I0929 11:55:09.647935       1 worker.go:1177] shouldExcludeHost():Host should be excluded. Host/shard/cluster: 1/0/gid
I0929 11:55:09.647982       1 worker.go:1010] excludeHost():Exclude from cluster host 1 shard 0 cluster gid
I0929 11:55:10.090456       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:55:10.090524       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:55:10.132801       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:55:10.132824       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:55:10.134283       1 worker.go:387] gidplatform-dev/gid/click-reconcile-1:updating endpoints for CHI-1 gid
I0929 11:55:10.256392       1 worker.go:1099] excludeHostFromClickHouseCluster():going to exclude host 1 shard 0 cluster gid
I0929 11:55:10.256420       1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:55:10.651725       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:55:10.847886       1 worker.go:389] gidplatform-dev/gid/click-reconcile-1:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.86 10.42.5.48]
I0929 11:55:10.859857       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:55:10.859903       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:55:10.862438       1 worker.go:393] gidplatform-dev/gid/click-reconcile-1:Update users IPS-1
I0929 11:55:11.249384       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:55:11.887237       1 worker.go:1203] shouldWaitExcludeHost():wait to exclude host fallback to operator's settings. host 1 shard 0 cluster gid
I0929 11:55:11.896425       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:16.902829       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:21.913913       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:26.921150       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:31.928701       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:36.936718       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:41.945459       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:46.954333       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:51.962841       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:55:56.971440       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:01.978083       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:06.984911       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:11.996098       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:11.996147       1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:17.002241       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:17.002279       1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:22.008717       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:22.008762       1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:27.015747       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:27.015810       1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:32.024632       1 schemer.go:134] IsHostInCluster():The host 0-1 is inside the cluster
I0929 11:56:32.024713       1 poller.go:170] Poll():gidplatform-dev/0-1:WAIT:gidplatform-dev/0-1
I0929 11:56:37.037036       1 schemer.go:137] IsHostInCluster():The host 0-1 is outside of the cluster
I0929 11:56:37.037107       1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:37.037132       1 worker.go:1015] worker.go:1002:excludeHost():end:exclude host end
I0929 11:56:37.037189       1 worker.go:1020] worker.go:1020:completeQueries():start:complete queries start
I0929 11:56:37.037281       1 worker.go:1220] shouldWaitQueries():Will wait for queries to complete according to CHOp config 'reconcile.host.wait.queries' setting. Host is not yet in the cluster. Host/shard/cluster: 1/0/gid
I0929 11:56:37.037353       1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:37.041809       1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:37.041827       1 worker.go:1024] worker.go:1021:completeQueries():end:complete queries end
I0929 11:56:37.048773       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-deploy-confd-gid-0-1
I0929 11:56:37.098510       1 worker-chi-reconciler.go:716] reconcileHost():Reconcile PVCs and check possible data loss for host: 0-1
I0929 11:56:37.119348       1 worker-chi-reconciler.go:406] worker-chi-reconciler.go:406:reconcileHostStatefulSet():start:reconcile StatefulSet start
I0929 11:56:37.119427       1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:37.123489       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:56:37.127378       1 worker-chi-reconciler.go:412] reconcileHostStatefulSet():Reconcile host: 0-1. ClickHouse version: 24.2.1.2248
I0929 11:56:37.131620       1 worker.go:159] shouldForceRestartHost():Host restart is not required. Host: 0-1
I0929 11:56:37.131650       1 worker-chi-reconciler.go:425] reconcileHostStatefulSet():Reconcile host: 0-1. Reconcile StatefulSet
I0929 11:56:37.133351       1 worker.go:1565] getObjectStatusFromMetas():gidplatform-dev/chi-gid-gid-0-1:cur and new objects are equal based on object version label. Update of the object is not required. Object: gidplatform-dev/chi-gid-gid-0-1
I0929 11:56:37.168247       1 worker-chi-reconciler.go:445] worker-chi-reconciler.go:407:reconcileHostStatefulSet():end:reconcile StatefulSet end
I0929 11:56:37.653395       1 worker-chi-reconciler.go:900] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service found: gidplatform-dev/chi-gid-gid-0-1. Will try to update
I0929 11:56:37.849923       1 worker.go:1459] updateService():gidplatform-dev/gid/click-reconcile-1:Update Service success: gidplatform-dev/chi-gid-gid-0-1
I0929 11:56:38.491295       1 worker-chi-reconciler.go:922] reconcileService():gidplatform-dev/gid/click-reconcile-1:Service reconcile successful: gidplatform-dev/chi-gid-gid-0-1
I0929 11:56:38.491349       1 worker-chi-reconciler.go:461] reconcileHostService():DONE Reconcile service of the host: 0-1
I0929 11:56:38.491418       1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:38.495556       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:56:38.495593       1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:38.495629       1 worker-chi-reconciler.go:753] reconcileHost():Check host for ClickHouse availability before migrating tables. Host: 0-1 ClickHouse version running: 24.2.1.2248
I0929 11:56:38.495686       1 worker.go:908] migrateTables():No need to add tables on host 1 to shard 0 in cluster gid
I0929 11:56:38.495706       1 worker.go:1057] includeHost():Include into cluster host 1 shard 0 cluster gid
I0929 11:56:38.495726       1 worker.go:1124] includeHostIntoClickHouseCluster():going to include host 1 shard 0 cluster gid
I0929 11:56:38.495737       1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I0929 11:56:38.654056       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:56:39.689499       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:39.689543       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:39.711932       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:39.711952       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:39.713061       1 worker.go:387] gidplatform-dev/gid/click-reconcile-1:updating endpoints for CHI-1 gid
I0929 11:56:39.851639       1 cluster.go:84] Run query on: chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I0929 11:56:39.853763       1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-1 version: 24.2.1.2248
I0929 11:56:39.853841       1 poller.go:138] Poll():gidplatform-dev/0-1:OK gidplatform-dev/0-1
I0929 11:56:39.853942       1 worker-chi-reconciler.go:776] reconcileHost():Reconcile Host completed. Host: 0-1 ClickHouse version running: 24.2.1.2248
I0929 11:56:40.449305       1 worker.go:389] gidplatform-dev/gid/click-reconcile-1:IPs of the CHI-1 update endpoints gidplatform-dev/gid: len: 2 [10.42.9.86 10.42.5.48]
I0929 11:56:40.460088       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:40.460129       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:40.462470       1 worker.go:393] gidplatform-dev/gid/click-reconcile-1:Update users IPS-1
I0929 11:56:40.849312       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:56:41.078096       1 worker-chi-reconciler.go:797] reconcileHost():[now: 2024-09-29 11:56:41.078003076 +0000 UTC m=+530649.174408624] ProgressHostsCompleted: 2 of 2
I0929 11:56:43.083018       1 worker-chi-reconciler.go:581] Starting rest of shards on workers: 1
I0929 11:56:43.249032       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I0929 11:56:43.885956       1 worker-deleter.go:43] clean():gidplatform-dev/gid/click-reconcile-1:remove items scheduled for deletion
I0929 11:56:44.481307       1 worker-deleter.go:46] clean():gidplatform-dev/gid/click-reconcile-1:List of objects which have failed to reconcile:
I0929 11:56:44.481378       1 worker-deleter.go:47] clean():gidplatform-dev/gid/click-reconcile-1:List of successfully reconciled objects:
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-0-0
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-1-0
StatefulSet: gidplatform-dev/chi-gid-gid-0-1
StatefulSet: gidplatform-dev/chi-gid-gid-0-0
Service: gidplatform-dev/chi-gid-gid-0-0
Service: gidplatform-dev/clickhouse-gid
Service: gidplatform-dev/chi-gid-gid-0-1
ConfigMap: gidplatform-dev/chi-gid-common-configd
ConfigMap: gidplatform-dev/chi-gid-common-usersd
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-0
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-1
PDB: gidplatform-dev/gid-gid
I0929 11:56:45.252969       1 worker-deleter.go:50] clean():gidplatform-dev/gid/click-reconcile-1:Existing objects:
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-0-0
PVC: gidplatform-dev/default-volume-claim-chi-gid-gid-0-1-0
PDB: gidplatform-dev/gid-gid
StatefulSet: gidplatform-dev/chi-gid-gid-0-0
StatefulSet: gidplatform-dev/chi-gid-gid-0-1
ConfigMap: gidplatform-dev/chi-gid-common-configd
ConfigMap: gidplatform-dev/chi-gid-common-usersd
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-0
ConfigMap: gidplatform-dev/chi-gid-deploy-confd-gid-0-1
Service: gidplatform-dev/chi-gid-gid-0-0
Service: gidplatform-dev/chi-gid-gid-0-1
Service: gidplatform-dev/clickhouse-gid
I0929 11:56:45.253123       1 worker-deleter.go:52] clean():gidplatform-dev/gid/click-reconcile-1:Non-reconciled objects:
I0929 11:56:45.253195       1 worker-deleter.go:68] worker-deleter.go:68:dropReplicas():start:gidplatform-dev/gid/click-reconcile-1:drop replicas based on AP
I0929 11:56:45.253260       1 worker-deleter.go:80] worker-deleter.go:80:dropReplicas():end:gidplatform-dev/gid/click-reconcile-1:processed replicas: 0
I0929 11:56:45.253308       1 worker.go:640] addCHIToMonitoring():gidplatform-dev/gid/click-reconcile-1:add CHI to monitoring
I0929 11:56:45.885652       1 worker.go:595] worker.go:595:waitForIPAddresses():start:gidplatform-dev/gid/click-reconcile-1:wait for IP addresses to be assigned to all pods
I0929 11:56:45.893820       1 worker.go:600] gidplatform-dev/gid/click-reconcile-1:all IP addresses are in place
I0929 11:56:45.893858       1 worker.go:673] worker.go:673:finalizeReconcileAndMarkCompleted():start:gidplatform-dev/gid/click-reconcile-1:finalize reconcile
I0929 11:56:45.904253       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:45.904335       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:45.904391       1 controller.go:617] OK update watch (gidplatform-dev/gid): {"namespace":"gidplatform-dev","name":"gid","labels":{"argocd.argoproj.io/instance":"bi-clickhouse-dev","k8slens-edit-resource-version":"v1"},"annotations":{},"clusters":[{"name":"gid","hosts":[{"name":"0-0","hostname":"chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local","tcpPort":9000,"httpPort":8123},{"name":"0-1","hostname":"chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local","tcpPort":9000,"httpPort":8123}]}]}
I0929 11:56:45.906676       1 worker.go:677] gidplatform-dev/gid/click-reconcile-1:updating endpoints for CHI-2 gid
I0929 11:56:46.249853       1 worker.go:679] gidplatform-dev/gid/click-reconcile-1:IPs of the CHI-2 finalize reconcile gidplatform-dev/gid: len: 2 [10.42.9.86 10.42.5.48]
I0929 11:56:46.261380       1 chi.go:38] prepareListOfTemplates():gidplatform-dev/gid/click-reconcile-1:Found applicable templates num: 0
I0929 11:56:46.261442       1 chi.go:82] ApplyCHITemplates():gidplatform-dev/gid/click-reconcile-1:Applied templates num: 0
I0929 11:56:46.263792       1 worker.go:683] gidplatform-dev/gid/click-reconcile-1:Update users IPS-2
I0929 11:56:46.449574       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-1:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I0929 11:56:47.495545       1 worker.go:707] finalizeReconcileAndMarkCompleted():gidplatform-dev/gid/click-reconcile-1:reconcile completed successfully, task id: click-reconcile-1
I0929 11:56:48.077981       1 worker-chi-reconciler.go:134] worker-chi-reconciler.go:60:reconcileCHI():end:gidplatform-dev/gid/click-reconcile-1
I0929 11:56:48.078036       1 worker.go:469] worker.go:432:updateCHI():end:gidplatform-dev/gid/click-reconcile-1

Not sure what is going wrong, but on host chi-gid-gid-0-0-0 not even the databases were copied; only the single default database is still present.

Hubbitus avatar Sep 29 '24 12:09 Hubbitus

@alex-zaitsev, could you please take a look at this?

Hubbitus avatar Oct 13 '24 20:10 Hubbitus

I0929 11:55:05.495957 [worker.go:908] migrateTables():No need to add tables on host 0 to shard 0 in cluster gid

I0929 11:56:38.495686 [worker.go:908] migrateTables():No need to add tables on host 1 to shard 0 in cluster gid

@Hubbitus, does your cluster have 2 shards with only 1 replica inside each shard?

Could you share kubectl get chi -n gidplatform-dev gid -o yaml, without sensitive credentials?

Slach avatar Oct 14 '24 05:10 Slach

@Slach, thanks for the response. We do not use sharding yet.

Output of kubectl get chi -n gidplatform-dev gid -o yaml: chi.yaml.gz

Hubbitus avatar Oct 15 '24 07:10 Hubbitus

@Hubbitus, could you share the result of the following clickhouse-client query?

SELECT database, table, engine_full, count() c
FROM cluster('all-sharded', system.tables)
WHERE database NOT IN ('system', 'INFORMATION_SCHEMA', 'information_schema')
GROUP BY ALL
HAVING c < 2

Slach avatar Oct 17 '24 05:10 Slach

Sure (limited to 10 rows, 269 in total):

database table engine_full c
datamart appmarket__public__widget PostgreSQL(appmarket_db, table = 'widget', schema = 'public') 1
datamart bonus__public__promotion PostgreSQL(bonus_db, table = 'promotion', schema = 'public') 1
sandbox gid_mt_sessions ReplicatedMergeTree('/clickhouse/tables/ad0a75c4-1aa7-4386-a542-c16c19f2b2c6/{shard}', '{replica}') ORDER BY tsEvent SETTINGS index_granularity = 8192 1
_source scs__public__story__foreign PostgreSQL(scs_db, table = 'story', schema = 'public') 1
datamart loyalty__public__level PostgreSQL(loyalty_db, table = 'level', schema = 'public') 1
_source feed__public__reaction__foreign PostgreSQL(feed_db, table = 'reaction', schema = 'public') 1
_loopback nomail_account_register ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') PARTITION BY tuple() ORDER BY time SETTINGS index_granularity = 8192 1
_source questionnaires__public__anketa_access_group__foreign PostgreSQL(questionnaire_db, table = 'anketa_access_group', schema = 'public') 1
sandbox gid_mt_activities ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY dtEvent SETTINGS index_granularity = 8192 1
_source lms__public__lms_user_courses_progress__foreign PostgreSQL(lms_db, table = 'lms_user_courses_progress', schema = 'public') 1

Hubbitus avatar Oct 19 '24 12:10 Hubbitus

You shared logs for 29 Sep 2024 starting at 11:55 UTC. Did your node lose its PVC data before or after this date?

Slach avatar Oct 19 '24 12:10 Slach

Hello. The last shared logs were from 15 October, and that was after one more attempt to recover by deleting the PVC and STS.

Hubbitus avatar Oct 21 '24 07:10 Hubbitus

@Hubbitus, https://github.com/Altinity/clickhouse-operator/issues/1455#issuecomment-2381334294 contains logs only for 29 Sep 2024.

I don't see logs from 15 Oct 2024.

I need to make sure you actually triggered a reconcile: after dropping the PVC and STS, did you manually change spec.taskID in the CHI to trigger a reconcile?

Slach avatar Oct 21 '24 11:10 Slach

after dropping the PVC and STS, did you manually change spec.taskID in the CHI to trigger a reconcile?

Yes. At the suggestion of @alex-zaitsev I introduced the taskID parameter there, and I increase the number on each clean attempt.

Hubbitus avatar Oct 23 '24 00:10 Hubbitus

Please share the clickhouse-operator logs for 15 Oct related to your changes.

Slach avatar Oct 24 '24 09:10 Slach

Hello.

I do not have logs that old.

But I've switched to a branch where taskID: "click-reconcile-3" is set. It looks like reconcile started automatically. The relevant part of the output (slightly obfuscated):

kubectl -n kube-system logs --selector=app=clickhouse-operator --container=clickhouse-operator

operator.2024-11-02T17:47:14+03:00.obfuscated.log

Output of

SELECT database, table, engine_full, count() c, hostname()
FROM
	cluster('{cluster}',system.tables)
WHERE
	database NOT IN ('system','INFORMATION_SCHEMA','information_schema')
GROUP BY ALL
HAVING c<2

It contains 515 rows. The beginning of it:

database table engine_full c hostname()
datamart v_subs__public__channel_requests 1 chi-gid-gid-0-1-0
cdc api__public__reaction ReplicatedReplacingMergeTree('/clickhouse/{cluster}/cdc/tables/api__public__reaction/{shard}', '{replica}') PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192 1 chi-gid-gid-0-1-0
_source bonus_to_gid__user_mappings Kafka(kafka_integration, kafka_topic_list = 'dev__bonus_to_gid__user_mappings', kafka_group_name = 'dev__bonus_to_gid__user_mappings') SETTINGS format_avro_schema_registry_url = 'http://gid-integration-partner-kafka.gid.team:8081' 1 chi-gid-gid-0-1-0
datamart v_calendar__public__event_type 1 chi-gid-gid-0-1-0
_raw api__public__questionnaire_result__dbt_materialized ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS replicated_deduplication_window = 0, index_granularity = 8192 1 chi-gid-gid-0-1-0
datamart v_jiradatabase__public__ao_54307e_slaauditlog 1 chi-gid-gid-0-1-0
datamart v_appmarket__public__widget_notification 1 chi-gid-gid-0-1-0
_raw feed__public__feed_comment__dbt_materialized ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS replicated_deduplication_window = 0, index_granularity = 8192 1 chi-gid-gid-0-1-0
_raw lms__public__lms_courses_chapters__dbt_materialized ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS replicated_deduplication_window = 0, index_granularity = 8192 1 chi-gid-gid-0-1-0
datamart tmp_gazprombonus_user_bonus_to_gid_mapping_inner ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY id SETTINGS index_granularity = 1024 1 chi-gid-gid-0-1-0
datamart api__poll_vote 1 chi-gid-gid-0-1-0
_source loyalty__public__achievement__foreign PostgreSQL(loyalty_db, table = 'achievement', schema = 'public') 1 chi-gid-gid-0-1-0
datamart v_calendar__public__like 1 chi-gid-gid-0-1-0
_source calendar__public__event_type__foreign PostgreSQL(calendar_db, table = 'event_type', schema = 'public') 1 chi-gid-gid-0-1-0

Hubbitus avatar Nov 02 '24 15:11 Hubbitus

According to the logs, you just triggered a reconcile for -0-0-0 while the STS was not deleted.

Try:

kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
kubectl delete pvc -n gidplatform-dev -l clickhouse.altinity.com/cluster=gid,clickhouse.altinity.com/shard=0,clickhouse.altinity.com/replica=0

kubectl edit chi -n gidplatform-dev gid

Edit spec.taskID to manual-4 and watch the reconciling process again; when the STS and PVC are not found during the reconcile, the operator shall propagate the schema.

Slach avatar Nov 02 '24 15:11 Slach

OK, thank you. Doing it again:

  1. Deleted the STS and PVC:
$ kubectl delete sts -n gidplatform-dev chi-gid-gid-0-0
statefulset.apps "chi-gid-gid-0-0" deleted
$ kubectl delete pvc -n gidplatform-dev -l clickhouse.altinity.com/cluster=gid,clickhouse.altinity.com/shard=0,clickhouse.altinity.com/replica=0
persistentvolumeclaim "default-volume-claim-chi-gid-gid-0-0-0" deleted
  2. Pushed a commit with taskID: "click-reconcile-4" and ran a sync in ArgoCD with prune.

I think the relevant logs are:


I1102 16:33:32.889891       1 worker-chi-reconciler.go:89] reconcileCHI():gidplatform-dev/gid/click-reconcile-4:ActionPlan has actions - continue reconcile
I1102 16:33:32.934904       1 worker.go:663] markReconcileStart():gidplatform-dev/gid/click-reconcile-4:reconcile started, task id: click-reconcile-4
I1102 16:33:33.446722       1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:0|host:0-0
I1102 16:33:33.446764       1 worker.go:820] FOUND host: ns:gidplatform-dev|chi:gid|clu:gid|sha:0|rep:1|host:0-1
I1102 16:33:33.446914       1 worker.go:844] RemoteServersGeneratorOptions: exclude hosts: [], attributes: status: , add: true, remove: false, modify: false, found: false, exclude: true
I1102 16:33:33.642314       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-4:Update ConfigMap gidplatform-dev/chi-gid-common-configd
I1102 16:33:34.443758       1 worker.go:1315] updateConfigMap():gidplatform-dev/gid/click-reconcile-4:Update ConfigMap gidplatform-dev/chi-gid-common-usersd
I1102 16:33:35.086985       1 service.go:86] CreateServiceCluster():gidplatform-dev/gid/click-reconcile-4:gidplatform-dev/cluster-gid-gid
I1102 16:33:35.104209       1 worker-chi-reconciler.go:819] PDB updated: gidplatform-dev/gid-gid
I1102 16:33:35.104304       1 worker-chi-reconciler.go:554] not found ReconcileShardsAndHostsOptionsCtxKey, use empty opts
I1102 16:33:35.104344       1 worker-chi-reconciler.go:568] starting first shard separately
I1102 16:33:35.104638       1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
E1102 16:33:35.112714       1 connection.go:145] QueryContext():FAILED Query(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) doRequest: transport failed to send a request to ClickHouse: dial tcp: lookup chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local on 10.43.0.10:53: no such host for SQL: SELECT version()
W1102 16:33:35.112777       1 cluster.go:91] QueryAny():FAILED to run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local] skip to next. err: doRequest: transport failed to send a request to ClickHouse: dial tcp: lookup chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local on 10.43.0.10:53: no such host
E1102 16:33:35.112846       1 cluster.go:95] QueryAny():FAILED to run query on all hosts [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
W1102 16:33:35.112926       1 worker-chi-reconciler.go:345] getHostClickHouseVersion():Failed to get ClickHouse version on host: 0-0
W1102 16:33:35.112980       1 worker-chi-reconciler.go:690] reconcileHost():Reconcile Host start. Host: 0-0 Failed to get ClickHouse version: failed to query
W1102 16:33:35.692945       1 worker.go:1537] gidplatform-dev/chi-gid-gid-0-0:No cur StatefulSet available but host has an ancestor. Found deleted StatefulSet. for gidplatform-dev/chi-gid-gid-0-0

So the operator can't resolve the hostname of the node:

doRequest: transport failed to send a request to ClickHouse: dial tcp: lookup chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local on 10.43.0.10:53: no such host for SQL: SELECT version()

Indeed, the hostname chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local looks strange; ClickHouse knows it under another name:

SELECT cluster, host_name
FROM system.clusters
WHERE cluster = 'gid'
cluster host_name
gid chi-gid-gid-0-0
gid chi-gid-gid-0-1

Hubbitus avatar Nov 02 '24 16:11 Hubbitus

You did not share the full logs, just the first error message you found. That error message is expected: because you deleted the STS, the Kubernetes service name will not resolve.

The hostname contains the SERVICE name, not the pod name.

Were the STS chi-gid-gid-0-0 and the PVC re-created?
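
A quick way to check, as a sketch with the object names used in this thread:

kubectl get sts -n gidplatform-dev chi-gid-gid-0-0
kubectl get pvc -n gidplatform-dev default-volume-claim-chi-gid-gid-0-0-0
kubectl get svc -n gidplatform-dev chi-gid-gid-0-0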

Could you share the full operator logs?

Slach avatar Nov 02 '24 17:11 Slach

Sure. I found many more errors in the log: operator.2024-11-02T19:40:16+03:00.obfuscated.log

Hubbitus avatar Nov 02 '24 21:11 Hubbitus

@sunsingerus, according to the shared logs:

The first reconcile was applied at 2024-11-02 14:44:41, and the STS + PVC still existed:

I1102 14:44:44.267254 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local]
I1102 14:44:44.276002 1 worker-chi-reconciler.go:349] getHostClickHouseVersion():Get ClickHouse version on host: 0-0 version: 24.2.1.2248
I1102 14:44:44.276073 1 worker-chi-reconciler.go:684] reconcileHost():Reconcile Host start. Host: 0-0 ClickHouse version running: 24.2.1.2248

The second try, after the STS + PVC deletion:

I1102 16:30:39.627295 1 worker.go:275] processReconcilePod():gidplatform-dev/chi-gid-gid-0-0-0:Delete Pod. gidplatform-dev/chi-gid-gid-0-0-0
I1102 16:33:32.819212 1 controller.go:572] ENQUEUE new ReconcileCHI cmd=update for gidplatform-dev/gid

I1102 16:33:32.934904 1 worker.go:663] markReconcileStart():gidplatform-dev/gid/click-reconcile-4:reconcile started, task id: click-reconcile-4

The STS and PVC were deleted:

W1102 16:33:35.692945 1 worker.go:1537] gidplatform-dev/chi-gid-gid-0-0:No cur StatefulSet available but host has an ancestor. Found deleted StatefulSet. for gidplatform-dev/chi-gid-gid-0-0
I1102 16:33:35.839840 1 worker.go:1177] shouldExcludeHost():Host should be excluded. Host/shard/cluster: 0/0/gid
I1102 16:33:35.839914 1 worker.go:1010] excludeHost():Exclude from cluster host 0 shard 0 cluster gid
I1102 16:33:37.880392 1 worker-chi-reconciler.go:716] reconcileHost():Reconcile PVCs and check possible data loss for host: 0-0

The PVC was recreated:

I1102 16:33:38.042697 1 worker-chi-reconciler.go:1251] PVC (gidplatform-dev/0-0/default-volume-claim/default-volume-claim-chi-gid-gid-0-0-0) not found and model will not be provided by the operator
W1102 16:33:38.042849 1 worker-chi-reconciler.go:1162] PVC is either newly added to the host or was lost earlier (gidplatform-dev/0-0/default-volume-claim/pvc-name-unknown-pvc-not-exist)

Migration is forced to be applied, and StatefulSet creation starts:

I1102 16:33:38.043010 1 worker-chi-reconciler.go:730] reconcileHost():Data loss detected for host: 0-0. Will do force migrate
I1102 16:33:38.043073 1 worker-chi-reconciler.go:406] worker-chi-reconciler.go:406:reconcileHostStatefulSet():start:reconcile StatefulSet start

I1102 16:33:38.440090 1 worker.go:1596] createStatefulSet():Create StatefulSet gidplatform-dev/chi-gid-gid-0-0 - started
I1102 16:33:39.086823 1 creator.go:35] createStatefulSet()
I1102 16:33:39.086858 1 creator.go:44] Create StatefulSet gidplatform-dev/chi-gid-gid-0-0

I1102 16:34:09.634311 1 worker.go:1615] createStatefulSet():Create StatefulSet gidplatform-dev/chi-gid-gid-0-0 - completed

Preparing for table migration:

I1102 16:34:11.228817 1 worker-chi-reconciler.go:753] reconcileHost():Check host for ClickHouse availability before migrating tables. Host: 0-0 ClickHouse version running: 24.2.1.2248

Trying to drop data from ZK:

@sunsingerus, 0-0-0 is empty and doesn't contain any table definitions, so I think SYSTEM DROP REPLICA 'chi-gid-gid-0-0-0' will do nothing, and this is the root cause:

I1102 16:34:11.228960 1 schemer.go:56] HostDropReplica():Drop replica: chi-gid-gid-0-0 at 0-0
I1102 16:34:11.236511 1 worker-deleter.go:414] dropReplica():Drop replica host: 0-0 in cluster: gid

Getting SQL object definitions:

I1102 16:34:12.415941 1 replicated.go:35] shouldCreateReplicatedObjects():SchemaPolicy.Shard says we need replicated objects. Should create replicated objects for the shard: [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.416343 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.437761 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.717044 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.756658 1 distributed.go:39] shouldCreateDistributedObjects():Should create distributed objects in the cluster: [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.756850 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:12.786098 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]
I1102 16:34:13.018954 1 cluster.go:84] Run query on: chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local of [chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local chi-gid-gid-0-1.gidplatform-dev.svc.cluster.local]

Trying to restore, which fails because the ZK data is still present:

I1102 16:34:13.035413 1 schemer.go:98] HostCreateTables():Creating replicated objects at 0-0: [_loopback _raw service _source temp ....]
E1102 16:34:13.089822 1 connection.go:194] Exec():FAILED Exec(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) Code: 253, Message: Replica /clickhouse/tables/edf41bd4-46aa-4341-bed7-2e19b838e9e1/0/replicas/chi-gid-gid-0-0 already exists for SQL: CREATE TABLE IF NOT EXISTS _loopback.nomail_account_register UUID 'edf41bd4-46aa-4341-bed7-2e19b838e9e1' ....
I1102 16:34:13.089918 1 cluster.go:160] func1():chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:Replica is already in ZooKeeper. Trying ATTACH TABLE instead

We need to choose 0-1-0 for executing SYSTEM DROP REPLICA...
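
A hedged sketch of that manual cleanup, using the replica name from the ZooKeeper path in the logs above; it must run on the healthy replica (never the one being dropped), and only while 0-0 holds no data, because it removes 0-0's replication metadata from ZooKeeper for all tables on that server:

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SYSTEM DROP REPLICA 'chi-gid-gid-0-0'"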

Slach avatar Nov 03 '24 05:11 Slach

@Hubbitus, after the reconcile most of the tables shall be restored (via ATTACH), but some tables were not restored, with a strange difference:

`slo_value` Decimal(15, 5)	DEFAULT	0

in ZooKeeper, and

`slo_value` Decimal(15, 5)	

in the local SQL:

E1102 16:36:08.836812       1 connection.go:194] Exec():FAILED Exec(http://test_operator:***@chi-gid-gid-0-0.gidplatform-dev.svc.cluster.local:8123/) Code: 122, Message: Table columns structure in ZooKeeper is different from local table structure. Local columns:
columns format version: 1
14 columns:
...
Zookeeper columns:
columns format version: 1
14 columns:
...
for SQL: CREATE TABLE IF NOT EXISTS _raw.victoriametrics__slo__metrics__airflow_hour_agg_old UUID '46a1c218-9274-446d-9300-e644bbd4cc0e' ....
 

Slach avatar Nov 03 '24 05:11 Slach

@Hubbitus, could you share:

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"

and

kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"

Slach avatar Nov 03 '24 05:11 Slach

@Slach, sure (column and table comments stripped):

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"

Row 1:
──────
statement: CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old
(
    `slo_metric` LowCardinality(String),
    `slo_service` LowCardinality(String),
    `slo_namespace` LowCardinality(String),
    `slo_status` LowCardinality(String),
    `slo_method` LowCardinality(String),
    `slo_uri` LowCardinality(String),
    `slo_le` LowCardinality(String),
    `slo_event_ts` DateTime64(6, 'UTC'),
    `slo_orig_value` UInt64,
    `slo_value` UInt32,
    `slo_rec_num` UInt32,
    `slo_tags` Map(LowCardinality(String), LowCardinality(String)),
    `_row_hash_` UInt64 MATERIALIZED cityHash64(slo_metric, slo_service, slo_namespace, slo_status, slo_method, slo_uri, slo_event_ts, slo_value, slo_rec_num, slo_tags),
    `__insert_ts` DateTime64(6, 'UTC') DEFAULT now64(6, 'UTC')
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
ORDER BY (slo_metric, slo_service, slo_namespace, slo_event_ts)
SETTINGS index_granularity = 8192
$ kubectl exec -n gidplatform-dev chi-gid-gid-0-0-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical"
Received exception from server (version 24.2.1):
Code: 390. DB::Exception: Received from localhost:9000. DB::Exception: Table `victoriametrics__slo__metrics__airflow_hour_agg_old` doesn't exist. (CANNOT_GET_CREATE_TABLE_QUERY)
(query: SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical)
command terminated with exit code 134

The error looks reasonable: we got the error on table creation after deleting the STS and PVC, didn't we? Maybe some info was left behind in ZooKeeper?
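
One way to check, as a sketch: query system.zookeeper from the healthy node. The path here is an assumption, built from the table UUID above and the '/clickhouse/tables/{uuid}/{shard}' pattern seen earlier in this thread:

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SELECT name FROM system.zookeeper WHERE path = '/clickhouse/tables/46a1c218-9274-446d-9300-e644bbd4cc0e/0/replicas'"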

Hubbitus avatar Nov 03 '24 15:11 Hubbitus

Add SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1:

kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1 FORMAT Vertical"

Slach avatar Nov 03 '24 15:11 Slach

$ kubectl exec -n gidplatform-dev chi-gid-gid-0-1-0 -- clickhouse-client -q "SHOW CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old FORMAT Vertical SETTINGS show_table_uuid_in_table_create_query_if_not_nil=1"
Row 1:
──────
statement: CREATE TABLE _raw.victoriametrics__slo__metrics__airflow_hour_agg_old UUID '46a1c218-9274-446d-9300-e644bbd4cc0e'
(
    `slo_metric` LowCardinality(String),
    `slo_service` LowCardinality(String),
    `slo_namespace` LowCardinality(String),
    `slo_status` LowCardinality(String),
    `slo_method` LowCardinality(String),
    `slo_uri` LowCardinality(String),
    `slo_le` LowCardinality(String),
    `slo_event_ts` DateTime64(6, 'UTC'),
    `slo_orig_value` UInt64,
    `slo_value` UInt32,
    `slo_rec_num` UInt32,
    `slo_tags` Map(LowCardinality(String), LowCardinality(String)),
    `_row_hash_` UInt64 MATERIALIZED cityHash64(slo_metric, slo_service, slo_namespace, slo_status, slo_method, slo_uri, slo_event_ts, slo_value, slo_rec_num, slo_tags),
    `__insert_ts` DateTime64(6, 'UTC') DEFAULT now64(6, 'UTC')
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}')
ORDER BY (slo_metric, slo_service, slo_namespace, slo_event_ts)
SETTINGS index_granularity = 8192

Hubbitus avatar Nov 03 '24 17:11 Hubbitus