Operator does not ensure that a ClickHouse Keeper pod is running before proceeding with the restart of another pod
Hi,
I'm using clickhouse-operator version 0.24.0 and I've encountered the following issue:
When applying a new change to a ClickHouseKeeper cluster, the operator does not ensure that a ClickHouseKeeper pod is running before proceeding with the restart of another pod (it moves on even though the previous pod is still being created).
Let's look at the status of the pods after I changed the ClickHouseKeeperInstallation.
The cluster is applying the new change:
NAME                          READY   STATUS        RESTARTS   AGE
chk-extended-cluster1-0-0-0   1/1     Terminating   0          6m10s
chk-extended-cluster1-0-1-0   1/1     Running       0          6m10s
chk-extended-cluster1-0-2-0   1/1     Running       0          5m25s
(...)
NAME                          READY   STATUS              RESTARTS   AGE
chk-extended-cluster1-0-0-0   0/1     ContainerCreating   0          65s
chk-extended-cluster1-0-1-0   1/1     Running             0          7m19s
chk-extended-cluster1-0-2-0   1/1     Running             0          6m34s
(...)
NAME                          READY   STATUS    RESTARTS   AGE
chk-extended-cluster1-0-0-0   1/1     Running   0          78s
chk-extended-cluster1-0-1-0   1/1     Running   0          7m32s
chk-extended-cluster1-0-2-0   1/1     Running   0          6m47s
So far so good, but let's see what happens next:
NAME                          READY   STATUS        RESTARTS   AGE
chk-extended-cluster1-0-0-0   1/1     Running       0          81s
chk-extended-cluster1-0-1-0   1/1     Terminating   0          7m35s
chk-extended-cluster1-0-2-0   1/1     Running       0          6m50s
(...)
NAME                          READY   STATUS    RESTARTS   AGE
chk-extended-cluster1-0-0-0   1/1     Running   0          82s
chk-extended-cluster1-0-2-0   1/1     Running   0          6m51s
And here is where the problem appears:
NAME                          READY   STATUS              RESTARTS   AGE
chk-extended-cluster1-0-0-0   1/1     Running             0          86s
chk-extended-cluster1-0-1-0   0/1     ContainerCreating   0          1s
chk-extended-cluster1-0-2-0   1/1     Terminating         0          6m55s
As you can see, pod cluster1-0-1-0 is still in the ContainerCreating state, but the operator has already decided to terminate pod cluster1-0-2-0.
This caused the cluster to lose quorum for a short time, which ClickHouse did not like, resulting in the following error:
error": "(CreateMemoryTableQueryOnCluster) Error when executing query: code: 999, message: All connection tries failed while connecting to ZooKeeper. nodes: 10.233.71.16:9181, 10.233.81.20:9181, 10.233.70.35:9181\nCode: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or is stale: while receiving handshake from ZooKeeper. (KEEPER_EXCEPTION) (version 24.8.2.3 (official build)), 10.233.71.16:9181\nPoco::Exception. Code: 1000, e.code() = 111, Connection refused (version 24.8.2.3 (official build))
I was expecting the ClickHouse Keeper cluster to apply new changes without any disruption to the ClickHouse cluster.
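For context, this is roughly how the quorum loss can be observed from outside. It is only a sketch: the pod name is taken from the output above, nc is assumed to be available on the local machine, the app=clickhouse-keeper label comes from my pod template, and the mntr four-letter command is only allowed because four_letter_word_white_list is set to "*" in my settings.
# watch the Keeper pods while the operator rolls the change out
kubectl get pods -l app=clickhouse-keeper -w

# ask one replica for its Raft state via the "mntr" four-letter command
kubectl port-forward pod/chk-extended-cluster1-0-2-0 9181 &
echo mntr | nc 127.0.0.1 9181 | grep -E 'zk_server_state|zk_synced_followers'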
@mandreasik, please use 0.24.2 or later. It works differently, though it still allows some downtime. The 1st and 2nd nodes are restarted almost at the same time:
test-049-2c940a34-c428-11ef-9ccd-acde48001122   chk-clickhouse-keeper-test-0-0-0   1/1   Running   0   57s
test-049-2c940a34-c428-11ef-9ccd-acde48001122   chk-clickhouse-keeper-test-0-1-0   1/1   Running   0   54s
test-049-2c940a34-c428-11ef-9ccd-acde48001122   chk-clickhouse-keeper-test-0-2-0   1/1   Running   0   22s
@alex-zaitsev I've tried version 0.24.2 as suggested and found another issue. I've tested it on k3d with Kubernetes 1.28.x and also 1.29.x; I got different results, but both show very strange behavior.
Steps to reproduce:
- Create a ClickHouse Keeper cluster using the example configuration provided here:
apiVersion: "clickhouse-keeper.altinity.com/v1"
kind: "ClickHouseKeeperInstallation"
metadata:
name: extended
spec:
configuration:
clusters:
- name: "cluster1"
layout:
replicasCount: 3
settings:
logger/level: "trace"
logger/console: "true"
listen_host: "0.0.0.0"
keeper_server/four_letter_word_white_list: "*"
keeper_server/coordination_settings/raft_logs_level: "information"
prometheus/endpoint: "/metrics"
prometheus/port: "7000"
prometheus/metrics: "true"
prometheus/events: "true"
prometheus/asynchronous_metrics: "true"
prometheus/status_info: "false"
keeper_server/coordination_settings/force_sync: "false"
defaults:
templates:
# Templates are specified as default for all clusters
podTemplate: default
dataVolumeClaimTemplate: default
templates:
podTemplates:
- name: default
metadata:
labels:
app: clickhouse-keeper
spec:
containers:
- name: clickhouse-keeper
imagePullPolicy: IfNotPresent
image: "clickhouse/clickhouse-keeper:latest"
resources:
requests:
memory: "256M"
cpu: "1"
limits:
memory: "4Gi"
cpu: "2"
securityContext:
fsGroup: 101
volumeClaimTemplates:
- name: default
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
I've deleted the affinity section from the example to allow the pods to be scheduled on the same node.
- Wait until the CHK cluster is in a "Completed" state (kubectl get chk).
- Modify one key in the settings section. In my case it was keeper_server/coordination_settings/force_sync, which I changed from "false" to "true" (a patch sketch follows this list).
- Observe what happens with the cluster.
If the cluster comes back without any issues, repeat the last two steps.
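The modification can be done with kubectl edit chk extended, or with a JSON merge patch along these lines (just a sketch; only the force_sync key from the settings above is changed, adjust the name and namespace as needed):
kubectl patch chk extended --type=merge -p '{
  "spec": {
    "configuration": {
      "settings": {
        "keeper_server/coordination_settings/force_sync": "true"
      }
    }
  }
}'

# then watch the operator roll the change out
kubectl get chk extended -w
kubectl get pods -l app=clickhouse-keeper -w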
From my tests, I observed two outcomes:
- After the cluster was modified, the operator restarted all pods twice. It waited for the pods to be alive and then decided to perform a second round of restarts.
- After the cluster was modified, one or more StatefulSets were scaled down to 0 but never scaled back up, while the CHK status remained "Completed" (see the checks below).
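For the second outcome, this is how I confirmed it (a sketch; each Keeper replica gets its own StatefulSet, so the name below is simply the pod name without the trailing ordinal):
# the affected StatefulSets stay at 0/0
kubectl get statefulsets

# spec.replicas was set to 0 and never raised back
kubectl get sts chk-extended-cluster1-0-1 -o jsonpath='{.spec.replicas}{"\n"}'

# while the CHK still reports Completed
kubectl get chk extended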
Could you perform such tests in your environment?
@alex-zaitsev @mandreasik I have the same issue with operator version 25.3. Here is the issue. What is more critical is that during an upgrade of the CHK, the CHI pods lose their connection to the CHK and tables become read-only in the CHI. ... The workaround I found is to set stop: yes on the CHI installation, wait for the CHI pods to terminate, and only then upgrade the CHK, so the disruption does not matter. But this is not normal. It seems the operator does not respect the CHK PDB the way it does the CHI PDB.
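A quick way to check this is to compare the PodDisruptionBudgets the operator creates for the CHI and the CHK, and whether they are honoured during the rollout. A sketch (resource names will differ per installation):
# list PDBs in the namespace with their thresholds and current allowance
kubectl get pdb -o custom-columns=NAME:.metadata.name,MIN-AVAILABLE:.spec.minAvailable,ALLOWED:.status.disruptionsAllowed
Note that a PDB only guards voluntary disruptions that go through the Eviction API; if the operator deletes the Keeper pods directly, the PDB is bypassed regardless of its settings, which would match the behaviour described above.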