Operator does not ensure that a ClickHouse Keeper pod is running before proceeding with the restart of another pod
Hi,
I'm using clickhouse-operator version 0.24.0 and I've encountered the following issue:
When applying a new change to a ClickHouseKeeper cluster, the operator does not ensure that a ClickHouseKeeper pod is running before proceeding with the restart of another pod (it moves on even though the previous pod is still being created).
Let's look at the status of the pods after I changed the ClickHouseKeeperInstallation.
The cluster is applying the new change:
NAME                          READY   STATUS        RESTARTS   AGE
chk-extended-cluster1-0-0-0   1/1     Terminating   0          6m10s
chk-extended-cluster1-0-1-0   1/1     Running       0          6m10s
chk-extended-cluster1-0-2-0   1/1     Running       0          5m25s
(...)
NAME                          READY   STATUS              RESTARTS   AGE
chk-extended-cluster1-0-0-0   0/1     ContainerCreating   0          65s
chk-extended-cluster1-0-1-0   1/1     Running             0          7m19s
chk-extended-cluster1-0-2-0   1/1     Running             0          6m34s
(...)
NAME                          READY   STATUS    RESTARTS   AGE
chk-extended-cluster1-0-0-0   1/1     Running   0          78s
chk-extended-cluster1-0-1-0   1/1     Running   0          7m32s
chk-extended-cluster1-0-2-0   1/1     Running   0          6m47s
So far so good, but let's see what happens next:
NAME                          READY   STATUS        RESTARTS   AGE
chk-extended-cluster1-0-0-0   1/1     Running       0          81s
chk-extended-cluster1-0-1-0   1/1     Terminating   0          7m35s
chk-extended-cluster1-0-2-0   1/1     Running       0          6m50s
(...)
NAME                          READY   STATUS    RESTARTS   AGE
chk-extended-cluster1-0-0-0   1/1     Running   0          82s
chk-extended-cluster1-0-2-0   1/1     Running   0          6m51s
And here is where the problem appears:
NAME                          READY   STATUS              RESTARTS   AGE
chk-extended-cluster1-0-0-0   1/1     Running             0          86s
chk-extended-cluster1-0-1-0   0/1     ContainerCreating   0          1s
chk-extended-cluster1-0-2-0   1/1     Terminating         0          6m55s
As you can see, pod cluster1-0-1-0 is still in the ContainerCreating state, but the operator has already decided to terminate pod cluster1-0-2-0.
This caused the cluster to lose quorum for a short time, which ClickHouse did not like, resulting in the following error:
error": "(CreateMemoryTableQueryOnCluster) Error when executing query: code: 999, message: All connection tries failed while connecting to ZooKeeper. nodes: 10.233.71.16:9181, 10.233.81.20:9181, 10.233.70.35:9181\nCode: 999. Coordination::Exception: Keeper server rejected the connection during the handshake. Possibly it's overloaded, doesn't see leader or is stale: while receiving handshake from ZooKeeper. (KEEPER_EXCEPTION) (version 24.8.2.3 (official build)), 10.233.71.16:9181\nPoco::Exception. Code: 1000, e.code() = 111, Connection refused (version 24.8.2.3 (official build))
I was expecting the ClickHouse Keeper cluster to apply new changes without any disruption to the ClickHouse cluster.
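For context, this is roughly how the quorum loss can be observed from outside. It is only a sketch: the pod name is taken from the output above, nc is assumed to be available on the local machine, the app=clickhouse-keeper label comes from my pod template, and the mntr four-letter command is only allowed because four_letter_word_white_list is set to "*" in my settings.
# watch the Keeper pods while the operator rolls the change out
kubectl get pods -l app=clickhouse-keeper -w

# ask one replica for its Raft state via the "mntr" four-letter command
kubectl port-forward pod/chk-extended-cluster1-0-2-0 9181 &
echo mntr | nc 127.0.0.1 9181 | grep -E 'zk_server_state|zk_synced_followers'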
@mandreasik, please use 0.24.2 or later. It works differently, though it still allows some downtime. The 1st and 2nd nodes are restarted almost at the same time:
test-049-2c940a34-c428-11ef-9ccd-acde48001122   chk-clickhouse-keeper-test-0-0-0   1/1   Running   0   57s
test-049-2c940a34-c428-11ef-9ccd-acde48001122   chk-clickhouse-keeper-test-0-1-0   1/1   Running   0   54s
test-049-2c940a34-c428-11ef-9ccd-acde48001122   chk-clickhouse-keeper-test-0-2-0   1/1   Running   0   22s
@alex-zaitsev I've tried version 0.24.2 as suggested and found another issue. I've tested it on k3d with Kubernetes 1.28.x and also 1.29.x; I got different results, but both show very strange behavior.
Steps to reproduce:
- Create a ClickHouse Keeper cluster using the example configuration provided here:
apiVersion: "clickhouse-keeper.altinity.com/v1"
kind: "ClickHouseKeeperInstallation"
metadata:
name: extended
spec:
configuration:
clusters:
- name: "cluster1"
layout:
replicasCount: 3
settings:
logger/level: "trace"
logger/console: "true"
listen_host: "0.0.0.0"
keeper_server/four_letter_word_white_list: "*"
keeper_server/coordination_settings/raft_logs_level: "information"
prometheus/endpoint: "/metrics"
prometheus/port: "7000"
prometheus/metrics: "true"
prometheus/events: "true"
prometheus/asynchronous_metrics: "true"
prometheus/status_info: "false"
keeper_server/coordination_settings/force_sync: "false"
defaults:
templates:
# Templates are specified as default for all clusters
podTemplate: default
dataVolumeClaimTemplate: default
templates:
podTemplates:
- name: default
metadata:
labels:
app: clickhouse-keeper
spec:
containers:
- name: clickhouse-keeper
imagePullPolicy: IfNotPresent
image: "clickhouse/clickhouse-keeper:latest"
resources:
requests:
memory: "256M"
cpu: "1"
limits:
memory: "4Gi"
cpu: "2"
securityContext:
fsGroup: 101
volumeClaimTemplates:
- name: default
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
I've deleted the affinity section from the example to allow the pods to be scheduled on the same node.
- Wait until the CHK cluster is in a "Completed" state (kubectl get chk).
- Modify one key in the settings section. In my case it was keeper_server/coordination_settings/force_sync, which I changed from "false" to "true" (a patch sketch follows this list).
- Observe what happens with the cluster.
If the cluster comes back without any issues, repeat the last two steps.
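The modification can be done with kubectl edit chk extended, or with a JSON merge patch along these lines (just a sketch; only the force_sync key from the settings above is changed, adjust the name and namespace as needed):
kubectl patch chk extended --type=merge -p '{
  "spec": {
    "configuration": {
      "settings": {
        "keeper_server/coordination_settings/force_sync": "true"
      }
    }
  }
}'

# then watch the operator roll the change out
kubectl get chk extended -w
kubectl get pods -l app=clickhouse-keeper -w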
From my tests, I observed two outcomes:
- After the cluster was modified, the operator restarted all pods twice. It waited for the pods to be alive and then decided to perform a second round of restarts.
- After the cluster was modified, one or more StatefulSets were scaled down to 0 but never scaled back up, while the CHK status remained "Completed" (see the checks below).
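For the second outcome, this is how I confirmed it (a sketch; each Keeper replica gets its own StatefulSet, so the name below is simply the pod name without the trailing ordinal):
# the affected StatefulSets stay at 0/0
kubectl get statefulsets

# spec.replicas was set to 0 and never raised back
kubectl get sts chk-extended-cluster1-0-1 -o jsonpath='{.spec.replicas}{"\n"}'

# while the CHK still reports Completed
kubectl get chk extended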
Could you perform such tests in your environment?
@alex-zaitsev @mandreasik I have the same issue with operator version 25.3. Here is the issue. What is more critical is that during an upgrade of the CHK, the CHI pods lose their connection to the CHK and tables become read-only in the CHI. ... The workaround I found is to set stop: yes on the CHI installation, wait for the CHI pods to terminate, and only then upgrade the CHK, so the disruption does not matter. But this is not normal. It seems the operator does not respect the CHK PDB the way it does the CHI PDB.
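A quick way to check this is to compare the PodDisruptionBudgets the operator creates for the CHI and the CHK, and whether they are honoured during the rollout. A sketch (resource names will differ per installation):
# list PDBs in the namespace with their thresholds and current allowance
kubectl get pdb -o custom-columns=NAME:.metadata.name,MIN-AVAILABLE:.spec.minAvailable,ALLOWED:.status.disruptionsAllowed
Note that a PDB only guards voluntary disruptions that go through the Eviction API; if the operator deletes the Keeper pods directly, the PDB is bypassed regardless of its settings, which would match the behaviour described above.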