
A clickhouse-keeper node cannot start up after installation

Open liubo-it opened this issue 1 year ago • 6 comments

(screenshot of the error attached)

liubo-it avatar Aug 02 '24 08:08 liubo-it

@Slach Can you help me? I'm following the documentation

liubo-it avatar Aug 02 '24 08:08 liubo-it

Please stop sharing text as images.

Which instructions did you follow exactly? Please share a link.

Slach avatar Aug 02 '24 11:08 Slach


Sorry. I followed the document below to deploy clickhouse-keeper, and I get an error when I start the clickhouse-keeper-02 pod.

Error log:

2024.08.03 05:14:40.671867 [ 22 ] {} <Debug> KeeperSnapshotManagerS3: Shutting down KeeperSnapshotManagerS3
2024.08.03 05:14:40.671899 [ 22 ] {} <Information> KeeperSnapshotManagerS3: KeeperSnapshotManagerS3 shut down
2024.08.03 05:14:40.671911 [ 22 ] {} <Debug> KeeperDispatcher: Dispatcher shut down
2024.08.03 05:14:40.672404 [ 22 ] {} <Error> Application: Code: 568. DB::Exception: At least one of servers should be able to start as leader (without <start_as_follower>). (RAFT_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000000e42fdb in /usr/bin/clickhouse-keeper
1. DB::Exception::Exception<char const (&) [88]>(int, char const (&) [88]) @ 0x000000000086a740 in /usr/bin/clickhouse-keeper
2. DB::KeeperStateManager::parseServersConfiguration(Poco::Util::AbstractConfiguration const&, bool, bool) const @ 0x0000000000869595 in /usr/bin/clickhouse-keeper
3. DB::KeeperStateManager::KeeperStateManager(int, String const&, String const&, Poco::Util::AbstractConfiguration const&, std::shared_ptr<DB::CoordinationSettings> const&, std::shared_ptr<DB::KeeperContext>) @ 0x000000000086b08b in /usr/bin/clickhouse-keeper
4. DB::KeeperServer::KeeperServer(std::shared_ptr<DB::KeeperConfigurationAndSettings> const&, Poco::Util::AbstractConfiguration const&, ConcurrentBoundedQueue<DB::KeeperStorage::ResponseForSession>&, ConcurrentBoundedQueue<DB::CreateSnapshotTask>&, std::shared_ptr<DB::KeeperContext>, DB::KeeperSnapshotManagerS3&, std::function<void (unsigned long, DB::KeeperStorage::RequestForSession const&)>) @ 0x0000000000802bc1 in /usr/bin/clickhouse-keeper
5. DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool, bool, std::shared_ptr<DB::Macros const> const&) @ 0x00000000007e81c6 in /usr/bin/clickhouse-keeper
6. DB::Context::initializeKeeperDispatcher(bool) const @ 0x0000000000a5bb06 in /usr/bin/clickhouse-keeper
7. DB::Keeper::main(std::vector<String, std::allocator<String>> const&) @ 0x0000000000b771e9 in /usr/bin/clickhouse-keeper
8. Poco::Util::Application::run() @ 0x0000000000ffbf26 in /usr/bin/clickhouse-keeper
9. DB::Keeper::run() @ 0x0000000000b73f7e in /usr/bin/clickhouse-keeper
10. Poco::Util::ServerApplication::run(int, char**) @ 0x0000000001012d39 in /usr/bin/clickhouse-keeper
11. mainEntryClickHouseKeeper(int, char**) @ 0x0000000000b72ef8 in /usr/bin/clickhouse-keeper
12. main @ 0x0000000000b81b1d in /usr/bin/clickhouse-keeper
 (version 23.10.5.20 (official build))
2024.08.03 05:14:40.672441 [ 22 ] {} <Error> Application: DB::Exception: At least one of servers should be able to start as leader (without <start_as_follower>)
2024.08.03 05:14:40.672446 [ 22 ] {} <Information> Application: shutting down
2024.08.03 05:14:40.672449 [ 22 ] {} <Debug> Application: Uninitializing subsystem: Logging Subsystem
2024.08.03 05:14:40.672565 [ 23 ] {} <Trace> BaseDaemon: Received signal -2
2024.08.03 05:14:40.672601 [ 23 ] {} <Information> BaseDaemon: Stop SignalListener thread
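For context on the error above: code 568 (RAFT_ERROR) is raised when every <server> entry in the effective raft_configuration carries <start_as_follower>, leaving no node allowed to bootstrap as leader. A minimal sketch of how to detect that condition in a generated config follows; the sample XML is fabricated for illustration, and on a failing pod you would point CFG at /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml instead.

```shell
# Sketch: detect the "no leader-capable server" condition behind RAFT_ERROR 568.
# The sample config below is hypothetical; on a pod, inspect the real
# generated-keeper-settings.xml instead of this temp file.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
<yandex><keeper_server><raft_configuration>
<server><id>1</id><hostname>keeper-0</hostname><start_as_follower>true</start_as_follower></server>
<server><id>2</id><hostname>keeper-1</hostname><start_as_follower>true</start_as_follower></server>
</raft_configuration></keeper_server></yandex>
EOF
total=$(grep -c '<server>' "$CFG")                       # raft members listed
followers=$(grep -c '<start_as_follower>true' "$CFG")    # follower-only members
echo "servers=$total follower_only=$followers"
if [ "$total" -gt 0 ] && [ "$total" -eq "$followers" ]; then
  echo "no server may start as leader -> keeper aborts with RAFT_ERROR 568"
fi
rm -f "$CFG"
```

If `follower_only` equals `servers` (and `servers` is non-zero), clickhouse-keeper refuses to start, which matches the stack trace above.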

reference file https://github.com/Altinity/clickhouse-operator/blob/master/deploy/clickhouse-keeper/clickhouse-keeper-manually/clickhouse-keeper-3-nodes.yaml

(pod details attached as a screenshot)

Kubernetes resource file:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: wukong-clickhouse-keeper-local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: clickhouse-keeper-local-pv-0
  namespace:  wukong-application
  labels:
    name: clickhouse-keeper-local-pv-0
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: wukong-clickhouse-keeper-local-storage
  hostPath:
    path: /data/tingyun/wukong/tingyun/common/clickhouse-keeper/data0
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 10.128.9.10
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: clickhouse-keeper-local-pv-1
  namespace:  wukong-application
  labels:
    name: clickhouse-keeper-local-pv-1
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: wukong-clickhouse-keeper-local-storage
  hostPath:
    path: /data/tingyun/wukong/tingyun/common/clickhouse-keeper/data1
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 10.128.9.10
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: clickhouse-keeper-local-pv-2
  namespace:  wukong-application
  labels:
    name: clickhouse-keeper-local-pv-2
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: wukong-clickhouse-keeper-local-storage
  hostPath:
    path: /data/tingyun/wukong/tingyun/common/clickhouse-keeper/data2
    type: DirectoryOrCreate
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 10.128.9.10
---
apiVersion: v1
kind: Service
metadata:
  name: wukong-clickhouse-keeper-hs
  namespace: wukong-application
  labels:
    app: wukong-clickhouse-keeper
spec:
  ports:
  - port:  9234
    name: raft
  clusterIP: None
  selector:
    app: wukong-clickhouse-keeper
---
apiVersion: v1
kind: Service
metadata:
  name: wukong-clickhouse-keeper
  namespace: wukong-application
  labels:
    app: wukong-clickhouse-keeper
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
    prometheus.io/port: "9363"
    prometheus.io/scrape: "true"
spec:
  ports:
  - port: 2181
    name: client
  - port: 9363
    name: prometheus
  selector:
    app: wukong-clickhouse-keeper
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: wukong-clickhouse-keeper
  namespace:  wukong-application
  labels:
    app: wukong-clickhouse-keeper
data:
  keeper_config.xml: |
    <clickhouse>
        <include_from>/tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml</include_from>
        <logger>
            <level>trace</level>
            <console>true</console>
        </logger>
        <listen_host>::</listen_host>
        <keeper_server incl="keeper_server">
            <enable_reconfiguration>true</enable_reconfiguration>
            <path>/var/lib/clickhouse-keeper</path>
            <tcp_port>2181</tcp_port>
            <four_letter_word_white_list>*</four_letter_word_white_list>
            <coordination_settings>
                <!-- <raft_logs_level>trace</raft_logs_level> -->
                <raft_logs_level>information</raft_logs_level>
            </coordination_settings>
        </keeper_server>
        <prometheus>
            <endpoint>/metrics</endpoint>
            <port>9363</port>
            <metrics>true</metrics>
            <events>true</events>
            <asynchronous_metrics>true</asynchronous_metrics>
            <status_info>true</status_info>
        </prometheus>
    </clickhouse>
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: wukong-clickhouse-keeper-scripts
  namespace:  wukong-application
  labels:
    app: wukong-clickhouse-keeper-scripts
data:
  env.sh: |
    #!/usr/bin/env bash
    export DOMAIN=`hostname -d`
    export CLIENT_HOST=clickhouse-keeper
    export CLIENT_PORT=2181
    export RAFT_PORT=9234
  keeperFunctions.sh: |
    #!/usr/bin/env bash
    set -ex
    function keeperConfig() {
      echo "$HOST.$DOMAIN:$RAFT_PORT;$ROLE;$WEIGHT"
    }
    function keeperConnectionString() {
      # If the client service address is not yet available, then return localhost
      set +e
      getent hosts "${CLIENT_HOST}" 2>/dev/null 1>/dev/null
      if [[ $? -ne 0 ]]; then
        set -e
        echo "-h localhost -p ${CLIENT_PORT}"
      else
        set -e
        echo "-h ${CLIENT_HOST} -p ${CLIENT_PORT}"
      fi
    }

  keeperStart.sh: |
    #!/usr/bin/env bash
    set -ex
    source /conf/env.sh
    source /conf/keeperFunctions.sh

    HOST=`hostname -s`
    if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
      NAME=${BASH_REMATCH[1]}
      ORD=${BASH_REMATCH[2]}
    else
      echo Failed to parse name and ordinal of Pod
      exit 1
    fi
    export MY_ID=$((ORD+1))
    set +e
    getent hosts $DOMAIN
    if [[ $? -eq 0 ]]; then
      ACTIVE_ENSEMBLE=true
    else
      ACTIVE_ENSEMBLE=false
    fi
    set -e
    mkdir -p /tmp/clickhouse-keeper/config.d/
    if [[ "true" == "${ACTIVE_ENSEMBLE}" ]]; then
      # get current config from clickhouse-keeper
      CURRENT_KEEPER_CONFIG=$(clickhouse-keeper-client --history-file=/dev/null -h ${CLIENT_HOST} -p ${CLIENT_PORT} -q "get /keeper/config" || true)
      # generate dynamic config, add current server to xml
      {
        echo "<yandex><keeper_server>"
        echo "<server_id>${MY_ID}</server_id>"
        echo "<raft_configuration>"
        if [[ "0" == $(echo "${CURRENT_KEEPER_CONFIG}" | grep -c "${HOST}.${DOMAIN}") ]]; then
          echo "<server><id>${MY_ID}</id><hostname>${HOST}.${DOMAIN}</hostname><port>${RAFT_PORT}</port><priority>1</priority><start_as_follower>true</start_as_follower></server>"
        fi
        while IFS= read -r line; do
          id=$(echo "$line" | cut -d '=' -f 1 | cut -d '.' -f 2)
          if [[ "" != "${id}" ]]; then
            hostname=$(echo "$line" | cut -d '=' -f 2 | cut -d ';' -f 1 | cut -d ':' -f 1)
            port=$(echo "$line" | cut -d '=' -f 2 | cut -d ';' -f 1 | cut -d ':' -f 2)
            priority=$(echo "$line" | cut -d ';' -f 3)
            priority=${priority:-1}
            port=${port:-$RAFT_PORT}
            echo "<server><id>$id</id><hostname>$hostname</hostname><port>$port</port><priority>$priority</priority></server>"
          fi
        done <<< "$CURRENT_KEEPER_CONFIG"
        echo "</raft_configuration>"
        echo "</keeper_server></yandex>"
      } > /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml
    else
      # generate dynamic config, add current server to xml
      {
        echo "<yandex><keeper_server>"
        echo "<server_id>${MY_ID}</server_id>"
        echo "<raft_configuration>"
        echo "<server><id>${MY_ID}</id><hostname>${HOST}.${DOMAIN}</hostname><port>${RAFT_PORT}</port><priority>1</priority></server>"
        echo "</raft_configuration>"
        echo "</keeper_server></yandex>"
      } > /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml
    fi

    # run clickhouse-keeper
    cat /tmp/clickhouse-keeper/config.d/generated-keeper-settings.xml
    rm -rfv /var/lib/clickhouse-keeper/terminated
    clickhouse-keeper --config-file=/etc/clickhouse-keeper/keeper_config.xml

  keeperTeardown.sh: |
    #!/usr/bin/env bash
    set -ex
    exec > /proc/1/fd/1
    exec 2> /proc/1/fd/2
    source /conf/env.sh
    source /conf/keeperFunctions.sh
    set +e
    KEEPER_URL=$(keeperConnectionString)
    set -e
    HOST=`hostname -s`
    if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
        NAME=${BASH_REMATCH[1]}
        ORD=${BASH_REMATCH[2]}
    else
        echo Failed to parse name and ordinal of Pod
        exit 1
    fi
    export MY_ID=$((ORD+1))

    CURRENT_KEEPER_CONFIG=$(clickhouse-keeper-client --history-file=/dev/null -h localhost -p ${CLIENT_PORT} -q "get /keeper/config")
    CLUSTER_SIZE=$(echo -e "${CURRENT_KEEPER_CONFIG}" | grep -c -E '^server\.[0-9]+=')
    echo "CLUSTER_SIZE=$CLUSTER_SIZE, MyId=$MY_ID"
    # If CLUSTER_SIZE > 1, this server is being permanently removed from raft_configuration.
    if [[ "$CLUSTER_SIZE" -gt "1" ]]; then
      clickhouse-keeper-client --history-file=/dev/null -q "reconfig remove $MY_ID" ${KEEPER_URL}
    fi

    # Wait to remove $MY_ID from quorum
    # for (( i = 0; i < 6; i++ )); do
    #    CURRENT_KEEPER_CONFIG=$(clickhouse-keeper-client --history-file=/dev/null -h localhost -p ${CLIENT_PORT} -q "get /keeper/config")
    #    if [[ "0" == $(echo -e "${CURRENT_KEEPER_CONFIG}" | grep -c -E "^server.${MY_ID}=$HOST.+participant;[0-1]$") ]]; then
    #      echo "$MY_ID removed from quorum"
    #      break
    #    else
    #      echo "$MY_ID still present in quorum"
    #    fi
    #    sleep 1
    # done

    # Wait for client connections to drain. Kubernetes will wait until the configured
    # "terminationGracePeriodSeconds" before forcibly killing the container
    for (( i = 0; i < 3; i++ )); do
      CONN_COUNT=`echo $(exec 3<>/dev/tcp/127.0.0.1/2181 ; printf "cons" >&3 ; IFS=; tee <&3; exec 3<&- ;) | grep -v "^$" | grep -v "127.0.0.1" | wc -l`
      if [[ "$CONN_COUNT" -gt "0" ]]; then
        echo "$CONN_COUNT non-local connections still connected."
        sleep 1
      else
        echo "$CONN_COUNT non-local connections"
        break
      fi
    done

    touch /var/lib/clickhouse-keeper/terminated
    # Kill the primary process ourselves to circumvent the terminationGracePeriodSeconds
    ps -ef | grep clickhouse-keeper | grep -v grep | awk '{print $1}' | xargs kill


  keeperLive.sh: |
    #!/usr/bin/env bash
    set -ex
    source /conf/env.sh
    OK=$(exec 3<>/dev/tcp/127.0.0.1/${CLIENT_PORT} ; printf "ruok" >&3 ; IFS=; tee <&3; exec 3<&- ;)
    # Check to see if keeper service answers
    if [[ "$OK" == "imok" ]]; then
      exit 0
    else
      exit 1
    fi

  keeperReady.sh: |
    #!/usr/bin/env bash
    set -ex
    exec > /proc/1/fd/1
    exec 2> /proc/1/fd/2
    source /conf/env.sh
    source /conf/keeperFunctions.sh

    HOST=`hostname -s`

    # Check to see if clickhouse-keeper service answers
    set +e
    getent hosts $DOMAIN
    if [[ $? -ne 0 ]]; then
      echo "no active DNS records in service, first running pod"
      exit 0
    elif [[ -f /var/lib/clickhouse-keeper/terminated ]]; then
      echo "termination in progress"
      exit 0
    else
      set -e
      # An ensemble exists, check to see if this node is already a member.
      # Extract resource name and this members' ordinal value from pod hostname
      if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
        NAME=${BASH_REMATCH[1]}
        ORD=${BASH_REMATCH[2]}
      else
        echo "Failed to parse name and ordinal of Pod"
        exit 1
      fi
      MY_ID=$((ORD+1))

      CURRENT_KEEPER_CONFIG=$(clickhouse-keeper-client --history-file=/dev/null -h ${CLIENT_HOST} -p ${CLIENT_PORT} -q "get /keeper/config" || exit 0)
      # Check to see if clickhouse-keeper for this node is a participant in raft cluster
      if [[ "1" == $(echo -e "${CURRENT_KEEPER_CONFIG}" | grep -c -E "^server.${MY_ID}=${HOST}.+participant;1$") ]]; then
        echo "clickhouse-keeper instance is available and an active participant"
        exit 0
      else
        echo "clickhouse-keeper instance is ready to add as participant with 1 weight."

        ROLE=participant
        WEIGHT=1
        KEEPER_URL=$(keeperConnectionString)
        NEW_KEEPER_CONFIG=$(keeperConfig)
        clickhouse-keeper-client --history-file=/dev/null -q "reconfig add 'server.$MY_ID=$NEW_KEEPER_CONFIG'" ${KEEPER_URL}
        exit 0
      fi
    fi
---
# Setup ClickHouse Keeper StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  # nodes would be named as clickhouse-keeper-0, clickhouse-keeper-1, clickhouse-keeper-2
  name: wukong-clickhouse-keeper
  namespace:  wukong-application
  labels:
    app: wukong-clickhouse-keeper
spec:
  selector:
    matchLabels:
      app: wukong-clickhouse-keeper
  serviceName:  wukong-clickhouse-keeper-hs
  replicas: 3
  template:
    metadata:
      labels:
        app: wukong-clickhouse-keeper
      annotations:
        prometheus.io/port: '9363'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: wukong-clickhouse-keeper-settings
          configMap:
            name: wukong-clickhouse-keeper
            items:
              - key: keeper_config.xml
                path: keeper_config.xml
        - name: wukong-clickhouse-keeper-scripts
          configMap:
            name: wukong-clickhouse-keeper-scripts
            defaultMode: 0755
      containers:
        - name: wukong-clickhouse-keeper
          imagePullPolicy: IfNotPresent
          image: "ccr.ccs.tencentyun.com/wukong-common/clickhouse-keeper:23.10.5.20"
          resources:
            requests:
              memory: "256M"
              cpu: "100m"
            limits:
              memory: "4Gi"
              cpu: "1000m"
          volumeMounts:
            - name: wukong-clickhouse-keeper-settings
              mountPath: /etc/clickhouse-keeper/
            - name: wukong-clickhouse-keeper-scripts
              mountPath: /conf/
            - name: data
              mountPath: /var/lib/clickhouse-keeper
          command:
            - /conf/keeperStart.sh
          lifecycle:
            preStop:
              exec:
                command:
                  - /conf/keeperTeardown.sh
          livenessProbe:
            exec:
              command:
                - /conf/keeperLive.sh
            initialDelaySeconds: 60
            timeoutSeconds: 10
          readinessProbe:
            exec:
              command:
                - /conf/keeperReady.sh
            initialDelaySeconds: 60
            timeoutSeconds: 10
          ports:
            - containerPort: 2181
              name: client
              protocol: TCP
            - containerPort: 9234
              name: quorum
              protocol: TCP
            - containerPort: 9363
              name: metrics
              protocol: TCP
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName:  wukong-clickhouse-keeper-local-storage
      resources:
        requests:
          storage: 50Gi

liubo-it avatar Aug 03 '24 05:08 liubo-it
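A note on how the keeperStart.sh script above can produce exactly the failing configuration: if the headless-service DNS already resolves (so ACTIVE_ENSEMBLE=true) but `get /keeper/config` returns nothing (the `|| true` swallows the failure), the generated file contains only the current pod, marked start_as_follower, so no server can become leader. The following condensed sketch reproduces that branch; the hostnames and values are hypothetical, and the loop over existing servers is elided because the fetched config is empty.

```shell
# Sketch of the suspect branch in keeperStart.sh: ACTIVE_ENSEMBLE is true,
# but the keeper config fetch came back empty, so the script emits only the
# current pod, and emits it as a follower-only server.
HOST=wukong-clickhouse-keeper-1    # hypothetical: second pod's short hostname
DOMAIN=wukong-clickhouse-keeper-hs.wukong-application.svc.cluster.local
RAFT_PORT=9234
MY_ID=2
CURRENT_KEEPER_CONFIG=""           # keeper-client failed; "|| true" hid it
GEN=$(mktemp)
{
  echo "<yandex><keeper_server>"
  echo "<server_id>${MY_ID}</server_id>"
  echo "<raft_configuration>"
  # same check as the original [[ "0" == $(... grep -c ...) ]]
  if [ "$(echo "${CURRENT_KEEPER_CONFIG}" | grep -c "${HOST}.${DOMAIN}")" = "0" ]; then
    echo "<server><id>${MY_ID}</id><hostname>${HOST}.${DOMAIN}</hostname><port>${RAFT_PORT}</port><priority>1</priority><start_as_follower>true</start_as_follower></server>"
  fi
  echo "</raft_configuration>"
  echo "</keeper_server></yandex>"
} > "$GEN"
servers=$(grep -c '<server>' "$GEN")
followers=$(grep -c 'start_as_follower' "$GEN")
echo "servers=$servers follower_only=$followers"
rm -f "$GEN"
```

Every server in the resulting raft_configuration is follower-only, which is the condition the RAFT_ERROR 568 message describes.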

This problem also exists when I use a Helm chart.

Command: helm install clickhouse-keeper --generate-name

Link: https://artifacthub.io/packages/helm/duyet/clickhouse-keeper?modal=install

(screenshots of the error attached)

liubo-it avatar Aug 05 '24 03:08 liubo-it

That is not the official Helm chart.

Did you run kubectl apply -n <namespace> -f https://github.com/Altinity/clickhouse-operator/blob/master/deploy/clickhouse-keeper/clickhouse-keeper-manually/clickhouse-keeper-3-nodes.yaml only once, or did you do something else?

Application: Code: 568. DB::Exception: At least one of servers should be able to start as leader (without <start_as_follower>)

Try executing the following on the live pods:

clickhouse-keeper-client -q "get /keeper/config"
grep -C 10 start_as_follower -r /etc/clickhouse-keeper/

Slach avatar Aug 05 '24 04:08 Slach
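For readers following along: judging from how the scripts in this thread parse it, `get /keeper/config` returns one line per raft member in the form server.<id>=<host>:<port>;<role>;<priority>. A small sketch of the same field extraction keeperStart.sh performs; the sample line is fabricated to match the manifests above.

```shell
# Sketch: extract the fields of one /keeper/config line, mirroring the
# cut pipeline in keeperStart.sh. The sample line is hypothetical.
line='server.2=wukong-clickhouse-keeper-1.wukong-clickhouse-keeper-hs.wukong-application.svc.cluster.local:9234;participant;1'
id=$(echo "$line" | cut -d '=' -f 1 | cut -d '.' -f 2)
hostname=$(echo "$line" | cut -d '=' -f 2 | cut -d ';' -f 1 | cut -d ':' -f 1)
port=$(echo "$line" | cut -d '=' -f 2 | cut -d ';' -f 1 | cut -d ':' -f 2)
priority=$(echo "$line" | cut -d ';' -f 3)
echo "id=$id host=$hostname port=$port priority=$priority"
```

If the output of `get /keeper/config` on your pods shows every server ending in a start_as_follower-style entry, or shows no entry for the failing pod at all, that points at the config-generation problem discussed above.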

@liubo-it could you check kubectl apply -n <namespace> -f https://github.com/Altinity/clickhouse-operator/blob/0.24.0/deploy/clickhouse-keeper/clickhouse-keeper-manually/clickhouse-keeper-3-nodes.yaml ?

Slach avatar Aug 12 '24 08:08 Slach

@liubo-it any news from your side?

Slach avatar Sep 09 '24 03:09 Slach

The problem is still there, so I took a different approach.

liubo-it avatar Sep 10 '24 03:09 liubo-it

Hi, I have the same problem when I customize the resource names, e.g. <prefix>-clickhouse-keeper. When I deploy the original file https://github.com/Altinity/clickhouse-operator/blob/0.24.0/deploy/clickhouse-keeper/clickhouse-keeper-manually/clickhouse-keeper-3-nodes.yaml, it deploys without problems.

linontano avatar Sep 17 '24 16:09 linontano

@linontano could you provide more context on what exactly you tried to do? Could you share the output of kubectl get chk -n <your-namespace> <prefix>-clickhouse-keeper -o yaml?

Slach avatar Sep 17 '24 18:09 Slach

We are using a very slightly modified version of https://github.com/Altinity/clickhouse-operator/blob/0.23.7/deploy/clickhouse-keeper/clickhouse-keeper-manually/clickhouse-keeper-3-nodes.yaml and are facing the same problem. The only modifications we've made are setting a custom namespace on all the resources in that file (both Services, both ConfigMaps, and the StatefulSet).

@linontano could you provide more context what exactly you tried to do? could you share? kubectl get chk -n <your-namespace> <prefix>-clickhouse-keeper -o yaml

kubectl get chk only lists ClickHouseKeeper resources (created by the operator) -- in this case we won't have any of these, since we're using the manual deployment manifest.

janeklb avatar Sep 18 '24 14:09 janeklb

Switching to https://github.com/Altinity/clickhouse-operator/blob/0.24.0/deploy/clickhouse-keeper/clickhouse-keeper-manually/clickhouse-keeper-3-nodes.yaml (and applying the custom namespace modifications) fixes it for me

janeklb avatar Sep 18 '24 14:09 janeklb

I had the same issue. I was running an older release of Altinity/clickhouse-operator; I diffed my keeper resources against the latest 0.24.0 release resources, patched my current deployment with the latest changes, and it worked. There were only minor changes in the two ConfigMap YAMLs. Thanks! @Slach @janeklb

utkarsh2811 avatar Oct 24 '24 08:10 utkarsh2811