
All ZK pods report the error "Zookeeper is not running"

priyavj08 opened this issue 3 years ago • 9 comments

Description

After a fresh bring-up of the ZK cluster (version 0.2.12), I am seeing the error "Zookeeper server is not running" in all the ZK logs. When I exec into the pod and run the command "echo ruok | nc 127.0.0.1 2181", it works fine and returns "imok". However, the readiness/liveness probes on the zk-1 and zk-2 pods failed once, though they succeeded later on.
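
For reference, a minimal sketch of running the same "ruok" check against every pod from outside, using the pod and namespace names that appear later in this thread (both are assumptions; adjust to your cluster):

# Run the ZooKeeper "ruok" four-letter-word check on each pod.
# Pod names and namespace are assumptions taken from this thread.
for i in 0 1 2; do
  echo "--- fed-kafka-affirmedzk-$i ---"
  kubectl exec -n fed-kafka fed-kafka-affirmedzk-$i -- \
    bash -c 'echo ruok | nc 127.0.0.1 2181; echo'
done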

Attaching the logs: zk-1-log.txt zk-2-log.txt zk-0-log.txt

Importance

Since this ZK cluster is used by Kafka, this is a blocker issue for us.

Location

ZK ensemble formation

Suggestions for an improvement

It doesn't seem to recover automatically.

priyavj08 avatar Sep 06 '21 10:09 priyavj08

@priyavj08, from the logs I can see that initially all the pods came up and later the connection to the first pod was broken.

2021-09-06 07:56:35,183 [myid:3] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1395] - Connection broken for id 1, my id = 3
java.io.EOFException
	at java.base/java.io.DataInputStream.readInt(Unknown Source)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1383)
2021-09-06 07:56:35,186 [myid:3] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1401] - Interrupting SendWorker thread from RecvWorker. sid: 1. myId: 3

anishakj avatar Sep 06 '21 10:09 anishakj

@anishakj how can I recover? Also, this doesn't happen all the time, so there is no network issue in the environment. Could there be a timing issue in the product?

priyavj08 avatar Sep 06 '21 12:09 priyavj08

Not sure whether there is a specific issue with the 3.6.1 base image of ZooKeeper. Are your pods still in the running state? Could you please post the describe output of the pods? If it indicates readiness/liveness failures, that can be because the check is taking a long time to execute in your setup.
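
A minimal sketch of collecting the probe-related events asked for above (pod names and namespace are assumptions taken from this thread):

# Print the Events section of kubectl describe for each ZK pod;
# liveness/readiness probe failures show up here.
for i in 0 1 2; do
  echo "--- fed-kafka-affirmedzk-$i ---"
  kubectl describe pod -n fed-kafka fed-kafka-affirmedzk-$i | sed -n '/^Events:/,$p'
done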

anishakj avatar Sep 06 '21 12:09 anishakj

I noticed the liveness and readiness probes failing on each of the zk-1 and zk-2 pods, but not on zk-0.

Here are my settings for the probes:

Liveness:  exec [zookeeperLive.sh]  delay=30s timeout=30s period=40s #success=1 #failure=3
Readiness: exec [zookeeperReady.sh] delay=30s timeout=30s period=40s #success=1 #failure=3
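
For completeness, a minimal sketch of reading these probe settings straight from the live pod spec (pod name and namespace are assumptions taken from this thread):

# Print the configured liveness and readiness probes of the first container.
kubectl get pod -n fed-kafka fed-kafka-affirmedzk-0 \
  -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}{.spec.containers[0].readinessProbe}{"\n"}'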

priyavj08 avatar Sep 06 '21 13:09 priyavj08

Also, try nslookup fed-kafka-affirmedzk-1.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local and nslookup fed-kafka-affirmedzk-2.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local from the zookeeper-0 pod and see if they succeed.

anishakj avatar Sep 06 '21 14:09 anishakj

@anishakj nslookup works fine

nslookup fed-kafka-affirmedzk-1.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53

Name:      fed-kafka-affirmedzk-1.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Address:   192.168.87.108

root@fed-kafka-affirmedzk-0:/apache-zookeeper-3.6.1-bin# nslookup fed-kafka-affirmedzk-2.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53

Name:      fed-kafka-affirmedzk-2.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Address:   192.168.183.245

Attaching a new set of logs: zk1.txt zk2.txt zkop.txt zk0.txt. The pod describe output seems fine for all ZK pods.

priyavj08 avatar Sep 08 '21 05:09 priyavj08

@anishakj
All the ZK pods are showing this:

echo stat | nc 127.0.0.1 2181
This ZooKeeper instance is not currently serving requests

After starting fine, how can it get into this state?
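
A minimal sketch for narrowing this down with the "srvr" four-letter word, which also reports each server's mode (leader/follower) once the ensemble has quorum. This assumes srvr is allowed by 4lw.commands.whitelist and uses the pod names from this thread:

# "srvr" reports "Mode: leader" or "Mode: follower" when the ensemble is healthy.
for i in 0 1 2; do
  echo "--- fed-kafka-affirmedzk-$i ---"
  kubectl exec -n fed-kafka fed-kafka-affirmedzk-$i -- \
    bash -c 'echo srvr | nc 127.0.0.1 2181'
done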

priyavj08 avatar Sep 08 '21 09:09 priyavj08

From the initial logs, it looks like the connection from the first pod to the second pod is broken. Apart from that, I couldn't find any relevant details on how it reached this state. Also, could you run ./zkCli.sh from a pod and check that the config shows all the nodes?
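
A minimal sketch of that check, with the install path taken from the shell prompt earlier in this thread (pod name and namespace are assumptions):

# Dump the ensemble configuration via zkCli; it should list a
# server.N entry for each of the three pods.
kubectl exec -n fed-kafka fed-kafka-affirmedzk-0 -- \
  /apache-zookeeper-3.6.1-bin/bin/zkCli.sh -server 127.0.0.1:2181 config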

anishakj avatar Sep 08 '21 09:09 anishakj

The problem may be solved by https://github.com/apache/zookeeper/pull/1798

This is the corresponding ZooKeeper issue: https://issues.apache.org/jira/browse/ZOOKEEPER-3988

eolivelli avatar Jan 20 '22 11:01 eolivelli