
All ZK pods report the error "Zookeeper is not running"

priyavj08 opened this issue 3 years ago • 9 comments

Description

After a fresh bring-up of the ZK cluster (version 0.2.12), I am seeing the error "Zookeeper server is not running" in all the ZK logs. When I exec into the pod and run the command "echo ruok | nc 127.0.0.1 2181", it works fine and returns "imok". However, the readiness/liveness probes on the zk-1 and zk-2 pods failed once, though they succeeded later on.
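
For reference, a minimal sketch of running the same "ruok" check against every pod from outside, using the pod and namespace names that appear later in this thread (both are assumptions; adjust to your cluster):

# Run the ZooKeeper "ruok" four-letter-word check on each pod.
# Pod names and namespace are assumptions taken from this thread.
for i in 0 1 2; do
  echo "--- fed-kafka-affirmedzk-$i ---"
  kubectl exec -n fed-kafka fed-kafka-affirmedzk-$i -- \
    bash -c 'echo ruok | nc 127.0.0.1 2181; echo'
done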

Attaching the logs: zk-1-log.txt zk-2-log.txt zk-0-log.txt

Importance

Since this ZK cluster is used by Kafka, this is a blocker issue for us.

Location

ZK ensemble formation

Suggestions for an improvement

It doesn't seem to recover automatically.

priyavj08 avatar Sep 06 '21 10:09 priyavj08

@priyavj08, from the logs I can see that initially all the pods came up and later the connection to the first pod was broken.

2021-09-06 07:56:35,183 [myid:3] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1395] - Connection broken for id 1, my id = 3
java.io.EOFException
	at java.base/java.io.DataInputStream.readInt(Unknown Source)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1383)
2021-09-06 07:56:35,186 [myid:3] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1401] - Interrupting SendWorker thread from RecvWorker. sid: 1. myId: 3

anishakj avatar Sep 06 '21 10:09 anishakj

@anishakj how can I recover? Also, this doesn't happen all the time, so there is no network issue in the environment. Could there be a timing issue in the product?

priyavj08 avatar Sep 06 '21 12:09 priyavj08

Not sure whether there is a specific issue with the 3.6.1 base image of ZooKeeper. Are your pods still in the running state? Could you please post the describe output of the pods? If it indicates readiness/liveness failures, that can be because the check is taking a long time to execute in your setup.
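
A minimal sketch of collecting the probe-related events asked for above (pod names and namespace are assumptions taken from this thread):

# Print the Events section of kubectl describe for each ZK pod;
# liveness/readiness probe failures show up here.
for i in 0 1 2; do
  echo "--- fed-kafka-affirmedzk-$i ---"
  kubectl describe pod -n fed-kafka fed-kafka-affirmedzk-$i | sed -n '/^Events:/,$p'
done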

anishakj avatar Sep 06 '21 12:09 anishakj

I noticed the liveness and readiness probes failing on each of the zk-1 and zk-2 pods, but not on zk-0.

Here are my settings for the probes:

Liveness:  exec [zookeeperLive.sh]  delay=30s timeout=30s period=40s #success=1 #failure=3
Readiness: exec [zookeeperReady.sh] delay=30s timeout=30s period=40s #success=1 #failure=3
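
For completeness, a minimal sketch of reading these probe settings straight from the live pod spec (pod name and namespace are assumptions taken from this thread):

# Print the configured liveness and readiness probes of the first container.
kubectl get pod -n fed-kafka fed-kafka-affirmedzk-0 \
  -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}{.spec.containers[0].readinessProbe}{"\n"}'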

priyavj08 avatar Sep 06 '21 13:09 priyavj08

Also, try nslookup fed-kafka-affirmedzk-1.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local and nslookup fed-kafka-affirmedzk-2.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local from the zookeeper-0 pod and see if they succeed.

anishakj avatar Sep 06 '21 14:09 anishakj

@anishakj nslookup works fine

nslookup fed-kafka-affirmedzk-1.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53

Name:      fed-kafka-affirmedzk-1.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Address:   192.168.87.108

root@fed-kafka-affirmedzk-0:/apache-zookeeper-3.6.1-bin# nslookup fed-kafka-affirmedzk-2.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Server:    10.96.0.10
Address:   10.96.0.10#53

Name:      fed-kafka-affirmedzk-2.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Address:   192.168.183.245

Attaching a new set of logs: zk1.txt zk2.txt zkop.txt zk0.txt. The pod describe output seems fine for all ZK pods.

priyavj08 avatar Sep 08 '21 05:09 priyavj08

@anishakj
All the ZK pods are showing this:

echo stat | nc 127.0.0.1 2181
This ZooKeeper instance is not currently serving requests

After starting fine, how can it get into this state?
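
A minimal sketch for narrowing this down with the "srvr" four-letter word, which also reports each server's mode (leader/follower) once the ensemble has quorum. This assumes srvr is allowed by 4lw.commands.whitelist and uses the pod names from this thread:

# "srvr" reports "Mode: leader" or "Mode: follower" when the ensemble is healthy.
for i in 0 1 2; do
  echo "--- fed-kafka-affirmedzk-$i ---"
  kubectl exec -n fed-kafka fed-kafka-affirmedzk-$i -- \
    bash -c 'echo srvr | nc 127.0.0.1 2181'
done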

priyavj08 avatar Sep 08 '21 09:09 priyavj08

From the initial logs, it looks like the connection from the first pod to the second pod is broken. Apart from that, I couldn't find any relevant details on how it reached this state. Also, could you run ./zkCli.sh from a pod and check that the config shows all the nodes?
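
A minimal sketch of that check, with the install path taken from the shell prompt earlier in this thread (pod name and namespace are assumptions):

# Dump the ensemble configuration via zkCli; it should list a
# server.N entry for each of the three pods.
kubectl exec -n fed-kafka fed-kafka-affirmedzk-0 -- \
  /apache-zookeeper-3.6.1-bin/bin/zkCli.sh -server 127.0.0.1:2181 config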

anishakj avatar Sep 08 '21 09:09 anishakj

The problem may be solved by https://github.com/apache/zookeeper/pull/1798

This is the corresponding ZooKeeper issue: https://issues.apache.org/jira/browse/ZOOKEEPER-3988

eolivelli avatar Jan 20 '22 11:01 eolivelli