zookeeper-operator
All ZK pods report error "Zookeeper is not running"
Description
After a new bring-up of the ZK cluster (version 0.2.12), I am seeing the error "zookeeper server is not running" in all the ZK logs. When I exec into the pod and run "echo ruok | nc 127.0.0.1 2181", it works fine and returns "imok", but it looks like the readiness/liveness probes for the ZK-1 and ZK-2 pods failed once and then succeeded later on.
Attaching the logs: zk-1-log.txt zk-2-log.txt zk-0-log.txt
Importance
Since this ZK cluster is used by Kafka, it is a blocker issue for us.
Location
ZK ensemble formation
Suggestions for an improvement
It doesn't seem to recover automatically
@priyavj08 , From the logs I am seeing that initially all the pods came up and later connection to first pod is broken.
2021-09-06 07:56:35,183 [myid:3] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@1395] - Connection broken for id 1, my id = 3
java.io.EOFException
at java.base/java.io.DataInputStream.readInt(Unknown Source)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1383)
2021-09-06 07:56:35,186 [myid:3] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@1401] - Interrupting SendWorker thread from RecvWorker. sid: 1. myId: 3
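One way to narrow down whether a broken connection like the one above is a reachability problem is to probe the quorum and election ports of the first pod directly from another pod. A minimal sketch, assuming the default ZooKeeper ports (2888/3888), an nc that supports -z, and the pod/headless-service names that appear later in this thread:

```sh
# Probe zk-0's quorum (2888) and election (3888) ports from pod 3.
# Pod/service names and namespace are assumptions based on the nslookup
# output further down in this thread; adjust to your cluster.
kubectl -n fed-kafka exec fed-kafka-affirmedzk-2 -- sh -c '
  for p in 2888 3888; do
    nc -zv -w 2 fed-kafka-affirmedzk-0.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local "$p"
  done'
```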
@anishakj how can I recover? Also, this doesn't happen all the time, so there is no network issue in the environment. Could there be a timing issue in the product?
Not sure whether there is any specific issue in the 3.6.1 base image of ZooKeeper. Are your pods still in the running state? Could you please post the describe output of the pods? If it indicates readiness/liveness failures, that can be because the check is taking longer to execute in your setup.
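For reference, the describe output and the probe-failure events can be collected like this (a sketch; the pod names and namespace are assumptions taken from the nslookup output later in this thread):

```sh
# Collect describe output for all three ZK pods (names/namespace assumed).
for i in 0 1 2; do
  kubectl -n fed-kafka describe pod "fed-kafka-affirmedzk-$i" > "zk-$i-describe.txt"
done

# Probe failures show up as Warning events on the pod.
kubectl -n fed-kafka get events --field-selector \
  involvedObject.name=fed-kafka-affirmedzk-1,type=Warning
```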
I noticed the liveness and readiness probes failing in each of the ZK-1 and ZK-2 pods, but not ZK-0.
Here are my settings for the probes:
Liveness:  exec [zookeeperLive.sh] delay=30s timeout=30s period=40s #success=1 #failure=3
Readiness: exec [zookeeperReady.sh] delay=30s timeout=30s period=40s #success=1 #failure=3
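One way to see whether the probe scripts themselves are approaching the 30s timeout is to run and time them by hand inside an affected pod. A sketch, assuming zookeeperLive.sh and zookeeperReady.sh are on the PATH of the pod (as the probe definitions above suggest) and that bash is present in the image:

```sh
# Time the probe scripts manually inside zk-1 (pod name/namespace assumed).
kubectl -n fed-kafka exec fed-kafka-affirmedzk-1 -- \
  bash -c 'time zookeeperReady.sh; echo "ready exit=$?"'
kubectl -n fed-kafka exec fed-kafka-affirmedzk-1 -- \
  bash -c 'time zookeeperLive.sh; echo "live exit=$?"'
```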
Also, try
nslookup fed-kafka-affirmedzk-1.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
and
nslookup fed-kafka-affirmedzk-2.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
from the zookeeper-0 pod and see if they succeed.
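The same check can be run without opening an interactive shell (a sketch; assumes nslookup is available in the image and kubectl access to the fed-kafka namespace):

```sh
# Resolve both peer hostnames from inside zookeeper-0 (names/namespace assumed).
for i in 1 2; do
  kubectl -n fed-kafka exec fed-kafka-affirmedzk-0 -- \
    nslookup "fed-kafka-affirmedzk-$i.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local"
done
```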
@anishakj nslookup works fine
nslookup fed-kafka-affirmedzk-1.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Server:   10.96.0.10
Address:  10.96.0.10#53

Name:     fed-kafka-affirmedzk-1.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Address:  192.168.87.108

root@fed-kafka-affirmedzk-0:/apache-zookeeper-3.6.1-bin# nslookup fed-kafka-affirmedzk-2.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Server:   10.96.0.10
Address:  10.96.0.10#53

Name:     fed-kafka-affirmedzk-2.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
Address:  192.168.183.245
Attaching a new set of logs: zk1.txt zk2.txt zkop.txt zk0.txt. The pod describe output seems fine for all ZK pods.
@anishakj all the ZK pods are showing this:

echo stat | nc 127.0.0.1 2181
This ZooKeeper instance is not currently serving requests

After starting fine, how can it get into this state?
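A couple more four-letter-word checks can show what mode each node thinks it is in (a sketch; it assumes the 4lw whitelist allows srvr/mntr, which seems likely since ruok and stat already answer, and reuses the pod names/namespace from earlier in this thread):

```sh
# Query the server state of every ZK pod (pod names/namespace assumed).
for i in 0 1 2; do
  echo "--- fed-kafka-affirmedzk-$i ---"
  kubectl -n fed-kafka exec "fed-kafka-affirmedzk-$i" -- \
    bash -c 'echo srvr | nc 127.0.0.1 2181; echo mntr | nc 127.0.0.1 2181 | head -5'
done
```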
From the initial logs, it looks like the connection is broken from the first pod to the second pod. Apart from that, I couldn't find any relevant details on how it reached this state.
Also, could you run ./zkCli.sh from a pod and see whether config is showing all the nodes?
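For reference, that check looks roughly like this (a sketch; the bin path matches the apache-zookeeper-3.6.1-bin directory visible in the shell prompt above, and the pod name/namespace are assumptions):

```sh
# Open a zkCli session on zk-0 and dump the quorum config.
kubectl -n fed-kafka exec -it fed-kafka-affirmedzk-0 -- \
  /apache-zookeeper-3.6.1-bin/bin/zkCli.sh -server 127.0.0.1:2181
# at the zkCli prompt:
#   config
# A healthy 3-node ensemble lists server.1, server.2 and server.3 entries.
```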
The problem may be solved by https://github.com/apache/zookeeper/pull/1798
This is the corresponding ZooKeeper issue: https://issues.apache.org/jira/browse/ZOOKEEPER-3988