cp-helm-charts
ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler) - Zookeeper Cluster
Often enough to hurt our business, our ZooKeeper members fail with "ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)". It seems to start on one member and then take down the whole ZooKeeper cluster, and the brokers with it. It looks like https://issues.apache.org/jira/browse/ZOOKEEPER-3036 is related. Why is this ZooKeeper version in the release, what is the plan for upcoming ZooKeeper releases, and is there anything I can do about it?
Logs:
ZK-0 - [2020-07-03 03:40:28,552] WARN Exception when following the leader (org.apache.zookeeper.server.quorum.Learner)
java.net.SocketTimeoutException: Read timed out
ZK-1 - [2020-07-02 10:54:17,681] WARN Unable to read additional data from client sessionid 0x200887c07d10004, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-07-03 03:40:45,745] INFO Expiring session 0x200887c07d10004, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Submitting global closeSession request for session 0x200887c07d10004 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Expiring session 0x300887c2cee0002, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Submitting global closeSession request for session 0x300887c2cee0002 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Expiring session 0x300887c2cee0001, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Submitting global closeSession request for session 0x300887c2cee0001 (org.apache.zookeeper.server.ZooKeeperServer)
### I think this is where everything starts
[2020-07-03 03:40:45,752] ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketTimeoutException: Read timed out
[2020-07-03 03:40:45,765] WARN ******* GOODBYE /10.233.106.50:54428 ******** (org.apache.zookeeper.server.quorum.LearnerHandler)
[2020-07-03 03:40:45,765] WARN Unexpected exception at LearnerHandler Socket[addr=/10.233.113.68,port=36912,localport=2888] tickOfNextAckDeadline:422901 synced?:true queuedPacketLength:2 (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketException: Broken pipe (Write failed)
[2020-07-03 03:40:45,762] ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketException: Connection reset
ZK-2 - [2020-07-03 03:41:15,203] ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
BR-0 - [2020-07-03 03:40:37,481] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
BR-1 - [2020-07-03 03:40:37,481] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
BR-2 - [2020-07-03 03:40:43,411] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either before or while waiting for connection
Kafka and zookeeper image versions:
kubectl describe pods -n kafka-prod | grep -i image | grep -i zookeeper | tail -1
Normal Pulled 30m kubelet, prod-k8s-w1 Container image "confluentinc/cp-zookeeper:5.4.1" already present on machine
kubectl describe pods -n kafka-prod | grep -i image | grep -i cp-enterprise-kafka | tail -1
Normal Pulled 29m kubelet, prod-k8s-w1 Container image "confluentinc/cp-enterprise-kafka:5.4.1" already present on machine
Helm chart version:
helm list confluent-prod
NAME REVISION UPDATED STATUS CHART APP VERSION NAMESPACE
confluent-prod 1 Tue May 5 16:53:20 2020 DEPLOYED cp-helm-charts-0.4.11.0 kafka-prod
Kubernetes version:
kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:36:19Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
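One way to check which ZooKeeper version the image actually bundles is to look for the jar inside a running pod (the pod name is assumed from the release above, and the jar location can vary between image versions, hence the broad search):
kubectl exec -n kafka-prod confluent-prod-cp-zookeeper-0 -- sh -c "find / -name 'zookeeper-*.jar' 2>/dev/null"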
Can you please share the logs from all of the broker and zookeeper pods?
@liko9, sorry, I couldn't capture the logs for that case. Today there was a new outage, and thanks to kafka-exporter I could see that well before the error above something else goes seriously wrong with one of the ZooKeepers. The error above only appears after a pod is restarted, but long before that, with no crash in the pods, I already have no brokers available and many other problems. Please check the logs in the initial post; I have updated them.
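For anyone else chasing this, the ZooKeeper four-letter commands give a quick view of each member's state during such an outage. A rough sketch, assuming the pod names from the release above, that nc is available in the image, and that srvr/mntr are allowed by 4lw.commands.whitelist:
# srvr: role (leader/follower), latency, outstanding requests
kubectl exec -n kafka-prod confluent-prod-cp-zookeeper-0 -- bash -c 'echo srvr | nc 127.0.0.1 2181'
# mntr: detailed metrics, including the number of followers the leader sees
kubectl exec -n kafka-prod confluent-prod-cp-zookeeper-0 -- bash -c 'echo mntr | nc 127.0.0.1 2181'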
Looks like this is fixed in the 5.5.1 image; testing.
@angelovopsan Could you please give me more information? Did deploying the 5.5.1 image solve the issue for you? I'm not confident the underlying problem is actually solved, because all of the ZooKeeper issues are still open and there is no related change in the ZooKeeper release notes:
- https://issues.apache.org/jira/browse/ZOOKEEPER-2938
- https://issues.apache.org/jira/browse/ZOOKEEPER-3036
Noticed this with confluentinc/cp-zookeeper:6.0.1 and confluentinc/cp-enterprise-kafka:6.0.1. Please find attached the logs of all 3 ZooKeeper nodes: logs.tar.gz
Pod placement in K8s:
kafka-cp-kafka-0                          ip-172-31-165-143.us-west-2.compute.internal
kafka-cp-kafka-1                          ip-172-31-80-72.us-west-2.compute.internal
kafka-cp-kafka-2                          ip-172-31-165-143.us-west-2.compute.internal
kafka-cp-kafka-exporter-6f45f59b99-rdxtp  ip-172-31-80-72.us-west-2.compute.internal
kafka-cp-zookeeper-0                      ip-172-31-165-143.us-west-2.compute.internal
kafka-cp-zookeeper-1                      ip-172-31-82-166.us-west-2.compute.internal
kafka-cp-zookeeper-2                      ip-172-31-165-143.us-west-2.compute.internal
Summary:
Zookeeper-0: Couldn't connect to Zookeeper-2
Zookeeper-1: Couldn't connect to Zookeeper-2 and Zookeeper-0
Zookeeper-2 continuously logs the lines below:
[2021-10-22 05:48:10,242] INFO Submitting global closeSession request for session 0x20047caaa57008f (org.apache.zookeeper.server.ZooKeeperServer)
[2021-10-22 05:48:14,242] INFO Expiring session 0x20047caaa570090, timeout of 18000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
Fixed the issue for now with a rolling restart. Any help would be much appreciated.
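A rolling restart along these lines should be equivalent (the statefulset name is assumed from the pod names above, and kubectl rollout restart requires Kubernetes 1.15+):
# restart the zookeeper pods one at a time and wait for the ensemble to settle
kubectl rollout restart statefulset/kafka-cp-zookeeper -n <namespace>
kubectl rollout status statefulset/kafka-cp-zookeeper -n <namespace>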
I have the same problem with confluentinc/cp-zookeeper:5.5.2. Unfortunately, upgrading the version does not seem to help, as @ahmed3465 explained in this issue, so I have not tried that yet. It looks like restarting pods in a specific way causes the ZooKeeper ensemble to enter a corrupt state. As you describe, one ZooKeeper pod then keeps spamming Submitting global closeSession request... and Expiring session log lines, and this breaks the entire ZooKeeper cluster.
Is there a way to mitigate this issue? So far the only way to fix this for me was to scale down all Zookeeper pods in Kubernetes, then scale them back up.
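The scale-down/scale-up workaround looks roughly like this (the statefulset name and app label are assumptions based on the cp-helm-charts defaults; note this takes ZooKeeper fully offline, so brokers will be disrupted while it runs):
# scale to zero and wait for the pods to terminate
kubectl scale statefulset/<release>-cp-zookeeper -n <namespace> --replicas=0
kubectl wait --for=delete pod -l app=cp-zookeeper -n <namespace> --timeout=120s
# bring the ensemble back
kubectl scale statefulset/<release>-cp-zookeeper -n <namespace> --replicas=3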
Update: try upgrading to 5.5.6; it seems to have fixed it for us. I have also found that on a newer EKS version (1.20) we see much less of this even without the image upgrade, so it looks to be partly platform related as well, but 5.5.6 does bump the bundled ZooKeeper version.
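If it helps, a sketch of bumping only the image tags on an existing release (this assumes the release name from earlier in the thread and that the chart exposes cp-zookeeper.imageTag and cp-kafka.imageTag, as the upstream values.yaml does):
# <chart-ref> is however you installed the chart, e.g. a local checkout of cp-helm-charts
helm upgrade confluent-prod <chart-ref> \
  --reuse-values \
  --set cp-zookeeper.imageTag=5.5.6 \
  --set cp-kafka.imageTag=5.5.6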
I had a similar issue with confluentinc/cp-zookeeper:6.1.0. If I drained the Kubernetes node that had the current ZooKeeper leader on it, the ZooKeeper cluster would enter this corrupt state after a quick leader election. The new leader only spammed Submitting global closeSession request... and Expiring session.
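For reference, the drain that triggers it is roughly the following (the node name is a placeholder; on older kubectl the last flag is spelled --delete-local-data):
# drain the node hosting the current zookeeper leader
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# after testing, make the node schedulable again
kubectl uncordon <node-name>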
I tried adjusting maxSessionTimeout and a whole lot of other settings but couldn't get it fixed.
Finally, an upgrade to confluentinc/cp-zookeeper:6.1.7 seems to have done the trick for me.