
ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler) - ZooKeeper Cluster

Open • s7an-it opened this issue 4 years ago • 8 comments

Often enough to hurt the business, our ZooKeeper members fail with "ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)". I think it starts on one member and then takes down the whole ZK cluster and the brokers after it. https://issues.apache.org/jira/browse/ZOOKEEPER-3036 looks related. Why is this ZooKeeper version in the release, what about upcoming ZooKeeper releases, and is there anything I can do about it?
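One quick check is which ZooKeeper version the 5.4.1 image actually bundles, since that determines whether ZOOKEEPER-3036 can still apply. A minimal sketch using the srvr four-letter word, assuming the pods follow the chart's <release>-cp-zookeeper-N naming, that nc is present in the container (not guaranteed for every image build), and that srvr is allowed by 4lw.commands.whitelist (it usually is by default on ZooKeeper 3.5+):

# Print each member's status line, which includes the bundled ZooKeeper version
# (pod names assume the confluent-prod release in the kafka-prod namespace)
for i in 0 1 2; do
  kubectl exec -n kafka-prod confluent-prod-cp-zookeeper-$i -- \
    bash -c 'echo srvr | nc localhost 2181' | head -1
done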

Logs:

ZK-0 - [2020-07-03 03:40:28,552] WARN Exception when following the leader (org.apache.zookeeper.server.quorum.Learner)
java.net.SocketTimeoutException: Read timed out


ZK-1 - [2020-07-02 10:54:17,681] WARN Unable to read additional data from client sessionid 0x200887c07d10004, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-07-03 03:40:45,745] INFO Expiring session 0x200887c07d10004, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Submitting global closeSession request for session 0x200887c07d10004 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Expiring session 0x300887c2cee0002, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Submitting global closeSession request for session 0x300887c2cee0002 (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Expiring session 0x300887c2cee0001, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2020-07-03 03:40:45,745] INFO Submitting global closeSession request for session 0x300887c2cee0001 (org.apache.zookeeper.server.ZooKeeperServer)
### !!! I think everything starts here !!!
[2020-07-03 03:40:45,752] ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketTimeoutException: Read timed out
[2020-07-03 03:40:45,765] WARN ******* GOODBYE /10.233.106.50:54428 ******** (org.apache.zookeeper.server.quorum.LearnerHandler)
[2020-07-03 03:40:45,765] WARN Unexpected exception at LearnerHandler Socket[addr=/10.233.113.68,port=36912,localport=2888] tickOfNextAckDeadline:422901 synced?:true queuedPacketLength:2 (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketException: Broken pipe (Write failed)
[2020-07-03 03:40:45,762] ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketException: Connection reset


ZK-2 - [2020-07-03 03:41:15,203] ERROR Unexpected exception causing shutdown while sock still open (org.apache.zookeeper.server.quorum.LearnerHandler)
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)


BR-0 - [2020-07-03 03:40:37,481] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)


BR-1 - [2020-07-03 03:40:37,481] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)


BR-2 - [2020-07-03 03:40:43,411] ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either before or while waiting for connection

Kafka and zookeeper image versions:

kubectl describe pods -n kafka-prod | grep -i image |  grep -i zookeeper | tail -1
  Normal  Pulled     30m   kubelet, prod-k8s-w1  Container image "confluentinc/cp-zookeeper:5.4.1" already present on machine
kubectl describe pods -n kafka-prod | grep -i image |  grep -i cp-enterprise-kafka | tail -1
  Normal  Pulled     29m   kubelet, prod-k8s-w1  Container image "confluentinc/cp-enterprise-kafka:5.4.1" already present on machine

Helm chart version:

helm list confluent-prod
NAME            REVISION        UPDATED                         STATUS          CHART               APP VERSION     NAMESPACE
confluent-prod  1               Tue May  5 16:53:20 2020        DEPLOYED        cp-helm-charts-0.4.11.0             kafka-prod

Kubernetes version:

kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:36:19Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

s7an-it · Jun 22 '20 08:06

Can you please share the logs from all of the broker and zookeeper pods?

liko9 · Jun 22 '20 21:06

> Can you please share the logs from all of the broker and zookeeper pods?

@liko9, sorry, I couldn't get the logs from that incident. Today there was a new outage, and thanks to kafka-exporter I saw that behind the error above, and well before it, something else goes seriously wrong with one of the ZooKeepers. The error above only appears after a pod has restarted, but long before that, with no crash in any pod, I already have no brokers available and many other problems. Please check the logs in the initial post; I have updated them.

s7an-it · Jul 03 '20 10:07

Looks like this is fixed in the 5.5.1 image; testing.
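For anyone else trying the same thing, a minimal sketch of how an image bump can be rolled out with this chart. The cp-zookeeper.imageTag and cp-kafka.imageTag keys are assumptions based on the chart's values.yaml layout, so check your chart version before relying on them, and note this triggers a rolling update of the StatefulSets:

# Bump only the image tags, keeping all other deployed values
helm repo add confluentinc https://confluentinc.github.io/cp-helm-charts/
helm upgrade confluent-prod confluentinc/cp-helm-charts \
  --reuse-values \
  --set cp-zookeeper.imageTag=5.5.1 \
  --set cp-kafka.imageTag=5.5.1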

s7an-it · Jul 11 '20 03:07

@angelovopsan Could you please give me more information? Did you solve the issue by deploying the 5.5.1 image? I'm not sure the underlying problem is really solved, because the related ZooKeeper issues are still open and there is no related change in the ZooKeeper release notes:

  • https://issues.apache.org/jira/browse/ZOOKEEPER-2938
  • https://issues.apache.org/jira/browse/ZOOKEEPER-3036

marandalucas · Sep 29 '21 15:09

Noticed this with confluentinc/cp-zookeeper:6.0.1 and confluentinc/cp-enterprise-kafka:6.0.1. Please find attached the logs of all 3 ZooKeeper nodes: logs.tar.gz

Pod placement in K8s:

kafka-cp-kafka-0                          ip-172-31-165-143.us-west-2.compute.internal
kafka-cp-kafka-1                          ip-172-31-80-72.us-west-2.compute.internal
kafka-cp-kafka-2                          ip-172-31-165-143.us-west-2.compute.internal
kafka-cp-kafka-exporter-6f45f59b99-rdxtp  ip-172-31-80-72.us-west-2.compute.internal
kafka-cp-zookeeper-0                      ip-172-31-165-143.us-west-2.compute.internal
kafka-cp-zookeeper-1                      ip-172-31-82-166.us-west-2.compute.internal
kafka-cp-zookeeper-2                      ip-172-31-165-143.us-west-2.compute.internal
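Worth noting from that placement: kafka-cp-zookeeper-0 and kafka-cp-zookeeper-2 are co-located on ip-172-31-165-143, so losing that one node takes out two of the three ZooKeeper members. A minimal sketch of a values override that keeps ZooKeeper pods on distinct nodes; it assumes the chart passes a cp-zookeeper.affinity value through to the StatefulSet and labels the pods app: cp-zookeeper, both of which should be verified against your chart version before use:

# Hypothetical override file enforcing one ZooKeeper pod per node
cat > zk-anti-affinity.yaml <<'EOF'
cp-zookeeper:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: cp-zookeeper
          topologyKey: kubernetes.io/hostname
EOF
# Release name "kafka" is inferred from the pod names above; adjust as needed
helm upgrade kafka confluentinc/cp-helm-charts --reuse-values -f zk-anti-affinity.yaml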

Summary:
Zookeeper-0: couldn't connect to Zookeeper-2
Zookeeper-1: couldn't connect to Zookeeper-2 and Zookeeper-0
Zookeeper-2 continuously logs the lines below:
[2021-10-22 05:48:10,242] INFO Submitting global closeSession request for session 0x20047caaa57008f (org.apache.zookeeper.server.ZooKeeperServer)
[2021-10-22 05:48:14,242] INFO Expiring session 0x20047caaa570090, timeout of 18000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)

Fixed the issue for now with a rolling restart. Any help would be much appreciated.
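For reference, a minimal sketch of such a one-member-at-a-time restart, assuming the StatefulSet is named kafka-cp-zookeeper as the pod names above suggest, that it uses the default RollingUpdate strategy, and that kubectl is 1.15+ (rollout restart for StatefulSets is not available earlier):

# Restart ZooKeeper members one at a time and wait for the rollout to settle
# (run in the namespace where the chart is deployed)
kubectl rollout restart statefulset kafka-cp-zookeeper
kubectl rollout status statefulset kafka-cp-zookeeper --timeout=10m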

Thouqueer-Ahmed-ML · Oct 22 '21 09:10

I have the same problem with confluentinc/cp-zookeeper:5.5.2. Unfortunately upgrading the version does not help, as @ahmed3465 explained in this issue, so I have not tried that yet. It looks like restarting pods in a specific way causes the ZooKeeper ensemble to enter a corrupt state. As you describe, one ZooKeeper pod then keeps spamming Submitting global closeSession request... and Expiring session log lines, and this breaks the entire ZooKeeper cluster.

Is there a way to mitigate this issue? So far the only way to fix this for me was to scale down all Zookeeper pods in Kubernetes, then scale them back up.
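In case it saves someone a search, a minimal sketch of that scale-down/scale-up, using the StatefulSet name from the report above as a placeholder. Be aware this takes the whole ensemble (and therefore Kafka) down until quorum is back, and the app=cp-zookeeper label selector is an assumption about the chart's labels:

# Take the whole ensemble down, then bring it back up (brief full outage)
kubectl scale statefulset kafka-cp-zookeeper --replicas=0
kubectl wait --for=delete pod -l app=cp-zookeeper --timeout=5m
kubectl scale statefulset kafka-cp-zookeeper --replicas=3
kubectl rollout status statefulset kafka-cp-zookeeper --timeout=10m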

juv · Nov 23 '21 13:11

Update: try upgrading to 5.5.6; it seems to fix it for us. I have also found that on a newer EKS version (1.20) we see much less of this even without the image upgrade, so it looks to be somehow platform-related as well, but 5.5.6 does bump the ZooKeeper version.

s7an-it · Jan 04 '22 11:01

I had a similar issue with confluentinc/cp-zookeeper:6.1.0. If I drained the Kubernetes node that hosted the current ZooKeeper leader, the ZooKeeper cluster would enter this corrupt state after a quick leader election. The new leader only spammed Submitting global closeSession request... and Expiring session.

I tried adjusting maxSessionTimeout and a whole lot of other settings but couldn't get it fixed.
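For anyone who wants to try that knob with this chart: ZooKeeper derives its session-timeout bounds from tickTime (minSessionTimeout defaults to 2*tickTime, maxSessionTimeout to 20*tickTime), and those bounds cap whatever session timeout a client like Kafka asks for, e.g. the 6000 ms and 18000 ms values in the logs above. A minimal sketch, where cp-zookeeper.tickTime is an assumption about the chart's values (the image itself honours a ZOOKEEPER_TICK_TIME environment variable if your chart exposes that instead):

# Raising tickTime widens the allowed session-timeout range
# (minSessionTimeout = 2*tickTime, maxSessionTimeout = 20*tickTime by default)
# Replace <release> with your own release name
helm upgrade <release> confluentinc/cp-helm-charts \
  --reuse-values \
  --set cp-zookeeper.tickTime=3000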

Finally, an upgrade to confluentinc/cp-zookeeper:6.1.7 seems to have done the trick for me.

Jefski14 · Sep 20 '22 14:09