
ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [zookeeper.operator.svc.cluster.local:2181/kafka-operator].

Open tamipangadil opened this issue 4 years ago • 10 comments

I'm trying to install Confluent Operator in our K8s cluster, which has Istio in it. Although this setup isn't covered in the quick start guidelines, I'm hoping someone may have come across this problem.

Here are the steps I've taken (a rough sketch of the corresponding helm commands follows the list):

  1. Install the cluster with minimum requirements of 10 nodes
  2. Install istio components
  3. Install the operator charts
  4. Install zookeeper charts
  5. Install kafka charts <--- Always failed
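For reference, a minimal sketch of what steps 3-5 typically look like with the Confluent Operator 1.x helm bundle; the bundle path, provider values file ($VALUES_FILE), and release names are assumptions, not taken from this report:

# Run from the directory where the Operator helm bundle was extracted.
# $VALUES_FILE is a placeholder for the provider values file shipped with the bundle.
helm upgrade --install operator ./confluent-operator \
  --values $VALUES_FILE --namespace operator --set operator.enabled=true

helm upgrade --install zookeeper ./confluent-operator \
  --values $VALUES_FILE --namespace operator --set zookeeper.enabled=true

helm upgrade --install kafka ./confluent-operator \
  --values $VALUES_FILE --namespace operator --set kafka.enabled=true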

Errors from the Kafka container:

[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /192.168.XX.XX:45356, server: zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223)

[main] ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [zookeeper.operator.svc.cluster.local:2181/kafka-operator].
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /192.168.XX.XX:41440, server: zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181
[main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed
[main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x0

I tried some workarounds with no luck:

  1. Installed a VirtualService to point explicitly to the zookeeper service:
cat <<EOF | kubectl -n operator apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: zookeeper
spec:
  hosts:
  - zookeeper.operator.svc.cluster.local
  http:
  - name: zookeeper
    route:
    - destination:
        host: zookeeper
---
EOF
  2. Also added a ServiceEntry for both zookeeper and kafka:
cat <<EOF | kubectl -n operator apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: zookeeper
spec:
  location: MESH_INTERNAL
  hosts:
  - zookeeper.operator.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
EOF


cat <<EOF | kubectl -n operator apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: kafka
spec:
  location: MESH_INTERNAL
  hosts:
  - kafka.operator.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
EOF
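A side note on the second attempt: trafficPolicy is not a field of an Istio ServiceEntry; in the Istio API that setting lives on a DestinationRule, so the field above is most likely being ignored. A hedged sketch of the equivalent DestinationRule for the zookeeper service, with the host and namespace copied from above and everything else an assumption:

cat <<EOF | kubectl -n operator apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: zookeeper
spec:
  host: zookeeper.operator.svc.cluster.local
  trafficPolicy:
    tls:
      # Assumes in-mesh mTLS is what the original ServiceEntry was aiming for.
      mode: ISTIO_MUTUAL
EOF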

Is it related to any configuration of Confluent helm charts?

Thanks!

Reference page: Confluent Operator Quick Start.

tamipangadil avatar Feb 26 '20 17:02 tamipangadil

Getting the same issue. Were you able to solve this?

suraj2410 avatar Aug 01 '20 21:08 suraj2410

Is there any update on this? Just today I started having problems with Kafka.

ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server

I have 3 replicas of Zookeeper and Kafka, but kafka-0 is down; in its logs I see the error.

sanvir10 avatar Oct 15 '20 04:10 sanvir10

Seeing this more and more in our clusters now too, increasing CPU / memory for the zookeeper pods doesn't seem to help.

terryf82 avatar Oct 21 '20 00:10 terryf82

Yes, this is not a resource problem. I tested removing Istio injection:

kubectl label namespace NAMESPACE istio-injection-

But if you use Istio in your namespace, I recommend moving Kafka to another namespace.

NOTE: Replace NAMESPACE with your own namespace.

Then reinstall Kafka, and it is working now. I think there are components in the cluster blocking the communication, but I'm not sure.
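For anyone trying the same thing, a minimal sketch of that workaround; the namespace and release names (kafka-ns, kafka) are hypothetical:

# Remove the injection label from the namespace running Kafka.
# This only affects pods created afterwards, so the release must be reinstalled or restarted.
kubectl label namespace kafka-ns istio-injection-

# Reinstall the chart, either here or in a namespace that never had injection enabled.
helm uninstall kafka -n kafka-ns
helm install kafka confluentinc/cp-helm-charts -n kafka-ns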

sanvir10 avatar Oct 21 '20 03:10 sanvir10

Based on this discussion and a similar thread in the Confluent #ops Slack, I believe a bug in the version of Zookeeper that Confluent is using (3.5.8) is responsible.

We have successfully tested disabling cp-zookeeper in these charts, and running Apache Zookeeper 3.6.2 via the incubator helm charts as part of a confluent-apache hybrid cluster. So far we've had nearly a week of stable running, even after several routine zookeeper pod restarts.

terryf82 avatar Oct 27 '20 05:10 terryf82

I am facing the same issue, but it happened after about a day of running, not at deployment time.

nrvmodi avatar Apr 05 '21 05:04 nrvmodi

I have the same problem. It appears that when the zookeeper leader restarts, one of the followers is successfully promoted to leader, but the cp-kafka brokers are unable to connect to the new leader. I edited the zookeeper statefulset, changing the number of replicas from 3 to 0, waited for all of the zookeeper instances to shut down, and then changed the number of replicas back up to 3. Once the zookeeper instances were running again, the brokers were able to connect to zookeeper and resume running. Editing the number of zookeeper replicas like this obviously isn't a fix to the problem.
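A sketch of that restart dance with kubectl; the namespace, statefulset name, and label selector below are assumptions based on how cp-helm-charts typically names its resources (<release>-cp-zookeeper):

# Scale zookeeper down to zero and wait for every pod to terminate.
kubectl -n kafka scale statefulset my-release-cp-zookeeper --replicas=0
kubectl -n kafka get pods -l app=cp-zookeeper --watch

# Then scale back up; brokers should reconnect once the ensemble re-forms.
kubectl -n kafka scale statefulset my-release-cp-zookeeper --replicas=3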

matt-best-elateral avatar Aug 20 '21 10:08 matt-best-elateral

@terryf82 Could you please give me more information about your workaround? Does it work successfully?

marandalucas avatar Oct 28 '21 12:10 marandalucas

@marandalucas our hybrid setup has been working well for over a year now:

  1. deploy apache-zookeeper using the incubator helm charts mentioned above
  2. deploy confluent-platform with the cp-zookeeper chart disabled, and configure the other components (e.g. cp-kafka) to access zookeeper at the correct service address

I guess if you're asking this that still means there's a problem with cp-zookeeper stability?

terryf82 avatar Oct 28 '21 21:10 terryf82

@marandalucas I disabled the Confluent zookeeper and used Apache zookeeper (Helm 3) from Bitnami:

  1. Disable Confluent's zookeeper chart in values.yaml:
cp-helm-charts:
  # Disable confluent's zookeeper
  cp-zookeeper:
    enabled: false
...
  # Also disable zookeeper within the broker config
  cp-kafka:
    ....
    cp-zookeeper:
      enabled: false
  ...
# Configure bitnami's zookeeper (https://github.com/bitnami/charts/blob/master/bitnami/zookeeper/)
zookeeper:
  image:
    tag: 3.7.0-debian-10-r127
...
  2. Set the URL of the new zookeeper service as part of the install command:
helm repo add confluentinc https://confluentinc.github.io/cp-helm-charts/
helm repo add bitnami https://charts.bitnami.com/bitnami
...
helm upgrade \
...
--set cp-helm-charts.cp-kafka.cp-zookeeper.url=my-other-zookeeper:2181
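For illustration only, a fully spelled-out variant of that command; the umbrella chart path, release name, and namespace (./my-umbrella-chart, kafka, kafka-ns) are hypothetical, while the --set value is the one from the snippet above:

helm repo add confluentinc https://confluentinc.github.io/cp-helm-charts/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Assumes an umbrella chart that declares cp-helm-charts and bitnami/zookeeper
# as dependencies and is configured by the values.yaml shown in step 1.
helm upgrade --install kafka ./my-umbrella-chart \
  --namespace kafka-ns --create-namespace \
  --values values.yaml \
  --set cp-helm-charts.cp-kafka.cp-zookeeper.url=my-other-zookeeper:2181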

matt-best-elateral avatar Oct 28 '21 21:10 matt-best-elateral