cp-helm-charts
ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [zookeeper.operator.svc.cluster.local:2181/kafka-operator].
I'm trying to install Confluent Operator in our K8s cluster with Istio in it. Although instructions for this aren't included in the quick start guidelines, I'm hoping someone may have come across this problem.
Here are the steps I've taken (a rough sketch of the commands is just below the list):
- Install the cluster with the minimum requirement of 10 nodes
- Install istio components
- Install the operator charts
- Install zookeeper charts
- Install kafka charts <--- Always fails
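For reference, the install sequence I'm running looks roughly like this. The chart path, provider file and release names are from my environment (following the Confluent Operator quick start), not something prescribed here:

# Namespace and the *.enabled flags follow the Confluent Operator quick start pattern; adjust to your setup
kubectl create namespace operator
helm install operator ./confluent-operator -f ./providers/<provider>.yaml --namespace operator --set operator.enabled=true
helm install zookeeper ./confluent-operator -f ./providers/<provider>.yaml --namespace operator --set zookeeper.enabled=true
helm install kafka ./confluent-operator -f ./providers/<provider>.yaml --namespace operator --set kafka.enabled=true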
Errors from the kafka container:
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /192.168.XX.XX:45356, server: zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223)
[main] ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [zookeeper.operator.svc.cluster.local:2181/kafka-operator].
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /192.168.XX.XX:41440, server: zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181
[main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed
[main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x0
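The TCP connection is established and then immediately reset, which makes me suspect the Envoy sidecars rather than zookeeper itself. Checks along these lines can help narrow it down (pod and container names are assumptions from my install, adjust to yours):

# Sidecar logs on both ends; repeated connection-reset / TLS errors here point at the mesh
kubectl -n operator logs zookeeper-0 -c istio-proxy --tail=100
kubectl -n operator logs kafka-0 -c istio-proxy --tail=100

# Basic TCP reachability from the broker container (bash's /dev/tcp trick, assuming
# bash is present in the image; a zero exit status only means the connect succeeded)
kubectl -n operator exec kafka-0 -c kafka -- \
  bash -c 'exec 3<>/dev/tcp/zookeeper.operator.svc.cluster.local/2181 && echo "tcp connect ok"'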
Tried some workarounds with no luck:
- Install a VirtualService to explicitly point to the zookeeper service
cat <<EOF | kubectl -n operator apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: zookeeper
spec:
  hosts:
  - zookeeper.operator.svc.cluster.local
  http:
  - name: zookeeper
    route:
    - destination:
        host: zookeeper
---
EOF
- And also a ServiceEntry for both zookeeper and kafka
cat <<EOF | kubectl -n operator apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: zookeeper
spec:
  location: MESH_INTERNAL
  hosts:
  - zookeeper.operator.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
EOF
cat <<EOF | kubectl -n operator apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: kafka
spec:
  location: MESH_INTERNAL
  hosts:
  - kafka.operator.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
EOF
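For anyone experimenting further: as far as I can tell, trafficPolicy is a DestinationRule field rather than a ServiceEntry field, and zookeeper speaks a raw TCP protocol, so the http route in the VirtualService above would never match. The equivalent objects would look roughly like this (untested sketch, same names and namespace as above):

cat <<EOF | kubectl -n operator apply -f -
# DestinationRule is where trafficPolicy actually belongs
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: zookeeper
spec:
  host: zookeeper.operator.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
# A tcp (not http) route for the zookeeper client port
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: zookeeper-tcp
spec:
  hosts:
  - zookeeper.operator.svc.cluster.local
  tcp:
  - match:
    - port: 2181
    route:
    - destination:
        host: zookeeper.operator.svc.cluster.local
        port:
          number: 2181
EOF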
Is this related to any configuration of the Confluent helm charts?
Thanks!
Reference page: Confluent Operator Quick Start.
Getting the same issue. Were you able to solve this?
Is there any update about this? I just started having problems with Kafka today.
ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server
I have 3 replicas of both Zookeeper and Kafka, but kafka-0 is down and I see this error in its logs.
Seeing this more and more in our clusters now too; increasing CPU / memory for the zookeeper pods doesn't seem to help.
Yes, this is not a resource problem. I tested removing istio injection:
kubectl label namespace NAMESPACE istio-injection-
NOTE: change NAMESPACE to your own namespace.
But if you use istio in your namespace, I recommend moving kafka to another namespace. Then I reinstalled kafka and it is working now. I think there are components in the cluster that block the communication, but I'm not sure.
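One thing to keep in mind with this workaround: removing the istio-injection label only affects pods created afterwards, so the existing kafka/zookeeper pods keep their Envoy sidecars until they are recreated. A minimal sketch (statefulset names are placeholders, match whatever your charts created):

# The trailing '-' deletes the label; new pods will no longer get a sidecar
kubectl label namespace NAMESPACE istio-injection-

# Recreate the pods so they come back without the sidecar (or reinstall the charts, as above)
kubectl -n NAMESPACE rollout restart statefulset kafka zookeeper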
Based on this discussion and a similar thread in the Confluent #ops Slack, I believe it is a bug in the version of Zookeeper that Confluent is using (3.5.8) that is responsible.
We have successfully tested disabling cp-zookeeper in these charts, and running Apache Zookeeper 3.6.2 via the incubator helm charts as part of a confluent-apache hybrid cluster. So far we've had nearly a week of stable running, even after several routine zookeeper pod restarts.
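For anyone wanting to try the same hybrid approach, the rough shape of it is below; the repo URL points at the archived incubator location, and the namespace and values are assumptions rather than copied from our manifests:

# Incubator charts now live at the archive location
helm repo add incubator https://charts.helm.sh/incubator
helm repo update

# Run a 3-node Apache Zookeeper 3.6.2 ensemble alongside the Confluent charts
helm install zookeeper incubator/zookeeper \
  --namespace operator \
  --set replicaCount=3 \
  --set image.tag=3.6.2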
I am facing the same issue, but it happened after a few days of running (1 day in my case), not at deployment time.
I have the same problem. It appears that when the zookeeper leader restarts, one of the followers is successfully promoted to leader, but the cp-kafka brokers are unable to connect to the new leader. I edited the zookeeper statefulset, changing the number of replicas from 3 to 0, and then, after saving and waiting for all of the zookeeper instances to terminate, changed the number of replicas back up to 3. Once the zookeeper instances were running again, the brokers were once again able to connect to zookeeper and resume running. Editing the number of zookeeper replicas like this obviously isn't a fix to the problem.
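If it helps anyone else stuck in the same state, the same scale-down/scale-up cycle can be done with kubectl directly; the statefulset name, namespace, and label selector here are assumptions from my install:

# Scale the zookeeper ensemble down to zero and wait for the pods to terminate
kubectl -n operator scale statefulset zookeeper --replicas=0
kubectl -n operator get pods -l app=zookeeper -w

# Bring it back up; the brokers reconnected on their own once the ensemble was healthy again
kubectl -n operator scale statefulset zookeeper --replicas=3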
@terryf82 Could you please give me more information about your workaround? Does it work successfully?
@marandalucas our hybrid setup has been working well for over a year now:
- deploy apache-zookeeper using those helm charts
- deploy confluent-platform with the cp-zookeeper chart disabled, and configure the other components (e.g. cp-kafka) to access zookeeper at the correct service address
I guess if you're asking this, that still means there's a problem with cp-zookeeper stability?
@marandalucas I disabled the confluent zookeeper and used apache zookeeper (helm3) from bitnami
- Disable the confluent zookeeper chart in values.yaml:
cp-helm-charts:
  # Disable confluent's zookeeper
  cp-zookeeper:
    enabled: false
  ...
  # Also disable zookeeper within the broker config
  cp-kafka:
    ...
    cp-zookeeper:
      enabled: false
    ...

# Configure bitnami's zookeeper (https://github.com/bitnami/charts/blob/master/bitnami/zookeeper/)
zookeeper:
  image:
    tag: 3.7.0-debian-10-r127
  ...
- Set the URL of the new zookeeper service as part of the install command:
helm repo add confluentinc https://confluentinc.github.io/cp-helm-charts/
helm repo add bitnami https://charts.bitnami.com/bitnami
...
helm upgrade \
  ...
  --set cp-helm-charts.cp-kafka.cp-zookeeper.url=my-other-zookeeper:2181
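A quick way to sanity-check that the brokers really picked up the external zookeeper after the upgrade (the release, namespace, pod, and container names below are hypothetical):

# Confirm the value landed in the release
helm get values my-release -n my-namespace | grep -A 2 cp-zookeeper

# Look for the broker registering against the bitnami zookeeper service
kubectl -n my-namespace logs my-release-cp-kafka-0 -c cp-kafka-broker | grep -i zookeeper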