zookeeper-operator
Unable to deploy Zookeeper Cluster
Description
I'm trying to deploy a Pravega cluster to EKS but cannot get a Zookeeper cluster running. I've deployed the zookeeper-operator and zookeeper charts; the logs show no errors that I can see, but there are no Zookeeper pods.
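Roughly how I installed the charts, for reference (the repo URL, release names, and value paths are from memory, so treat them as assumptions rather than exact commands):

helm repo add pravega https://charts.pravega.io
helm repo update
helm install zookeeper-operator pravega/zookeeper-operator -n pravega
helm install zookeeper pravega/zookeeper -n pravega --set replicas=3 --set persistence.storageClassName=gp3

The zookeepercluster resource never reports any ready replicas: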
kubectl get zookeepercluster -n pravega
NAME        REPLICAS   READY REPLICAS   VERSION   DESIRED VERSION   INTERNAL ENDPOINT   EXTERNAL ENDPOINT   AGE
zookeeper   3                                     0.2.15                                                    3m37s
kubectl describe zookeepercluster/zookeeper -n pravega
Name:         zookeeper
Namespace:    pravega
Labels:       app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=zookeeper
              app.kubernetes.io/version=0.2.15
              helm.sh/chart=zookeeper-0.2.15
Annotations:  meta.helm.sh/release-name: zookeeper
              meta.helm.sh/release-namespace: pravega
API Version:  zookeeper.pravega.io/v1beta1
Kind:         ZookeeperCluster
Metadata:
  Creation Timestamp:  2023-05-10T15:12:04Z
  Generation:          1
  Resource Version:    5054825
Spec:
  Config:
    Pre Alloc Size:  16384
  Image:
    Repository:  pravega/zookeeper
    Tag:         0.2.15
  Kubernetes Cluster Domain:  cluster.local
  Persistence:
    Reclaim Policy:  Delete
    Spec:
      Resources:
        Requests:
          Storage:         20Gi
      Storage Class Name:  gp3
  Pod:
    Service Account Name:  zookeeper
  Probes:
    Liveness Probe:
      Failure Threshold:      3
      Initial Delay Seconds:  10
      Period Seconds:         10
      Timeout Seconds:        10
    Readiness Probe:
      Failure Threshold:      3
      Initial Delay Seconds:  10
      Period Seconds:         10
      Success Threshold:      1
      Timeout Seconds:        10
  Replicas:      3
  Storage Type:  persistence
Events:  <none>
kubectl get job -n pravega
NAME COMPLETIONS DURATION AGE
job.batch/zookeeper-post-install-upgrade 0/1 2m34s 2m34s
kubectl get pod -n pravega
NAME READY STATUS RESTARTS AGE
pod/nfs-server-provisioner-0 1/1 Running 0 6h16m
pod/pravega-operator-69f9b6fd48-86942 1/1 Running 0 6h15m
pod/zookeeper-operator-66f95cb4b9-xhzfr 1/1 Running 0 6h28m
pod/zookeeper-post-install-upgrade-4cbtf 0/1 Error 0 2m24s
pod/zookeeper-post-install-upgrade-gpd9b 0/1 Error 0 71s
pod/zookeeper-post-install-upgrade-lv8vp 0/1 Error 0 4m22s
pod/zookeeper-post-install-upgrade-pvbtc 0/1 Error 0 3m27s
kubectl get zookeepercluster -n pravega
NAME        REPLICAS   READY REPLICAS   VERSION   DESIRED VERSION   INTERNAL ENDPOINT   EXTERNAL ENDPOINT   AGE
zookeeper   3                                     0.2.15                                                    7m35s
kubectl logs replicaset.apps/zookeeper-operator-66f95cb4b9 -n pravega
{"level":"info","ts":1683708510.5929866,"logger":"cmd","msg":"zookeeper-operator Version: 0.2.14-16"}
{"level":"info","ts":1683708510.593027,"logger":"cmd","msg":"Git SHA: 28d1f69"}
{"level":"info","ts":1683708510.5930321,"logger":"cmd","msg":"Go Version: go1.19.7"}
{"level":"info","ts":1683708510.5930438,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
I0510 08:48:31.643734 1 request.go:601] Waited for 1.036627133s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/secrets.crossplane.io/v1alpha1?timeout=32s
time="2023-05-10T08:48:38Z" level=info msg="Leader lock zookeeper-operator-lock not found in namespace pravega"
{"level":"info","ts":1683708518.8782103,"logger":"leader","msg":"Trying to become the leader."}
I0510 08:48:41.679506 1 request.go:601] Waited for 2.794054996s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/mediapackage.aws.upbound.io/v1beta1?timeout=32s
{"level":"info","ts":1683708527.1470127,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1683708527.1528108,"logger":"leader","msg":"Became the leader."}
I0510 08:48:51.703887 1 request.go:601] Waited for 4.539519742s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/imagebuilder.aws.upbound.io/v1beta1?timeout=32s
{"level":"info","ts":1683708535.425317,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":"127.0.0.1:6000"}
{"level":"info","ts":1683708535.425616,"logger":"cmd","msg":"starting manager"}
{"level":"info","ts":1683708535.426141,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"127.0.0.1:6000"}
{"level":"info","ts":1683708535.4262583,"msg":"Starting EventSource","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster","source":"kind source: *v1beta1.ZookeeperCluster"}
{"level":"info","ts":1683708535.4264324,"msg":"Starting EventSource","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster","source":"kind source: *v1.StatefulSet"}
{"level":"info","ts":1683708535.426441,"msg":"Starting EventSource","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster","source":"kind source: *v1.Service"}
{"level":"info","ts":1683708535.4264479,"msg":"Starting EventSource","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster","source":"kind source: *v1.Pod"}
{"level":"info","ts":1683708535.4264512,"msg":"Starting Controller","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster"}
{"level":"info","ts":1683708535.5296795,"msg":"Starting workers","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster","worker count":1}
I wondered if storage was an issue and have tried both the persistence and ephemeral options with no success. I have 3 nodes in my Kubernetes node group (t3.mediums), which I assume is sufficient. The times shown above are short, but I've waited over an hour and still nothing.
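For context, this is roughly how I switched between the two storage modes in my values file (the exact key names are my best reading of the chart's values.yaml, so treat the paths as assumptions):

# values used for the persistent attempt (matches the describe output above)
storageType: persistence
persistence:
  storageClassName: gp3
  volumeSize: 20Gi
  reclaimPolicy: Delete

# the ephemeral attempt simply swapped the storage type
# storageType: ephemeral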
How can I debug this?
Importance
Blocker.
Hi @junglie85, did you check the error logs of the post-install pods? Can you post them here?
Hey @subhranil05, the logs aren't very helpful...
kubectl -n pravega logs pod/zookeeper-post-install-upgrade-tg6q7
Checking for ready ZK replicas
ZK replicas not ready
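From those messages my understanding is that the post-install hook only waits for ready ZK replicas, so the underlying question is why the operator never creates the ZooKeeper StatefulSet in the first place. Generic checks worth running here (plain kubectl, nothing chart-specific):

kubectl get statefulset,pvc -n pravega
kubectl get events -n pravega --sort-by=.metadata.creationTimestamp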
I think I've found the problem:
watchNamespace:
- pravega
It should be:
watchNamespace: pravega
Is there any reason why the chart doesn't accept the list of namespaces to watch as a YAML list and convert it to a string if needed?
{{ join "," .Values.watchNamespace }}
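Something along these lines is what I had in mind (the env var name and template location are assumptions on my part, not the chart's actual layout):

# values.yaml
watchNamespace:
  - pravega
  - other-namespace

# in the operator Deployment template, joined back into the comma-separated
# string the operator expects
- name: WATCH_NAMESPACE
  value: {{ join "," .Values.watchNamespace | quote }}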
Hello @junglie85, did you find a solution? I get the same error even when setting the namespace or an empty string... the zookeeper-0 pod never becomes healthy.
I already have one Zookeeper cluster. I tried to install another one in a different namespace with watchNamespace set, but the operator doesn't seem to honor it.
When I try to uninstall the operator, it doesn't uninstall and gives an error that Zookeeper instances (from the previous installation) are still running.
We have contributed new guidelines for deploying Pravega on EKS: https://github.com/pravega/pravega/tree/master/deployment/aws-eks. Once I deployed the cluster with a volume provisioner and the right permissions, I had no problem deploying Zookeeper. Hope it helps.
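If it helps others hitting this, a couple of quick checks (plain kubectl, nothing Pravega-specific) to confirm the volume provisioner is actually in place before installing the chart; on EKS a gp3 storage class is typically backed by the EBS CSI driver add-on:

kubectl get storageclass
kubectl get pvc -n pravega
kubectl get pods -n kube-system | grep -i ebs-csi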
I am trying this on our own Kubernetes cluster.