pulsar-helm-chart

pulsar waiting to start: PodInitializing

freewaterfreewater opened this issue 3 years ago · 10 comments

[root@vm192-168-2-4 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.6", GitCommit:"d32e40e20d167e103faf894261614c5b45c44198", GitTreeState:"clean", BuildDate:"2020-05-20T13:08:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.17) exceeds the supported minor version skew of +/-1

[root@vm192-168-2-4 ~]# kubectl get pod
NAME                                      READY   STATUS     RESTARTS   AGE
nfs-client-provisioner-6f664979c5-zvkbs   1/1     Running    0          98s
nginx-86c57db685-5h4cd                    1/1     Running    0          61m
nginx-86c57db685-7vc7z                    1/1     Running    0          62m
pulsar-bookie-0                           0/1     Init:0/1   0          5h38m
pulsar-bookie-1                           0/1     Init:0/1   0          5h38m
pulsar-bookie-2                           0/1     Init:0/1   0          5h38m
pulsar-bookie-3                           0/1     Init:0/1   0          5h38m
pulsar-broker-0                           0/1     Init:0/2   0          5h38m
pulsar-broker-1                           0/1     Init:0/2   0          5h38m
pulsar-broker-2                           0/1     Init:0/2   0          5h38m
pulsar-grafana-74fbdb9cc6-6s7dc           1/1     Running    0          5h38m
pulsar-prometheus-556c44f697-2mk2g        1/1     Running    0          5h38m
pulsar-proxy-0                            0/1     Init:0/2   0          5h38m
pulsar-proxy-1                            0/1     Init:0/2   0          5h38m
pulsar-proxy-2                            0/1     Init:0/2   0          5h38m
pulsar-pulsar-manager-767d5f5766-nnpw9    1/1     Running    0          5h38m
pulsar-recovery-0                         0/1     Init:0/1   0          5h38m
pulsar-toolset-0                          1/1     Running    0          5h38m
pulsar-zookeeper-0                        1/1     Running    0          5h38m
pulsar-zookeeper-1                        1/1     Running    0          5h38m
pulsar-zookeeper-2                        1/1     Running    0          5h37m

[root@vm192-168-2-4 ~]# helm list
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
NAME     NAMESPACE   REVISION   UPDATED                                   STATUS     CHART          APP VERSION
pulsar   default     1          2021-05-31 12:08:47.649624912 +0800 CST   deployed   pulsar-2.7.1   2.7.1

[root@vm192-168-2-4 ~]# kubectl logs pulsar-broker-0
Error from server (BadRequest): container "pulsar-broker" in pod "pulsar-broker-0" is waiting to start: PodInitializing
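
While a pod is stuck in PodInitializing, the main container has no logs yet; the useful output is in the init containers. A hedged sketch for inspecting them (the init container names vary with the chart version, so list them first):

kubectl describe pod pulsar-broker-0                                            # shows which init container is still waiting
kubectl get pod pulsar-broker-0 -o jsonpath='{.spec.initContainers[*].name}'; echo
kubectl logs pulsar-broker-0 -c <init-container-name>                           # e.g. the container that waits for zookeeper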

freewaterfreewater · May 31 '21

In the zookeeper pod I ran the command bin/bookkeeper shell metaformat; the pulsar-bookie pods are OK and running now.

freewaterfreewater · Jun 03 '21

I then ran the command: bin/pulsar zookeeper-shell -server pulsar-zookeeper get /admin/clusters/pulsar

06:41:04.299 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp
06:41:04.299 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=<NA>
06:41:04.299 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux
06:41:04.299 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64
06:41:04.299 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.version=4.18.0-80.1.1.el7.ksyun.x86_64
06:41:04.299 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.name=root
06:41:04.299 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.home=/root
06:41:04.299 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/pulsar
06:41:04.299 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.free=65MB
06:41:04.299 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.max=128MB
06:41:04.299 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.total=77MB
06:41:04.302 [main] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=pulsar-zookeeper sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@4ebff610
06:41:04.306 [main] INFO org.apache.zookeeper.common.X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
06:41:04.311 [main] INFO org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 10485760 Bytes
06:41:04.317 [main] INFO org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=
06:41:04.325 [main-SendThread(pulsar-zookeeper:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server pulsar-zookeeper/10.0.5.11:2181. Will not attempt to authenticate using SASL (unknown error)
06:41:04.329 [main-SendThread(pulsar-zookeeper:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /10.0.5.11:50066, server: pulsar-zookeeper/10.0.5.11:2181
06:41:04.340 [main-SendThread(pulsar-zookeeper:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server pulsar-zookeeper/10.0.5.11:2181, sessionid = 0x100176ed33aabd3, negotiated timeout = 30000

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /admin/clusters/pulsar
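
The NoNode error means the /admin/clusters/pulsar znode was never created; it is normally written by the cluster-metadata initialization job. A hedged sketch of running that initialization by hand from the toolset pod (the cluster and service names come from the outputs above; the URLs are assumptions, adjust them to your setup):

kubectl exec -it pulsar-toolset-0 -- bin/pulsar initialize-cluster-metadata \
    --cluster pulsar \
    --zookeeper pulsar-zookeeper:2181 \
    --configuration-store pulsar-zookeeper:2181 \
    --web-service-url http://pulsar-broker:8080/ \
    --broker-service-url pulsar://pulsar-broker:6650/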

freewaterfreewater · Jun 03 '21

I have the same issue.

uiot-stack · Jun 09 '21

First, in values.yaml make sure that initialize is set to true.
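
The same can be set on the command line at install time (the apache/pulsar chart reference below is an assumption; use whatever repo or path you installed from):

# run the cluster-metadata initialization job on the first install
helm install pulsar apache/pulsar --set initialize=true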

Also, this might or might not help:

Add this line to proxy-configmap.yaml under the "data" section:

data:
  zookeeperServers: "{{ template "pulsar.zookeeper.connect" . }}{{ .Values.metadataPrefix }}"
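
To verify what ends up in the rendered ConfigMap, something like this should work (release and chart names are assumptions):

helm template pulsar apache/pulsar --set initialize=true \
    --show-only templates/proxy-configmap.yaml | grep -A5 'data:'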

provector · Jul 23 '21

Kind of similar problem for me. Reinstalling a pulsar cluster chart=2.7.7 with initialize=true results in brokers / proxies getting stuck waiting for zookeeper.


I could track it down to incorrect arguments passed to the pulsar-init job. Here are some logs:

Exception in thread "main" com.beust.jcommander.ParameterException: The following options are required: [-cs | --configuration-store], [-uw | --web-service-url], [-zk | --zookeeper], [-c | --cluster]
	at com.beust.jcommander.JCommander.validateOptions(JCommander.java:388)
	at com.beust.jcommander.JCommander.parse(JCommander.java:357)
	at com.beust.jcommander.JCommander.parse(JCommander.java:335)
	at org.apache.pulsar.PulsarClusterMetadataSetup.main(PulsarClusterMetadataSetup.java:146)
Usage: <main class> [options]
  Options:
    -ub, --broker-service-url
      Broker-service URL for new cluster
    -tb, --broker-service-url-tls
      Broker-service URL for new cluster with TLS encryption
  * -c, --cluster
      Cluster name
  * -cs, --configuration-store
      Configuration Store connection string
    --existing-bk-metadata-service-uri
      The metadata service URI of the existing BookKeeper cluster that you 
      want to use
    -h, --help
      Show this help message
      Default: false
    --initial-num-stream-storage-containers
      Num storage containers of BookKeeper stream storage
      Default: 16
    --initial-num-transaction-coordinators
      Num transaction coordinators will assigned in cluster
      Default: 16
  * -uw, --web-service-url
      Web-service URL for new cluster
    -tw, --web-service-url-tls
      Web-service URL for new cluster with TLS encryption
  * -zk, --zookeeper
      Local ZooKeeper quorum connection string
    --zookeeper-session-timeout-ms
      Local zookeeper session timeout ms
      Default: 30000

sh: 4: --cluster: not found

Note that the init job nevertheless exited with a success status, because there is an || true that swallows all errors.

I guess that should be considered a problem, should it not?
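
A quick plain-shell illustration of why the Job is still reported as successful, nothing chart-specific:

# || true turns any failure into exit code 0, so Kubernetes marks the Job as Complete
sh -c 'no-such-command || true'; echo "exit code: $?"
# prints something like:
#   sh: 1: no-such-command: not found
#   exit code: 0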

mkoertgen · Jan 04 '22

To me it looks like the problem has to do with the Helm templates rendering an empty line without a line-continuation backslash (\), which breaks the shell call. Here is what Helm renders:

bin/pulsar initialize-cluster-metadata \

                --cluster pulsar \
                --zookeeper pulsar-zookeeper:2181 \
                --configuration-store pulsar-zookeeper:2181 \
                --web-service-url http://pulsar-broker.pulsar.svc.cluster.local:8080/ \
                --web-service-url-tls https://pulsar-broker.pulsar.svc.cluster.local:8443/ \
                --broker-service-url pulsar://pulsar-broker.pulsar.svc.cluster.local:6650/ \
                --broker-service-url-tls pulsar+ssl://pulsar-broker.pulsar.svc.cluster.local:6651/ || true;

I exec'd into the shell of one of the running bookies to reproduce the problem, and yes, with the empty line present it gives the error message noted above.
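
For reference, the effect can be reproduced with a plain shell anywhere, using echo in place of bin/pulsar:

# The trailing backslash joins the first line with the *empty* line, which then
# terminates the command, so the option lines below run as a separate command.
sh -c 'echo initialize-cluster-metadata \

    --cluster pulsar \
    --configuration-store pulsar-zookeeper:2181 || true'
# prints "initialize-cluster-metadata" (with no options passed) and then fails with
# something like "sh: 3: --cluster: not found", which || true turns into success.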

Manually calling the cluster initialization with the empty line removed helped in my case to get the cluster working again.

mkoertgen · Jan 04 '22

Strange, because the Helm template looks fine to me, and I have no idea where this extra line is coming from:

  • https://github.com/apache/pulsar-helm-chart/blob/master/charts/pulsar/templates/pulsar-cluster-initialize.yaml#L92 — git blame shows no changes there for about the last 16 months.
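
For anyone who wants to check what their own chart version renders for that job, something along these lines should print the generated command (release and chart names are assumptions):

helm template pulsar apache/pulsar --set initialize=true \
    --show-only templates/pulsar-cluster-initialize.yaml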

mkoertgen · Jan 04 '22

A suggestion for the future: would it not be possible to default initialize to true from Helm's built-in objects, e.g. with {{- if .Release.IsInstall }}?

mkoertgen · Jan 04 '22

(quoting the comment above) "To me it looks like the problem has to do with the Helm templates rendering an empty line without a line-continuation backslash, which breaks the shell call. [...] Manually calling the cluster initialization with the empty line removed helped in my case to get the cluster working again."

This issue seems to be resolved by #166

lhotari · Jan 04 '22

@lhotari Good point. I will check today by upgrading to 2.7.8.

mkoertgen · Jan 10 '22

We had a similar issue; it turns out that the broker-init and pulsar-init scripts run in an image that needs at least 2.5Gi of memory. If the job is scheduled onto a cluster node without sufficient memory, the initialization fails and the cluster setup hangs.

We added:

bookkeeper:
  metadata:
    resources:
      requests:
        memory: 2.5Gi
      limits:
        memory: 2.5Gi
pulsar_metadata:
  metadata:
    resources:
      requests:
        memory: 2.5Gi
      limits:
        memory: 2.5Gi

This makes sure that:

  1. The job is assigned to a node with enough memory (requests)
  2. The initialization fails fast when there is no such node (limits)
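
If in doubt, the init job pod's scheduling and memory settings can be checked with something like the following (the pulsar-pulsar-init job name is an assumption; take the actual name from kubectl get jobs):

kubectl get jobs
kubectl describe pod -l job-name=pulsar-pulsar-init   # check Requests/Limits and Events (FailedScheduling, OOMKilled)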

michalcukierman · Jul 22 '23

It looks like this is similar to https://github.com/apache/pulsar-helm-chart/issues/373, where containers that run bin/bookkeeper or bin/pulsar set the JVM memory args to 2GB. Each call to any of these scripts causes a memory usage peak and may exceed the container's memory limit.
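
One way to see which heap settings those scripts pick up (a hedged sketch; the ConfigMap names are assumed to follow the <release>-<component> pattern and PULSAR_MEM / BOOKIE_MEM are the usual knobs, so verify against your install):

kubectl get configmap pulsar-broker -o yaml | grep -i MEM
kubectl get configmap pulsar-bookie -o yaml | grep -i MEM
# default heap baked into the image, as seen from a running pod
kubectl exec pulsar-toolset-0 -- grep -n PULSAR_MEM conf/pulsar_env.sh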

marekczajkowski · Jul 25 '23

It still happens even after setting:

  resources:
    requests:
      memory: 3Gi
    limits:
      memory: 3Gi

michalcukierman · Aug 01 '23

Please retry with the latest released version, 3.2.0, and create a new issue if the problem persists.

lhotari · Jan 26 '24