seldon-server

Crash spark-worker-controller pod

Open ghost opened this issue 8 years ago • 6 comments

Hello, I have installed Seldon on my local machine and am now trying to run the Reuters Newswire Recommendation example, but I have problems with the spark-worker-controller and reuters-import-data pods. The problems start after running the kubectl create -f import-data-job.json command. PS: I use a proxy to connect to the internet, and I have added env variables for http_proxy and https_proxy.

Can you help me please? Thank you in advance. Here are the logs of my pods.

For spark-worker-controller:

sed: -e expression #1, char 51: unknown option to `s'
=== Cannot resolve the DNS entry for spark-master. Has the service been created yet, and is SkyDNS functional?
=== See http://kubernetes.io/v1.1/docs/admin/dns.html for more details on DNS integration.
=== Sleeping 10s before pod exit.

The reuters-import-data pod gets stuck on ContainerCreating:

WARNING:kazoo.client:Connection dropped: socket connection error: Name or service not known
Traceback (most recent call last):
  File "/opt/conda/bin/seldon-cli", line 4, in <module>
    __import__('pkg_resources').run_script('seldon==2.0.0', 'seldon-cli')
  File "/opt/conda/lib/python2.7/site-packages/setuptools-18.5-py2.7.egg/pkg_resources/__init__.py", line 742, in run_script
  File "/opt/conda/lib/python2.7/site-packages/setuptools-18.5-py2.7.egg/pkg_resources/__init__.py", line 1667, in run_script
  File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.0-py2.7.egg/EGG-INFO/scripts/seldon-cli", line 5, in <module>
    seldon.cli.start_seldoncli()
  File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.0-py2.7.egg/seldon/cli/__init__.py", line 3, in start_seldoncli
    cli_main.main()
  File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.0-py2.7.egg/seldon/cli/cli_main.py", line 346, in main
    start_zk_client(opts)
  File "/opt/conda/lib/python2.7/site-packages/seldon-2.0.0-py2.7.egg/seldon/cli/cli_main.py", line 301, in start_zk_client
    gdata["zk_client"].start()
  File "/opt/conda/lib/python2.7/site-packages/kazoo/client.py", line 546, in start
    raise self.handler.timeout_exception("Connection time-out")
kazoo.handlers.threading.KazooTimeoutError: Connection time-out
connecting to zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181

ghost, Jun 23 '16 16:06

This looks like a DNS issue. How are you running Kubernetes?

ukclivecox, Jun 24 '16 08:06

Yes, I am running Kubernetes. Normally, if I can type kubectl get nodes, this means that Kubernetes is running properly, doesn't it?

For more information, this is the result of kubectl get pods:

kubectl get pods
NAME                            READY   STATUS              RESTARTS   AGE
influxdb-grafana-xegq8          2/2     Running             0          1d
k8s-etcd-127.0.0.1              1/1     Running             2          8d
k8s-master-127.0.0.1            4/4     Running             0          8d
k8s-proxy-127.0.0.1             1/1     Running             1          8d
kafka-controller-hrrko          1/1     Running             94         16h
memcached1-eo1ci                1/1     Running             0          1d
memcached2-ol2mt                1/1     Running             0          1d
mysql                           1/1     Running             0          1d
nginx-198147104-yq9p7           1/1     Running             0          8d
reuters-import-data-j16r9       1/1     Running             0          43s
seldon-control                  1/1     Running             0          1d
spark-master-controller-y5uoi   1/1     Running             0          16h
spark-worker-controller-cqfgn   1/1     Running             166        16h
spark-worker-controller-rcd98   1/1     Running             166        16h
td-agent-server                 1/1     Running             0          1d
zookeeper-1                     1/1     Running             0          1d
zookeeper-2                     1/1     Running             0          1d
zookeeper-3                     1/1     Running             0          1d

And after 1 minute it becomes:

NAME                            READY   STATUS              RESTARTS   AGE
influxdb-grafana-xegq8          2/2     Running             0          1d
k8s-etcd-127.0.0.1              1/1     Running             2          8d
k8s-master-127.0.0.1            4/4     Running             0          8d
k8s-proxy-127.0.0.1             1/1     Running             1          8d
kafka-controller-hrrko          1/1     Running             95         16h
memcached1-eo1ci                1/1     Running             0          1d
memcached2-ol2mt                1/1     Running             0          1d
mysql                           1/1     Running             0          1d
nginx-198147104-yq9p7           1/1     Running             0          8d
reuters-import-data-0pkmf       0/1     ContainerCreating   0          44s
seldon-control                  1/1     Running             0          1d
spark-master-controller-y5uoi   1/1     Running             0          16h
spark-worker-controller-cqfgn   0/1     CrashLoopBackOff    166        16h
spark-worker-controller-rcd98   0/1     CrashLoopBackOff    166        16h
td-agent-server                 1/1     Running             0          1d
zookeeper-1                     1/1     Running             0          1d
zookeeper-2                     1/1     Running             0          1d
zookeeper-3                     1/1     Running             0          1d

ghost, Jun 24 '16 08:06

Yes, but if you run Kubernetes locally via Docker, you need to start an internal DNS handler. Can you tell us how you installed Kubernetes, i.e. which of the ways described at http://kubernetes.io/docs/getting-started-guides/ you used?

If it was locally via Docker using http://kubernetes.io/docs/getting-started-guides/docker/ then you may need to set up DNS as described in http://kubernetes.io/docs/getting-started-guides/docker/#deploy-a-dns
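To confirm which names fail to resolve, a quick check can be run from inside one of the pods. This is only a minimal sketch, assuming Python is available in the container; the function name check_dns is made up for illustration, and the hostnames are the ones that appear in the logs above.

```python
import socket


def check_dns(hostnames, resolve=socket.gethostbyname):
    """Try to resolve each hostname; map it to its IP, or None on failure.

    `resolve` defaults to the standard library resolver and is a parameter
    only so the function can be exercised without a live DNS server.
    """
    results = {}
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError:
            # socket.gaierror ("Name or service not known") is an OSError subclass.
            results[name] = None
    return results
```

Calling check_dns(["spark-master", "zookeeper-1", "zookeeper-2", "zookeeper-3"]) from a Python shell in the pod returns a dict; any name mapped to None could not be resolved, which would point at the cluster DNS rather than at Seldon itself.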

ukclivecox, Jun 24 '16 08:06

Hi,

I am running Kubernetes on top of Mesos. I've set up SkyDNS and the basic busybox test passes. However, the spark-workers are not able to resolve spark-master:

=== Cannot resolve the DNS entry for spark-master. Has the service been created yet, and is SkyDNS functional?
=== See http://kubernetes.io/v1.1/docs/admin/dns.html for more details on DNS integration.
=== Sleeping 10s before pod exit.

Do you have suggestions about fixing this issue?

Thanks, Bogdan

bghit, Jul 23 '16 12:07

We've not tried running on Mesos yet. Have you followed the DNS steps in http://kubernetes.io/docs/getting-started-guides/mesos/ ?

ukclivecox, Jul 23 '16 13:07

Yes. I had an error in the SkyDNS config files. The workers connect to spark-master, but only workers that are co-located with the master get to run tasks. Remote workers seem to run CoarseGrainedExecutors, but they never execute tasks.

bghit, Jul 23 '16 15:07