dcos-zeppelin
Upgrade to Zeppelin 0.7.0.
It's not clear to me how important the package/*
files are here vs. those in https://github.com/mesosphere/universe/, but I just submitted a PR to add Zeppelin 0.7.0 there, https://github.com/mesosphere/universe/pull/1016. If the files here are important, then the corresponding updates are in my fork of this repo: https://github.com/deanwampler/dcos-zeppelin/tree/zeppelin-0.7.0
This is what I see after deploying this PR to a DC/OS cluster, opening the Zeppelin server, and logging in as "guest":
No error message was shown but the execution failed.
The commit for guest access in shiro should not be there, sorry about that. Let me remove it. I am also confused by the package information that is both on universe and in the docker repo.
The package/package.json and package/resource.json are base64-encoded into DCOS_PACKAGE_METADATA so that DC/OS (specifically Marathon) can display package info. The package/marathon.json.mustache determines how Marathon will run the container and which environment variables to set for it. The package/config.json defines what you can set through the Marathon package setup interface, such as the Docker image for the Zeppelin service and the environment variables you want Marathon to pass into the container. Therefore, package/config.json and package/marathon.json.mustache go hand in hand.
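To make the base64 step above concrete, here is a minimal sketch of the encode/decode round trip. The JSON content is made up for the example; a real package.json carries much more metadata.

```shell
# Hypothetical illustration: package metadata gets base64-encoded into an
# environment variable such as DCOS_PACKAGE_METADATA (field values here
# are invented for the example).
pkg_json='{"name":"zeppelin","version":"0.7.0"}'
metadata=$(printf '%s' "$pkg_json" | base64 | tr -d '\n')
echo "DCOS_PACKAGE_METADATA=$metadata"

# Decoding the variable recovers the original JSON:
printf '%s' "$metadata" | base64 -d
```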
I forked the dcos-zeppelin project (https://github.com/jshenguru/dcos-zeppelin), changed it from a package-based setup to a Marathon-app-based one, and made a Zeppelin 0.7.0 Docker image (https://hub.docker.com/r/jshenguru/zeppelin/). There is no need to wait anymore.
I'm trying hard to get this to work properly. If I run a simple notebook with these two lines (which work perfectly well on 0.5.6/0.6.0):
val c = sc.parallelize(Array((1,2),(2,4)))
c.count()
It gets stuck saying it can't find resources:
WARN [2017-04-03 09:20:11,420] ({Timer-0} Logging.scala[logWarning]:70) - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
and Mesos task error says:
Running docker -H unix:///var/run/docker.sock run --cpu-shares 2048 --memory 9448718336 -e SPARK_EXECUTOR_OPTS= -e SPARK_USER=root -e SPARK_EXECUTOR_MEMORY=8192m -e LIBPROCESS_IP=10.170.15.6 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e
/bin/sh: 1: /mnt/mesos/sandbox/spark-1.6.3-1-bin-2.6/bin/spark-class: not found
Highlighted the relevant parts: where is it getting the idea that I have 2048 cores and 94 GB of RAM? It gets stuck forever at the c.count() part.
@jshenguru In the screenshot you attached, did you also test Spark SQL code? I tested your image on a DC/OS cluster but the sql part is stuck in "RUNNING 0%" status.
@deanwampler @ebastien @jshenguru Yes, none of these docker images work properly with spark, especially in distributed mode.
@yuchaoran2011 Yes, I ran through the whole Zeppelin 0.7.0 Spark tutorial code with success on DC/OS 1.8.8 with Spark 2.1.0. After painful debugging, I found out that the trick is to remove all three Hadoop config files at /etc/hadoop/*.xml. You can add "rm /etc/hadoop/*.xml" to the cmd section of the zeppelin-0.7.0.json file that I created at https://github.com/jshenguru/dcos-zeppelin. The *.xml files from the Mesosphere GitHub are out of date; I am looking into their official Docker images to see what the right Hadoop settings should be on DC/OS 1.8 so that I can fix this elegantly.
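For reference, a minimal sketch of where that "rm" would go in a Marathon app definition. This is not the exact content of the zeppelin-0.7.0.json mentioned above; the id, resource values, and image tag are illustrative.

```json
{
  "id": "/zeppelin",
  "cmd": "rm -f /etc/hadoop/*.xml && bin/zeppelin.sh start",
  "cpus": 2,
  "mem": 8192,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "jshenguru/zeppelin" }
  }
}
```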
@ZDi-it The Docker image that I built is working. You can pull it from https://hub.docker.com/r/jshenguru/zeppelin/. The only hack you need is to remove the /etc/hadoop/*.xml files, which are way out of date. You can add "rm /etc/hadoop/*.xml" to the cmd section of the Marathon app config file that I created at https://github.com/jshenguru/dcos-zeppelin. I am looking into their official HDFS images to see what the right settings should be now, so I can fix this elegantly.
That's not a trick. That's a huge loss of functionality. I'm getting .jars and parquet files from hdfs.
@ZDi-it I am using AWS S3A for files, so that trick works for me. But I get your point and I am looking into Mesosphere HDFS stuff to see what the updated settings should be in the hdfs-site.xml. Once I find out the correct settings, I will update the docker and let you guys know. Stay tuned.
@jshenguru I can confirm that adding the *-site.xml files via fetch and then copying them from the sandbox to /etc/hadoop gets HDFS working. Spark/Scala parallelization works as well.
Actually your docker is the only one that works. I have added R and matplotlib support, I will do some testing and post here again.
@eolix Thank you for your feedback on my docker image as well as your inputs on the *-site.xml.
I looked into the Mesosphere official Spark docker images and "stole" the following two lines from the conf/spark-env.sh into the zeppelin-env.sh.
[ -f "${MESOS_SANDBOX}/hdfs-site.xml" ] && cp "${MESOS_SANDBOX}/hdfs-site.xml" "${HADOOP_CONF_DIR}"
[ -f "${MESOS_SANDBOX}/core-site.xml" ] && cp "${MESOS_SANDBOX}/core-site.xml" "${HADOOP_CONF_DIR}"
That way, if you have HDFS deployed on your DC/OS cluster, this grabs the proper config files generated by DC/OS. If you don't have HDFS deployed, nothing gets copied into /etc/hadoop (which is what I need in my case).
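The conditional-copy guard can be exercised on its own; this sketch uses temporary directories in place of $MESOS_SANDBOX and /etc/hadoop to show that only files that actually exist get copied.

```shell
# Demonstration of the [ -f ... ] && cp guard pattern, with temp dirs
# standing in for the Mesos sandbox and the Hadoop conf dir.
SANDBOX=$(mktemp -d)
CONF=$(mktemp -d)
echo '<configuration/>' > "$SANDBOX/hdfs-site.xml"   # HDFS config is present
# core-site.xml is deliberately absent.

[ -f "$SANDBOX/hdfs-site.xml" ] && cp "$SANDBOX/hdfs-site.xml" "$CONF"
[ -f "$SANDBOX/core-site.xml" ] && cp "$SANDBOX/core-site.xml" "$CONF" || true

ls "$CONF"   # only hdfs-site.xml was copied
```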
I've fully updated my Github at (https://github.com/jshenguru/dcos-zeppelin), rebuilt the dcos-zeppelin docker image (I decided to change the name to dcos-zeppelin to avoid confusion) and uploaded to the Docker hub (https://hub.docker.com/r/jshenguru/dcos-zeppelin/).
I did a quick test using the tutorial code and everything works fine.
I did all this because I don't know how long it will take to get all these changes merged into the Mesosphere official repository. I just cannot wait.
@jshenguru I've been pressuring DC/OS support for a while now... I'll test your new Docker image tomorrow morning, but I expect it to run without glitches.
I have installed R and matplotlib inside your docker with excellent results, I'll update on that as well tomorrow.
This is @jshenguru's image modified to include R and matplotlib (of course, not all R packages are present, but you can install them if you wish)
https://hub.docker.com/r/skydap/zeppelin-experimental/tags/
skydap/zeppelin-experimental:0.7.0
@eolix Do you mind sharing the Dockerfile?
@jshenguru sure
FROM mesosphere/mesos:1.1.0-2.0.107.ubuntu1404
ENV DEBIAN_FRONTEND noninteractive
RUN echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list \
&& gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9 \
&& gpg -a --export E084DAB9 | apt-key add - \
&& apt-get update \
&& apt-get install -y software-properties-common \
&& add-apt-repository ppa:openjdk-r/ppa \
&& apt-get update \
&& apt-get install -y openjdk-8-jdk tar curl wget unzip r-base libssl-dev
# hadoop config
ENV HADOOP_CONF_DIR /etc/hadoop
# ADD ./conf/hadoop/hdfs-site.xml /etc/hadoop/hdfs-site.xml
# ADD ./conf/hadoop/core-site.xml /etc/hadoop/core-site.xml
# ADD ./conf/hadoop/mesos-site.xml /etc/hadoop/mesos-site.xml
# java
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
RUN update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
# R installation
RUN cd /tmp && \
R -e "install.packages('devtools', repos = 'http://cran.us.r-project.org')" && \
R -e "install.packages('knitr', repos = 'http://cran.us.r-project.org')" && \
R -e "install.packages('ggplot2', repos = 'http://cran.us.r-project.org')" && \
R -e "install.packages('rJava', repos = 'http://cran.us.r-project.org')" && \
R -e "install.packages('RJDBC', repos = 'http://cran.us.r-project.org')" && \
wget https://cran.r-project.org/src/contrib/magrittr_1.5.tar.gz && \
wget https://cran.r-project.org/src/contrib/stringr_1.2.0.tar.gz && \
wget https://cran.r-project.org/src/contrib/evaluate_0.10.tar.gz && \
wget https://cran.r-project.org/src/contrib/knitr_1.15.1.tar.gz && \
wget https://github.com/Rexamine/stringi/archive/master.zip && \
unzip master.zip && \
R CMD INSTALL stringi-master && \
R CMD INSTALL magrittr_1.5.tar.gz && \
R CMD INSTALL stringr_1.2.0.tar.gz && \
R CMD INSTALL evaluate_0.10.tar.gz && \
R CMD INSTALL knitr_1.15.1.tar.gz
#matplotlib
RUN apt-get install -y python-matplotlib && \
mkdir -p ~/.config/matplotlib && \
echo "backend : Agg" >> ~/.config/matplotlib/matplotlibrc
# zeppelin
RUN wget http://www.apache.org/dist/zeppelin/zeppelin-0.7.0/zeppelin-0.7.0-bin-all.tgz
RUN tar xzvf zeppelin-0.7.0-bin-all.tgz
WORKDIR /zeppelin-0.7.0-bin-all
ADD zeppelin-env.sh conf/zeppelin-env.sh
CMD ["bin/zeppelin.sh", "start"]
@jshenguru Did you try it?
@eolix I haven't yet. I am thinking about making three more Mesos Docker images: one with the full set of R packages, one with the full set of Python packages, and a last one with both R and Python packages. I will "steal" things from your example; I hope you don't feel that I am competing with you. To keep the image size smaller, I plan to start from scratch with a smaller Ubuntu base image.
@jshenguru compete? Absolutely not! I'd love to finally see zeppelin 0.7.x with R on the DCOS universe!