
fix kernel generation for Spark Yarn // TOREE-97

Open ribamar-santarosa opened this issue 8 years ago • 7 comments

It looks like the TOREE-97 issue -- support for Spark on YARN -- was closed without a definitive solution (or something went wrong along the way). Toree does support it, but it won't work unless the user manually adds the HADOOP_CONF_DIR env var to their kernel.json definition. Without that env var, Spark doesn't know what to do with the option --master=yarn (set in __TOREE_SPARK_OPTS__). It would be desirable to have it set by default, and this patch provides that functionality.
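
For illustration, here is roughly what a working kernel.json has to look like today after the manual edit; everything under env besides __TOREE_SPARK_OPTS__ uses placeholder paths for this sketch, not values Toree currently generates:

```json
{
  "display_name": "Apache Toree - Scala",
  "language": "scala",
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
    "--profile",
    "{connection_file}"
  ],
  "env": {
    "__TOREE_SPARK_OPTS__": "--master=yarn",
    "SPARK_HOME": "/opt/spark",
    "SPARK_CONF_DIR": "/opt/spark/conf",
    "HADOOP_CONF_DIR": "/etc/hadoop/conf"
  }
}
```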

Probably this is not the nicest way to solve the problem, because it just hard-codes more vars into the JSON file -- ideally there would be an interface to add or remove env vars from those files. However, HADOOP_CONF_DIR and SPARK_CONF_DIR seem basic enough to be exported. Even for a Spark Standalone deployment, HADOOP_CONF_DIR won't hurt. So, here are our 2 cents to improve the situation a bit.

I cloned TOREE-97 into TOREE-438 to track this issue.

ribamar-santarosa avatar Sep 12 '17 13:09 ribamar-santarosa

There is a failure on the CI that doesn't look related to that patch:

```
failed to register layer: Error processing tar file(exit status 1): write /opt/conda/envs/python2/lib/python2.7/site-packages/Cython/Compiler/Code.so: no space left on device
```

Unless the paths of those 2 env vars are so long that writing them is consuming all the storage! =)

ribamar-santarosa avatar Sep 12 '17 13:09 ribamar-santarosa

It would also be useful to export JAVA_HOME, in case one does not want to use the default JVM but a specific release.
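
Along the same lines, the env block in kernel.json could carry JAVA_HOME too; the JDK path below is invented for the example:

```json
{
  "env": {
    "__TOREE_SPARK_OPTS__": "--master=yarn",
    "HADOOP_CONF_DIR": "/etc/hadoop/conf",
    "JAVA_HOME": "/usr/lib/jvm/java-8-openjdk"
  }
}
```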

lammic avatar Sep 12 '17 15:09 lammic

What is the difference with setting HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh? More generally, why manage system-wide configuration in the kernelspec?

lresende avatar Sep 12 '17 18:09 lresende

Good question. So, it's clear that, if the Spark configuration files aren't in the default location, the user needs to be able to tell Spark where they are -- for this, SPARK_CONF_DIR.

HADOOP_CONF_DIR is a bit trickier, because the standard assumption is that Spark on YARN is tied to a single Hadoop instance -- so spark-env.sh suffices. But, like anything in computing, somebody will try to expand a 1-1 relationship to 1-N: we can run module load another_instance_of_hadoop, which dynamically overwrites HADOOP_CONF_DIR. Then we can go and install a Toree kernel for that Hadoop-Spark tuple.

ribamar-santarosa avatar Sep 13 '17 10:09 ribamar-santarosa

Maybe I am misunderstanding this: while with vanilla Toree you have the option to start your own local Spark (e.g. local[...]), in an enterprise environment where there is a large Spark cluster managed by YARN you just need to connect to it, so all of this configuration is managed by the Spark and Hadoop configuration files directly. Also, most enterprise deployments are based on some distribution that includes many other components, and we don't want to, and don't need to, make Toree aware of them.

Anyway, what is the scenario you are trying to accomplish with these changes?

lresende avatar Sep 13 '17 14:09 lresende

", thus all these configuration being managed by spark and Hadoop configuration files directly." sure, but how do you tell Hadoop and Spark where to find the env.sh file, if they're not in the default location? With SPARK_CONF_DIR and HADOOP_CONF_DIR.

The scenario is very simple: in Bright Cluster Manager, users can have many Hadoop instances and many Spark instances, and they're able to connect any of those Spark instances to the YARN of any of those Hadoop instances. If there are many instances, there are many configuration files, so they cannot all be in the default location, right? We could have one Jupyter/JupyterHub/Toree deployment per Spark instance (like we do with other tools). But things are much simpler: we just need to add one kernel.json per Spark instance, each with a different SPARK_CONF_DIR and HADOOP_CONF_DIR. If those variables weren't there, all of those kernel.json files would be identical, and then how would you tell which one to use?
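
To make the one-kernelspec-per-instance idea concrete, the env block of such a kernel.json might look like the sketch below (instance names and paths are invented for the example), with a sibling kernel.json for instance-b that is identical except for the two conf dirs:

```json
{
  "display_name": "Apache Toree - Scala (spark-instance-a on hadoop-instance-a)",
  "env": {
    "__TOREE_SPARK_OPTS__": "--master=yarn",
    "SPARK_CONF_DIR": "/cm/shared/apps/spark-instance-a/conf",
    "HADOOP_CONF_DIR": "/cm/shared/apps/hadoop-instance-a/conf"
  }
}
```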

Indeed, this PR is not really a requirement for us -- during our integration process we already update the JSON files to contain those variables, so our users already benefit from having a Toree kernel per Spark instance. We are just trying to give back our 2 cents in case there are vanilla Toree users trying to achieve the same without our product.

ribamar-santarosa avatar Sep 13 '17 14:09 ribamar-santarosa

@ribamar-santarosa Thanks for the explanation, I believe I understand the scenario now and have just minor comments on the changes.

lresende avatar Sep 13 '17 16:09 lresende