
[ZEPPELIN-5760] fix passing configs to spark in k8s

Open zlosim opened this issue 3 years ago • 12 comments

What is this PR for?

Some important Spark configs, including spark.jars.packages and spark.driver.extraJavaOptions, can't be passed via SparkConf in k8s in client mode. This fix checks whether these are set in the interpreter configuration and passes them to the spark-submit command.
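Conceptually, the change works along these lines (a minimal Java sketch with illustrative names, not the actual patch):

```java
// Hypothetical sketch of the idea behind this fix: configs that must reach
// spark-submit before the driver JVM starts are lifted out of the interpreter
// properties and appended to the submit options, instead of being left to
// SparkConf at runtime (which is too late in k8s client mode).
import java.util.Properties;

public class SparkSubmitOptionsSketch {

  static String buildSubmitOptions(Properties intpProperties) {
    StringBuilder options = new StringBuilder();
    // Dependencies must be resolved before the driver starts.
    String packages = intpProperties.getProperty("spark.jars.packages");
    if (packages != null && !packages.isEmpty()) {
      options.append(" --packages ").append(packages);
    }
    // Driver JVM options only take effect if they are part of the command
    // that launches the driver JVM.
    String driverJavaOpts = intpProperties.getProperty("spark.driver.extraJavaOptions");
    if (driverJavaOpts != null && !driverJavaOpts.isEmpty()) {
      options.append(" --driver-java-options \"").append(driverJavaOpts).append("\"");
    }
    return options.toString().trim();
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("spark.jars.packages", "org.apache.kafka:kafka-clients:3.2.0");
    props.setProperty("spark.driver.extraJavaOptions", "-Duser.timezone=UTC");
    System.out.println(buildSubmitOptions(props));
  }
}
```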

What type of PR is it?

Bug Fix

What is the Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-5760

How should this be tested?

The existing test was updated to cover this fix.

Questions:

  • Do the license files need to be updated? No
  • Are there breaking changes for older versions? No
  • Does this need documentation? No

zlosim avatar Jul 07 '22 13:07 zlosim

I think most Spark configuration values can be included in the environment variable SPARK_SUBMIT_OPTIONS. Please note that the K8sProcess works for all Zeppelin interpreters.

Reamer avatar Jul 10 '22 17:07 Reamer

Hi @Reamer, I'm sorry, but I'm not sure I understand you correctly. As you stated, Spark configs can be included in SPARK_SUBMIT_OPTIONS, but they were not. This fix specifically adds these configs to SPARK_SUBMIT_OPTIONS. There are also two Spark configs that can't be passed in SPARK_SUBMIT_OPTIONS because of some functionality in 'interpreter.sh'; those are passed in 'SPARK_DRIVER_EXTRAJAVAOPTIONS_CONF' instead.

Also, this fix changes behavior for Spark only. A rough sketch of the env var handling follows.
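To illustrate the split described above, here is a hedged sketch (hypothetical helper, not the actual code) of how a config that interpreter.sh treats specially could be exported as a pod env var instead of being appended to SPARK_SUBMIT_OPTIONS:

```java
// Illustrative sketch: a driver config that interpreter.sh handles itself is
// exposed as a dedicated environment variable on the interpreter pod, so it
// must not also appear in SPARK_SUBMIT_OPTIONS.
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class DriverEnvSketch {

  static Map<String, String> buildPodEnv(Properties intpProperties) {
    Map<String, String> env = new HashMap<>();
    String driverOpts = intpProperties.getProperty("spark.driver.extraJavaOptions");
    if (driverOpts != null) {
      // Presumably consumed by interpreter.sh when it assembles the driver
      // JVM options (assumption based on the discussion above).
      env.put("SPARK_DRIVER_EXTRAJAVAOPTIONS_CONF", driverOpts);
    }
    return env;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("spark.driver.extraJavaOptions", "-XX:+UseG1GC");
    System.out.println(buildPodEnv(props));
  }
}
```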

zlosim avatar Jul 11 '22 12:07 zlosim

Hi @zlosim, as you can see in the following lines, SPARK_SUBMIT_OPTIONS is concatenated with the Spark configurations specifically for Kubernetes. https://github.com/apache/zeppelin/blob/8995ff227a44760b7dbb005a85bee4ac7772e1f2/zeppelin-plugins/launcher/k8s-standard/src/main/java/org/apache/zeppelin/interpreter/launcher/K8sRemoteInterpreterProcess.java#L323-L325

Your configuration values like spark.driver.extraClassPath and spark.driver.extraLibraryPath should have no effect, because Zeppelin always starts Spark on Kubernetes in client mode. https://github.com/apache/zeppelin/blob/8995ff227a44760b7dbb005a85bee4ac7772e1f2/zeppelin-plugins/launcher/k8s-standard/src/main/java/org/apache/zeppelin/interpreter/launcher/K8sRemoteInterpreterProcess.java#L410

You can also set SPARK_DRIVER_EXTRAJAVAOPTIONS_CONF directly in your Spark Zeppelin interpreter configuration if you want to override the (possibly empty) default setting.

In general, I recommend creating a ConfigMap that contains all Spark configurations and including it in the containers via a custom interpreter-spec.yaml in your Spark Zeppelin interpreter.

Reamer avatar Jul 11 '22 13:07 Reamer

Hi @Reamer, thanks for the clarification. Yes, one can set env vars and create a custom interpreter-spec, but that is not what this fix is trying to do, and it is not at feature parity with Spark on YARN mode. Let me provide an example to better explain myself: when using Zeppelin in a multi-tenant environment with Spark on a Hadoop cluster, each user can set their own dependencies and driver configurations via inline configuration or the interpreter settings page, and everything works as expected. On the other hand, when setting dependencies or driver configs in k8s, these are ignored, and the only option is to set them via SPARK_SUBMIT_OPTIONS as you mentioned, BUT:

  • this value is fixed for every user, and sometimes users would like different sets of dependencies and driver configurations
  • as stated in the Zeppelin docs: "To be noticed, SPARK_SUBMIT_OPTIONS is deprecated and will be removed in future release."
  • in my opinion, with this fix we can give users the same experience with Spark running in k8s as when running on a Hadoop cluster or in standalone mode

Your configuration values like spark.driver.extraClassPath and spark.driver.extraLibraryPath should have no effect, because Zeppelin always starts Spark on Kubernetes in client mode.

Yes, it is running in client mode, but the driver is not started yet, so there is a possibility to pass these options to the driver at startup so they can take effect. We are doing the same in SparkInterpreterLauncher (sketched below).
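A minimal sketch of that point, assuming a simplified launch path (the flags are real spark-submit flags, the helper is hypothetical): in client mode the driver JVM is the process spark-submit starts, so driver-side options only take effect if they are part of the command that launches it.

```java
import java.util.ArrayList;
import java.util.List;

public class ClientModeLaunchSketch {

  // Build the spark-submit command for client mode. Driver options must be
  // injected here, before the driver JVM exists; setting them later through
  // SparkConf cannot change an already-running JVM.
  static List<String> sparkSubmitCommand(String extraJavaOptions, String extraClassPath) {
    List<String> cmd = new ArrayList<>();
    cmd.add("spark-submit");
    cmd.add("--deploy-mode");
    cmd.add("client");
    if (extraJavaOptions != null) {
      cmd.add("--driver-java-options");
      cmd.add(extraJavaOptions);
    }
    if (extraClassPath != null) {
      cmd.add("--driver-class-path");
      cmd.add(extraClassPath);
    }
    // ... application jar and arguments would follow here
    return cmd;
  }

  public static void main(String[] args) {
    System.out.println(String.join(" ",
        sparkSubmitCommand("-Duser.timezone=UTC", "/opt/extra/jars/*")));
  }
}
```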

You can also set SPARK_DRIVER_EXTRAJAVAOPTIONS_CONF directly in your Spark Zeppelin interpreter configuration if you want to override the (possibly empty) default setting.

I'm sorry, I was not aware of that. I know we can set a few params with env vars, but I can't find how to set an env var from the interpreter configuration.

zlosim avatar Jul 11 '22 17:07 zlosim

I understand your motivation. I just don't like the explicit list of Spark configurations at this point, because it can become very long. Spark has a great many configuration options, and we can't map them all again here in Zeppelin. I am looking for a more generic approach. @zjffdu, if SPARK_SUBMIT_OPTIONS is deprecated, how can we provide this functionality?

Reamer avatar Jul 15 '22 06:07 Reamer

@zlosim SparkInterpreterLauncher supports all the Spark configurations. It detects all the Spark configurations and concatenates them into the environment variable ZEPPELIN_SPARK_CONF. I think we can do a similar thing in K8sRemoteInterpreterProcess.

The reason we deprecated SPARK_SUBMIT_OPTIONS is that we would like to encourage users to configure Spark in Zeppelin via a unified approach, like Spark itself (just configure each property one by one); otherwise it may cause inconsistency (e.g. a Zeppelin admin may configure spark.driver.memory via SPARK_SUBMIT_OPTIONS in zeppelin-env.sh while another user also defines it via inline configuration, and it is hard for Zeppelin to detect this kind of conflict). So even ZEPPELIN_SPARK_CONF in SparkInterpreterLauncher is for internal usage; it is not exposed to Zeppelin users.

zjffdu avatar Jul 19 '22 02:07 zjffdu

Hi @zjffdu, thanks for the response. To be honest, I think this fix does exactly what you say you want to achieve. Right now, setting e.g. Spark dependencies via the Zeppelin interpreter settings does not work in k8s, because we try to set them at runtime, which is too late as the driver is already started. This PR picks the settings passed by the user in the interpreter settings that have to go directly to the spark-submit command when creating the k8s pod with the driver, and puts them in an env var so the interpreter.sh script starts the driver correctly. For the end user this is what we are trying to achieve: they set their settings via the Zeppelin interpreter settings and everything works as expected.

As for SPARK_SUBMIT_OPTIONS and the possibility of a Zeppelin admin altering configs: we can adjust how the value is appended and ignored, effectively leaving SPARK_SUBMIT_OPTIONS unexposed to Zeppelin users, the same as ZEPPELIN_SPARK_CONF.

zlosim avatar Jul 19 '22 08:07 zlosim

@zlosim How about updating buildSparkSubmitOptions to match what we did in SparkInterpreterLauncher: just collect all the Spark properties (those starting with the prefix spark.), concatenate them into ZEPPELIN_SPARK_CONF, and pass that to interpreter.sh, instead of collecting each individual Spark driver property explicitly.
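A hedged sketch of that generic approach (illustrative, not the exact SparkInterpreterLauncher code; the delimiter is an assumption):

```java
// Collect every property whose key starts with "spark." and concatenate them
// into one env variable that interpreter.sh can expand into --conf key=value
// pairs, instead of enumerating individual driver properties.
import java.util.Properties;
import java.util.stream.Collectors;

public class ZeppelinSparkConfSketch {

  static String buildZeppelinSparkConf(Properties intpProperties) {
    return intpProperties.stringPropertyNames().stream()
        .filter(key -> key.startsWith("spark."))
        .map(key -> "--conf " + key + "=" + intpProperties.getProperty(key))
        .collect(Collectors.joining("|"));  // the "|" delimiter is an assumption
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("spark.jars.packages", "org.postgresql:postgresql:42.3.6");
    props.setProperty("spark.driver.memory", "2g");
    props.setProperty("zeppelin.spark.useNew", "true");  // filtered out: no spark. prefix
    System.out.println(buildZeppelinSparkConf(props));
  }
}
```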

zjffdu avatar Jul 19 '22 08:07 zjffdu

Hi @zjffdu, sorry for the late response. I rewrote it so we are now using ZEPPELIN_SPARK_CONF. Please let me know if there is anything I can do better.

zlosim avatar Jul 25 '22 08:07 zlosim

@zlosim Any update?

zjffdu avatar Aug 10 '22 01:08 zjffdu

Hi @zjffdu, I think this PR is ready to merge unless there are any other requests.

zlosim avatar Aug 11 '22 16:08 zlosim

I would like to test this beforehand in my local K8s cluster, but I am still on vacation right now. Please have a little patience.

Reamer avatar Aug 12 '22 06:08 Reamer

Hi @zlosim, sorry for the late reply. I tried to test your branch. Currently, I am getting a startup error because I have already fully transitioned my environment to Spark 3.3. Could you please rebase onto master?

Reamer avatar Sep 23 '22 14:09 Reamer

Hi @Reamer, no prob, it should be rebased now.

zlosim avatar Sep 27 '22 08:09 zlosim