[ZEPPELIN-4409] Set spark.app.name to know which user is running which notebook
What is this PR for?
This PR sets `spark.app.name` to "Zeppelin Notebook xxxxx for user test", which makes it easy to see which notebooks are currently in use and how many resources are being allocated by each user and notebook.
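For illustration, a minimal sketch of the idea, not the PR's actual diff; the accessor names passed in are assumptions about how the note ID and user would be obtained:

```java
import org.apache.spark.SparkConf;

// Hedged sketch, not the PR's actual code: derive a per-user, per-note app name
// before the SparkContext is created.
public class AppNameSketch {
  static SparkConf withNotebookAppName(SparkConf conf, String noteId, String user) {
    // e.g. "Zeppelin Notebook 2ENBYUKC2 for user amandeep.kaur"
    return conf.setAppName("Zeppelin Notebook " + noteId + " for user " + user);
  }
}
```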
What type of PR is it?
Feature
What is the Jira issue?
https://issues.apache.org/jira/browse/ZEPPELIN-4409
How should this be tested?
- Create a notebook with the Spark interpreter
- Run `println("hello")`
- The Spark app name will be set to the user name and note ID
Screenshots (if appropriate)
Questions:
- Do the license files need to be updated? No
- Are there breaking changes for older versions? No
- Does this need documentation? No
Thanks for the contribution @amakaur
It makes sense to include the user name and note ID in the Spark app name. But one easier approach is to set `spark.app.name` in `SparkInterpreterLauncher`: you can set `spark.app.name` to the `interpreterGroupId`, which consists of the note ID and user name. Here's the logic for how `interpreterGroupId` is generated: https://github.com/apache/zeppelin/blob/master/zeppelin-zengine/src/main/java/org/apache/zeppelin/interpreter/InterpreterSetting.java#L406
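For readers following along, a hedged paraphrase of that logic; the real code lives behind the link above, and the field and flag names here are illustrative, not the exact Zeppelin source:

```java
// Illustrative paraphrase of how the group id is assembled under the
// different isolation modes; not the exact Zeppelin source.
static String interpreterGroupId(String settingId, String user, String noteId,
                                 boolean perUserIsolated, boolean perNoteIsolated) {
  StringBuilder id = new StringBuilder(settingId);  // e.g. "spark"
  if (perUserIsolated) {
    id.append("-").append(user);                    // e.g. "spark-amandeep.kaur"
  }
  if (perNoteIsolated) {
    id.append("-").append(noteId);                  // e.g. "spark-2ENBYUKC2"
  }
  return id.toString();
}
```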
Thanks for the suggestion @zjffdu, I tried setting `spark.app.name` in `SparkInterpreterLauncher#buildEnvFromProperties`, both through the env variable `SPARK_APP_NAME` and via the Spark property `spark.app.name`, but it didn't seem to work. It sets `spark.app.name` to the default value "Zeppelin".
@amakaur How do you set that? Can you show me the code?
@zjffdu Here are the two ways that I tried: either by setting `SPARK_APP_NAME`, or by setting the Spark property `spark.app.name`:
```java
public Map<String, String> buildEnvFromProperties(InterpreterLaunchContext context) throws IOException {
  Map<String, String> env = super.buildEnvFromProperties(context);
  Properties sparkProperties = new Properties();
  String sparkMaster = getSparkMaster(properties);
  for (String key : properties.stringPropertyNames()) {
    if (RemoteInterpreterUtils.isEnvString(key)) {
      env.put(key, properties.getProperty(key));
    }
    if (isSparkConf(key, properties.getProperty(key))) {
      sparkProperties.setProperty(key, toShellFormat(properties.getProperty(key)));
    }
  }
  env.put("SPARK_APP_NAME", context.getInterpreterGroupId() + "_testing");
  //sparkProperties.setProperty("spark.app.name", context.getInterpreterGroupId() + "_testing");
  setupPropertiesForPySpark(sparkProperties);
  setupPropertiesForSparkR(sparkProperties);
  if (isYarnMode() && getDeployMode().equals("cluster")) {
    env.put("ZEPPELIN_SPARK_YARN_CLUSTER", "true");
    sparkProperties.setProperty("spark.yarn.submit.waitAppCompletion", "false");
  }
  StringBuilder sparkConfBuilder = new StringBuilder();
  if (sparkMaster != null) {
    sparkConfBuilder.append(" --master " + sparkMaster);
  }
  if (isYarnMode() && getDeployMode().equals("cluster")) {
    if (sparkProperties.containsKey("spark.files")) {
      sparkProperties.put("spark.files", sparkProperties.getProperty("spark.files") + "," +
          zConf.getConfDir() + "/log4j_yarn_cluster.properties");
    } else {
      sparkProperties.put("spark.files", zConf.getConfDir() + "/log4j_yarn_cluster.properties");
    }
    sparkProperties.put("spark.yarn.maxAppAttempts", "1");
  }
```
@amakaur `SPARK_APP_NAME` is not valid; you need to use `spark.app.name`, which is what Spark expects (refer here). I tried that, and it works for me.
@zjffdu Thanks for trying it out. Let me try again with `spark.app.name`; I had `spark.app.name` commented out in the code I posted just to show the two different ways I was using to get it to work. https://github.com/apache/zeppelin/blob/master/spark/interpreter/src/main/resources/interpreter-setting.json#L30
```java
public Map<String, String> buildEnvFromProperties(InterpreterLaunchContext context) throws IOException {
  Map<String, String> env = super.buildEnvFromProperties(context);
  Properties sparkProperties = new Properties();
  String sparkMaster = getSparkMaster(properties);
  for (String key : properties.stringPropertyNames()) {
    if (RemoteInterpreterUtils.isEnvString(key)) {
      env.put(key, properties.getProperty(key));
    }
    if (isSparkConf(key, properties.getProperty(key))) {
      sparkProperties.setProperty(key, toShellFormat(properties.getProperty(key)));
    }
  }
  //env.put("SPARK_APP_NAME", context.getInterpreterGroupId() + "_testing");
  sparkProperties.setProperty("spark.app.name", context.getInterpreterGroupId() + "_testing");
  setupPropertiesForPySpark(sparkProperties);
  setupPropertiesForSparkR(sparkProperties);
  if (isYarnMode() && getDeployMode().equals("cluster")) {
    env.put("ZEPPELIN_SPARK_YARN_CLUSTER", "true");
    sparkProperties.setProperty("spark.yarn.submit.waitAppCompletion", "false");
  }
  StringBuilder sparkConfBuilder = new StringBuilder();
  if (sparkMaster != null) {
    sparkConfBuilder.append(" --master " + sparkMaster);
  }
  if (isYarnMode() && getDeployMode().equals("cluster")) {
    if (sparkProperties.containsKey("spark.files")) {
      sparkProperties.put("spark.files", sparkProperties.getProperty("spark.files") + "," +
          zConf.getConfDir() + "/log4j_yarn_cluster.properties");
    } else {
      sparkProperties.put("spark.files", zConf.getConfDir() + "/log4j_yarn_cluster.properties");
    }
    sparkProperties.put("spark.yarn.maxAppAttempts", "1");
  }
```
Still not working for me, but I will debug to see if I can find something:
```java
public Map<String, String> buildEnvFromProperties(InterpreterLaunchContext context) throws IOException {
  Map<String, String> env = super.buildEnvFromProperties(context);
  Properties sparkProperties = new Properties();
  String sparkMaster = getSparkMaster(properties);
  for (String key : properties.stringPropertyNames()) {
    if (RemoteInterpreterUtils.isEnvString(key)) {
      env.put(key, properties.getProperty(key));
    }
    if (isSparkConf(key, properties.getProperty(key))) {
      sparkProperties.setProperty(key, toShellFormat(properties.getProperty(key)));
    }
  }
  sparkProperties.setProperty("spark.app.name", context.getInterpreterGroupId());
  setupPropertiesForPySpark(sparkProperties);
  setupPropertiesForSparkR(sparkProperties);
  if (isYarnMode() && getDeployMode().equals("cluster")) {
    env.put("ZEPPELIN_SPARK_YARN_CLUSTER", "true");
    sparkProperties.setProperty("spark.yarn.submit.waitAppCompletion", "false");
  }
  StringBuilder sparkConfBuilder = new StringBuilder();
  if (sparkMaster != null) {
    sparkConfBuilder.append(" --master " + sparkMaster);
  }
  if (isYarnMode() && getDeployMode().equals("cluster")) {
    if (sparkProperties.containsKey("spark.files")) {
      sparkProperties.put("spark.files", sparkProperties.getProperty("spark.files") + "," +
          zConf.getConfDir() + "/log4j_yarn_cluster.properties");
    } else {
      sparkProperties.put("spark.files", zConf.getConfDir() + "/log4j_yarn_cluster.properties");
    }
    sparkProperties.put("spark.yarn.maxAppAttempts", "1");
  }
```
Try this branch, which works for me: https://github.com/zjffdu/zeppelin/commit/47a50c0ae73c9fc8c5441572cee0d9164106a60f
And set `spark.app.name` to empty on the interpreter setting page first.
Sure, let me try.
After debugging, I realized that the reason my changes weren't working was that `spark.app.name` was being overridden by the default value set in `interpreter-setting.json`.
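For context, the default comes from the Spark interpreter's `interpreter-setting.json` linked earlier; from memory the entry looks roughly like this, so treat the exact fields as an assumption rather than a quote of the file:

```json
"spark.app.name": {
  "envName": "SPARK_APP_NAME",
  "propertyName": "spark.app.name",
  "defaultValue": "Zeppelin",
  "description": "The name of spark application.",
  "type": "string"
}
```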
I checked out your branch and got the following result.
Awesome, would you update this PR and add a unit test?
BTW, is the above UI some kind of interpreter process monitoring page? It is pretty useful for Zeppelin; do you have a plan to contribute it to Zeppelin?
Yup, will do once I figure out why the user name is `spark` instead of an actual user name such as `amandeep.kaur`.
I'm running Zeppelin in Mesos mode. The above is the Mesos UI. Pretty useful.
@zjffdu After debugging a bit, this new approach might not work for me because the user is set as `spark` instead of the LDAP user.
@amakaur The user should be the Zeppelin login user name. Do you have LDAP enabled in Zeppelin?
Yeah, LDAP is enabled. It works as expected with the original solution.
Oh, you might have forgotten one thing: you need to set the Spark interpreter as per-user isolated.
I have it set as per-note in isolated process.
It still doesn't work for you in per-user isolated mode? What do you see in the Spark app name?
`spark.app.name` is set as `spark-2ESRB4NZ1`. The following is an example from the logs:

`Create Session: shared_session in InterpreterGroup: spark-2ENBYUKC2 for user: amandeep.kaur`

where the user is the LDAP user, BUT the `interpreterGroupId` contains `spark` as the user value. I would like to set `spark.app.name` to `amandeep.kaur-2ENBYUKC2`.
Ah, you are using per-note isolated; you should use per-user isolated, in which case the `interpreterGroupId` will include the user name.
I'm not sure if I'll be able to use per-user isolated for Zeppelin in prod, because we would like each note to have its own JVM. Would using per-user isolated create a single shared interpreter process for all of the notebooks under the same user?
In that case, I think you can use per-user isolated & per-note isolated. Then the Spark app name would be `interpreter_name` + `user_name` + `noteId`.
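For example (illustrative, following the naming logic above), the resulting app name would look like `spark-amandeep.kaur-2ENBYUKC2`.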
> Try this branch, which works for me: zjffdu@47a50c0
> And set `spark.app.name` to empty on the interpreter setting page first.
@zjffdu Is this branch enough to achieve the goal?
@iamabug I don't think so; check my comments above: https://github.com/apache/zeppelin/pull/3498#issuecomment-548179978. I believe the right approach is to update `SparkInterpreterLauncher`.
@zjffdu I believe the commit I mentioned does update `SparkInterpreterLauncher`:
```java
if (!sparkProperties.containsKey("spark.app.name") ||
    StringUtils.isBlank(sparkProperties.getProperty("spark.app.name"))) {
  sparkProperties.setProperty("spark.app.name", context.getInterpreterGroupId());
}
```
Here `spark.app.name` is set to the `interpreterGroupId`, which is generated the way we want. Am I right?
Yes, though maybe it is better to add the prefix `Zeppelin_`.
And this is specific to the Spark interpreter; I am not sure whether we need to consider other interpreters, such as Flink.
If Flink has such a configuration, we could do it there as well, but that could be done in another PR.
If `spark.app.name` is already set, should we leave it as it is, or add the username and note ID as well? If we leave it, all users and notes will still have the same name, and I don't think that's what we want to see. So I think the logic here could be like this (sketched below):

- If `spark.app.name` is not set or is blank, we set it to `Zeppelin_` + `interpreter_name` + `username` + `noteId`; `interpreter_name` is actually `spark`, so it becomes `Zeppelin_spark_` + `username` + `noteId`.
- If `spark.app.name` is already set to a string, denoted `APP_NAME`, we set it to `APP_NAME` + `username` + `noteId`.

@zjffdu
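A minimal sketch of that proposed rule in `SparkInterpreterLauncher` terms; how the username and note ID reach this method, and the `_` separator, are assumptions, not a committed implementation:

```java
import java.util.Properties;
import org.apache.commons.lang3.StringUtils;

// Hedged sketch of the naming policy proposed above.
class AppNamePolicySketch {
  static void applyAppName(Properties sparkProperties, String userName, String noteId) {
    String current = sparkProperties.getProperty("spark.app.name");
    String suffix = "_" + userName + "_" + noteId;  // separator is an assumption
    if (StringUtils.isBlank(current)) {
      // Not set or blank: Zeppelin_ + interpreter_name + username + noteId
      sparkProperties.setProperty("spark.app.name", "Zeppelin_spark" + suffix);
    } else {
      // Already set to APP_NAME: append username and noteId
      sparkProperties.setProperty("spark.app.name", current + suffix);
    }
  }
}
```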