sparkmagic
SparkMagic and Livy user impersonation
Background
When a Spark notebook is executed in Jupyter, SparkMagic sends code (via the REST API) to Livy, which then creates a Spark job and submits it to a YARN cluster for execution. Ordinarily, YARN jobs submitted this way run as user livy, but many enterprise organizations want Jupyter users to be impersonated in Livy. This can be achieved by enabling Livy impersonation and adding the proxyUser property to the SparkMagic configuration for each user that needs to be impersonated.
"session_configs": {
    "driverMemory": "1000M",
    "executorCores": 2,
    "proxyUser": "bob"
},
The result of this config change is that if bob is the notebook instance user, bob is now also the user running the YARN application.
Application-Id Application-Name Application-Type User Queue
application_1526925378944_0005 livy-session-1 SPARK bob default
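To make the flow above concrete, here is a rough sketch of what a session-creation request body for Livy's POST /sessions endpoint might look like once proxyUser is part of session_configs. The helper name build_livy_session_request is hypothetical and only for illustration; it is not SparkMagic's actual code, and the default kind of "pyspark" is an assumption.

```python
import json


def build_livy_session_request(session_configs, kind="pyspark"):
    """Hypothetical helper: merge session_configs into a Livy POST /sessions body.

    This only illustrates the shape of the request; it does not send anything
    and is not how SparkMagic is necessarily implemented.
    """
    body = {"kind": kind}
    # Values from session_configs (including proxyUser) are passed through as-is.
    body.update(session_configs)
    return body


session_configs = {
    "driverMemory": "1000M",
    "executorCores": 2,
    "proxyUser": "bob",
}
body = build_livy_session_request(session_configs)
print(json.dumps(body, indent=2))
```

With impersonation enabled on the Livy side, a body like this asks Livy to run the session as bob rather than as livy.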
Proposal
Since the proxyUser value cannot be known a priori, it must be set individually for every user in their SparkMagic config JSON. This is not ideal because it increases configuration complexity for a multi-user enterprise, which must inject this property whenever a new user is added to the system.
I'm proposing that SparkMagic support user impersonation by default, meaning that when it creates a new Livy session it always sends proxyUser to Livy, with its value set to the user name of the process SparkMagic is running in. This avoids configuration complexity for users and makes SparkMagic more amenable to enterprise use. An administrator can always explicitly set a value for proxyUser in the session_configs JSON object, and that will take precedence over the proposed default behavior of using the OS user name for impersonation.
I envision this to be a low-complexity change in /sparkmagic/utils/configuration.py, combined with a configuration property "livy_user_impersonation": true|false. For NO_AUTH it sends the user of the current process as the proxyUser.
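As a minimal sketch of the proposed behavior (not existing SparkMagic code), the defaulting logic could look something like the following. The function name apply_default_proxy_user and the get_user parameter are hypothetical; livy_user_impersonation mirrors the proposed configuration property, and the OS user name is assumed to come from Python's getpass.getuser().

```python
import getpass


def apply_default_proxy_user(session_configs, livy_user_impersonation=True,
                             get_user=getpass.getuser):
    """Hypothetical sketch: default proxyUser to the OS user of this process.

    An explicitly configured proxyUser always takes precedence. Nothing is
    added when the (proposed) livy_user_impersonation flag is false.
    """
    configs = dict(session_configs)  # copy so the caller's dict is not mutated
    if livy_user_impersonation and not configs.get("proxyUser"):
        configs["proxyUser"] = get_user()
    return configs


# An explicit value set by an administrator is kept unchanged.
explicit = apply_default_proxy_user({"proxyUser": "bob"})
print(explicit["proxyUser"])
```

Treating an empty proxyUser the same as a missing one is a design choice in this sketch, so that a blank value falls back to the OS user instead of being sent as-is.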
Happy to get some feedback on this proposal.
This is also an issue when you launch an EMR cluster with Jupyter. When you set the following Livy configuration for the cluster, a SparkSession cannot be created in Jupyter:
configs:
[
{
"Classification": "livy-conf",
"Properties": {
"livy.impersonation.enabled": "true"
}
},
{
"Classification": "core-site",
"Properties": {
"hadoop.proxyuser.livy.groups": "*",
"hadoop.proxyuser.livy.hosts": "*"
}
}
]
error:
unexpected parameter proxyUser can not be empty.
Hey! I currently need this change. How can I help with this issue?