SparkMagic and Livy user impersonation

Open ranjitiyer opened this issue 6 years ago • 2 comments

Background

When a Spark notebook is executed in Jupyter, SparkMagic sends code (via REST API) to Livy, which then creates a Spark job and submits it to a YARN cluster for execution. Ordinarily, YARN jobs submitted this way run as user livy, but many enterprise organizations want Livy to impersonate the Jupyter user instead. This can be achieved by enabling Livy impersonation and adding the proxyUser property to the SparkMagic configuration of each user who needs to be impersonated.

  "session_configs": {
    "driverMemory": "1000M",
    "executorCores": 2,
    "proxyUser": "bob"
  },

The result of this config change is that if bob is the Notebook instance user, they are now also the user running the YARN application.

Application-Id	    Application-Name	    Application-Type	      User	     Queue
application_1526925378944_0005	      livy-session-1	               SPARK	       bob	   default
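
As a rough illustration of the flow described above, the sketch below builds the JSON body for Livy's `POST /sessions` call from a `session_configs` dictionary. The function name `build_session_request` is hypothetical, and how SparkMagic actually assembles this body internally is an assumption here; only the field names (`kind`, `proxyUser`, `driverMemory`, `executorCores`) come from the Livy REST API.

```python
import json


def build_session_request(session_configs, kind="pyspark"):
    """Return a JSON body for Livy's POST /sessions (hypothetical helper).

    Assumption: every key in session_configs is forwarded unchanged,
    so a proxyUser entry ends up in the request sent to Livy.
    """
    body = {"kind": kind}
    body.update(session_configs)
    return json.dumps(body, sort_keys=True)


# The session_configs values from the example above.
session_configs = {
    "driverMemory": "1000M",
    "executorCores": 2,
    "proxyUser": "bob",
}

payload = build_session_request(session_configs)
print(payload)
```

With impersonation enabled in Livy, the `proxyUser` field in this body is what determines the user shown in the YARN application listing.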

Proposal

Since the proxyUser value cannot be known a priori, it must be set individually for every user in their SparkMagic config JSON. This is not ideal because it increases configuration complexity for a multi-user enterprise, which has to inject this property whenever a new user is added to the system.

I'm proposing that SparkMagic support user impersonation by default, meaning that when it creates a new Livy session it always sends proxyUser, set to the user name of the process SparkMagic is running in. This avoids configuration complexity for users and makes SparkMagic more amenable to enterprise use. An administrator can always explicitly set a value for proxyUser in the session_configs JSON object, and that will take precedence over the proposed default behavior of using the OS user name for impersonation.

I envision this as a low-complexity change in /sparkmagic/utils/configuration.py combined with a configuration property "livy_user_impersonation": true|false. For NO_AUTH it sends the user of the current process as the proxyUser.
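
A minimal sketch of the proposed precedence rule, assuming a hypothetical helper named `resolve_proxy_user` (not an existing SparkMagic function) and using `getpass.getuser()` as one possible way to read the OS user name of the current process:

```python
import getpass


def resolve_proxy_user(session_configs, livy_user_impersonation, current_user=None):
    """Return a copy of session_configs with proxyUser filled in if needed.

    Hypothetical helper illustrating the proposal:
    - flag off: configs are returned unchanged
    - explicit proxyUser set by an administrator: it takes precedence
    - otherwise: default to the OS user of the current process
    """
    configs = dict(session_configs)  # do not mutate the caller's dict
    if not livy_user_impersonation:
        return configs
    if configs.get("proxyUser"):
        return configs
    # current_user is injectable for testing; getpass.getuser() is the fallback.
    configs["proxyUser"] = current_user or getpass.getuser()
    return configs


base = {"driverMemory": "1000M", "executorCores": 2}

defaulted = resolve_proxy_user(base, True, current_user="bob")
explicit = resolve_proxy_user({**base, "proxyUser": "alice"}, True, current_user="bob")
disabled = resolve_proxy_user(base, False, current_user="bob")
```

Treating an empty `proxyUser` string the same as a missing one is a design choice of this sketch, not something the proposal specifies.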

Happy to get some feedback on this proposal.

ranjitiyer, May 21 '18 22:05

This is also an issue when you launch an EMR cluster with Jupyter. When you set the following Livy configuration for the cluster, a SparkSession cannot be created in Jupyter:

configs:

[
    {
      "Classification": "livy-conf",
      "Properties": {
        "livy.impersonation.enabled": "true"
      }
    },
    {
      "Classification": "core-site",
      "Properties": {
        "hadoop.proxyuser.livy.groups": "*",
        "hadoop.proxyuser.livy.hosts": "*"
      }
    }
]

error:

unexpected parameter proxyUser can not be empty.

maziyarpanahi, Feb 28 '21 10:02

Hey! I currently need this change. How can I help with this issue?

josechudev, Nov 22 '22 19:11