
pyspark should upload / import packages from local by default

Open · kdzhao opened this issue 3 years ago · 3 comments

Is your feature request related to a problem? Please describe.

When the pyspark kernel starts a connection to the Spark cluster (via Livy), it should load the packages installed in the local environment by default (or at least provide a way to specify them), so users can use these packages in the Spark session as well.

For example, in the PySpark kernel, if I do:

%%local
import matplotlib

It loads successfully. This is expected, because %%local runs the cell on the JupyterLab machine, where I have matplotlib installed.

But if I do:

import matplotlib

Starting Spark application
ID   YARN Application ID              Kind      State   Spark UI   Driver log   Current session?
32   application_1636505612345_0200   pyspark   idle    Link       Link         ✔
SparkSession available as 'spark'.
An error was encountered:
No module named 'matplotlib'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'matplotlib'

As we can see, it errors out: this time the cell runs on the Spark cluster, and it can't find the package there.

Describe the solution you'd like

People may ask: why not just install those packages on the Spark cluster? Well, most of the time end users don't have direct permission to do that. If there were a way for the pyspark kernel to upload the packages when it starts the Spark session, that would be really helpful! For example, a config set before starting the session, in which users can specify which packages to upload.
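For reference, sparkmagic's %%configure magic already passes session settings through to Livy, including pyFiles, but those paths have to be reachable by the cluster already (e.g. on HDFS or S3); nothing is uploaded from the local machine. A minimal sketch of how close you can get today, with a hypothetical S3 path:

```
%%configure -f
{
    "pyFiles": ["s3://my-bucket/deps/my_helpers.py"]
}
```

The missing piece this request asks for is sparkmagic shipping local files to a cluster-visible location before the session is created.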

kdzhao avatar Nov 11 '21 02:11 kdzhao

Hmmm, maybe what you're looking for is this poorly documented function referenced in the AWS EMR documentation: sc.install_pypi_package?
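If you're on an EMR notebook, usage looks roughly like this (the package and version are just examples; these helpers are EMR-specific extensions of the SparkContext, not standard PySpark):

```
# Runs in the remote Spark session on an EMR notebook:
# installs a package for this session without needing cluster admin rights.
sc.install_pypi_package("matplotlib")  # can pin a version, e.g. "matplotlib==3.4.3"
sc.list_packages()                     # confirm the package is now visible

import matplotlib
```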

Another option could be to use the %%bash IPython magic to call pip directly.
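A sketch of that, with the caveat that pip installs into whichever machine actually executes the cell; with a sparkmagic wrapper kernel, remote cells go through Livy's plain Python REPL, so this mostly helps the local (%%local) side rather than the cluster session:

```
%%bash
# Installs into the Python environment of the machine running this cell;
# it does not ship the package to the other side of the Livy connection.
pip install matplotlib
```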

nicolaslazo avatar Dec 14 '21 01:12 nicolaslazo

I'll extend slightly on the above request, though I believe the suggestion @kdzhao gave would achieve it.

In my case, I have several notebooks that I use on my EMR clusters, which reuse some key functions that I would ideally like to define in only one place.

Ideally, the %run magic could be adapted to allow running, on the EMR cluster, a Python script that is located on the JupyterLab machine.
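Until something like that exists, here is a workaround sketch using magics sparkmagic already ships, %%local and %%send_to_spark (the file and variable names are hypothetical): read the script's source locally, send it to the session as a string, and exec it there.

```
%%local
# Read the shared script from the JupyterLab machine.
with open("shared_utils.py") as f:
    shared_utils_src = f.read()
```

```
%%send_to_spark -i shared_utils_src -t str -n shared_utils_src
```

```
# This cell runs on the cluster: define the shared functions there.
exec(shared_utils_src)
```

It's clunky compared to a real %run across the wire, but it keeps the shared functions defined in a single file.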

gloisel avatar Dec 21 '21 17:12 gloisel