
Include sparkmagic in pyspark-notebook stack

warrenmsmith opened this issue 4 years ago • 4 comments

Including sparkmagic enables a new use case: it lets me stay remote but use Livy to connect to a remote Spark cluster running in YARN (or any other) mode. This aligns with connecting to remote EMR clusters on AWS, as well as clusters from other providers.

warrenmsmith avatar Dec 08 '20 19:12 warrenmsmith

Hello @warrenmsmith,

Sorry for the late answer. Including Sparkmagic is a good suggestion. However, even if at first sight it seems obvious to add it to the pyspark-notebook image, I wonder whether that is a good idea, since the image is already quite heavy due to the installation of Java and Spark. One of the advantages of using Sparkmagic is precisely to avoid installing Spark locally and all the driver/worker compatibility problems that come with such an installation. What is your opinion?

Many thanks.

romainx avatar Dec 21 '20 15:12 romainx

I agree wholeheartedly, @romainx! The key idea here is quick-and-easy usability and the development principle of "start small, finish big!". Ideally, I would begin development on my local PySpark to work out the kinks, build test cases, and work with small datasets. Then, without changing environments, I connect (via Livy) to EMR and BOOM! I'm running the exact same analysis and code on a large cluster.

To your point, it may make sense to have a Livy-only pyspark notebook, which is lighter weight, but Java and the Spark client are still necessary to run it, aren't they?

The reason I believe that integrating this into the pyspark-notebook image would be super-beneficial is that the installation of sparkmagic, while conceptually straightforward, is non-trivial in practice (especially for neophytes).
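
For context, here is a rough sketch of the manual setup, based on Sparkmagic's upstream documentation; the exact kernels you register and the paths involved are assumptions that vary by environment:

```bash
# Rough sketch of a manual Sparkmagic install, following its upstream documentation.
# Kernel choices and paths are assumptions; adjust to your environment.
pip install sparkmagic

# Sparkmagic relies on ipywidgets for its UI elements.
jupyter nbextension enable --py --sys-prefix widgetsnbextension

# Register the wrapper kernels, run from the sparkmagic installation
# directory (locate it with `pip show sparkmagic`).
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
jupyter-kernelspec install sparkmagic/kernels/sparkkernel

# Enable the server extension.
jupyter serverextension enable --py sparkmagic

# Finally, point ~/.sparkmagic/config.json at your Livy endpoint.
```

None of these steps is hard on its own, but chaining them correctly is exactly the kind of friction a pre-built image removes.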

It MAY make the most sense to simply build another pyspark-notebook layer that has these libraries included. But ultimately, as I look at how the next levels of connectivity are evolving, direct connections to the Spark master are being phased out in favor of Livy. I think it all boils down to the use case for the pyspark-notebook container. Does it exist to enable local-only development and sandboxing? Or does it exist to enable that process as the first part of a complete development workflow?

warrenmsmith avatar Dec 21 '20 15:12 warrenmsmith

👍 thanks for the answer.

> To your point, it may make sense to have a Livy-only pyspark notebook, which is lighter weight, but Java and the Spark client are still necessary to run it, aren't they?

In fact, no: Sparkmagic is only a thin wrapper around the Livy REST API. When you use it, it mimics direct interaction with Spark, but behind the scenes everything is asynchronous HTTP calls between the notebook and the Livy server. So you need neither Java nor Spark locally to use Spark through Sparkmagic / Livy.
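
For illustration, here is a minimal sketch of the kind of Livy REST calls that Sparkmagic wraps; the host name and the ids in the URLs are placeholders, not taken from a real setup:

```bash
# Minimal sketch of the Livy REST calls that Sparkmagic wraps.
# The endpoint is a placeholder; point it at your own Livy server (e.g. an EMR master).
LIVY="http://livy-server:8998"

# 1. Ask Livy to start a PySpark session on the remote cluster.
curl -s -X POST -H 'Content-Type: application/json' \
     -d '{"kind": "pyspark"}' "$LIVY/sessions"
# Poll GET $LIVY/sessions/<id> until its state is "idle".

# 2. Submit code as a statement; it runs on the cluster, not in the notebook container.
curl -s -X POST -H 'Content-Type: application/json' \
     -d '{"code": "sc.parallelize(range(100)).sum()"}' "$LIVY/sessions/0/statements"

# 3. Poll for the result, which comes back as JSON.
curl -s "$LIVY/sessions/0/statements/0"
```

Nothing in that exchange requires a local JVM or Spark distribution, only an HTTP client.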

> Does it exist to enable local-only development and sandboxing? Or does it exist to enable that process as the first part of a complete development workflow?

I do not know the original reason behind the existence of pyspark-notebook; however, the local-only development you are talking about could be a reasonable use case, at least for prototyping and testing. But you're right: when it comes to using an existing cluster from a notebook, it sounds difficult to use this image as is, since the versions of Spark and even Python have to be aligned, and that is very difficult to achieve without keeping both stacks in sync. So in that case a sparkmagic-notebook image could definitely answer this need, since it does not come with the same requirements. On the other hand, it would be useless if your objective is local-only development, because you would need a Livy server.

romainx avatar Dec 21 '20 16:12 romainx

For information, PR #974 was a proposal for this new image.

romainx avatar Dec 30 '20 07:12 romainx

@mathbunnyru Since we decided to close the PR, this ticket should be closed as well. Thx

Bidek56 avatar Sep 28 '22 18:09 Bidek56