Is it possible to use a provided Spark?
Is there any way to use a provided Spark instead of downloading it in a notebook? In my case, I'm installing Jupyter on Dataproc, where the Spark package is already provided.
It seems to be possible if SPARK_HOME can be configured.
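For instance, something along these lines, just a hypothetical sketch assuming the kernel can read SPARK_HOME (the fallback path below is Dataproc's usual install location, not something almond defines):

```scala
// Hypothetical sketch: locate the jars of a provided Spark distribution via SPARK_HOME.
// The fallback path is an assumption (Dataproc's usual install location).
val sparkHome = sys.env.getOrElse("SPARK_HOME", "/usr/lib/spark")

val providedJars: Seq[String] =
  Option(new java.io.File(s"$sparkHome/jars").listFiles())
    .toSeq
    .flatten
    .filter(_.getName.endsWith(".jar"))
    .map(_.getAbsolutePath)

// These paths could then be handed to the Spark session, e.g. through spark.jars.
println(s"Found ${providedJars.size} jars under $sparkHome/jars")
```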
I'd say this is the most common deployment type (i.e. Spark being provided by the container) for businesses.
@aishfenton I agree… Yet this poses a number of challenges.
When running Spark calculations from the kernel, the kernel acts as the driver. Its classpath is that of almond, plus the user-added dependencies. If one relies on a Spark distribution, the classpath of the executors corresponds to the jars in the Spark distribution (plus those passed via spark.jars, I think).
That means the classpaths on the driver (almond) and on the executors (the Spark distribution) don't necessarily match.
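To illustrate, here is a rough way to eyeball such a mismatch from a notebook. It's only a sketch: it assumes a SparkSession named `spark` already exists, and spark.jars holds URIs while the driver classpath holds file paths, so the comparison is crude at best.

```scala
// Rough debugging sketch: compare what the driver sees with what executors will fetch.
// Assumes a SparkSession named `spark` already exists in the notebook.
// Note: spark.jars holds URIs while java.class.path holds file paths, so this
// comparison is only a crude first pass.
val driverClasspath: Set[String] =
  sys.props("java.class.path").split(java.io.File.pathSeparator).toSet

val executorJars: Set[String] =
  spark.sparkContext.getConf
    .getOption("spark.jars")
    .map(_.split(",").toSet)
    .getOrElse(Set.empty[String])

println(s"Driver-only entries:   ${(driverClasspath -- executorJars).size}")
println(s"Executor-only entries: ${(executorJars -- driverClasspath).size}")
```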
I ran into numerous issues even with (very) minor differences between the driver and executor classpaths (like two versions of the scala-library JAR landing on the executor classpath, something like 2.11.2 and 2.11.7 IIRC, making List deserialization fail).
In the past, I worked around that by using a vendored Spark version as a Maven dependency from almond (rather than via a Spark distribution), and only using the Spark configuration files from the Spark distribution.
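Concretely, that approach looks roughly like the usual almond-spark setup, with Spark coming from Maven rather than from the distribution (versions below are placeholders, not a recommendation):

```scala
// Spark pulled in as a Maven dependency from the notebook, not from a distribution.
// Versions are placeholders; pick ones matching your cluster and almond build.
import $ivy.`org.apache.spark::spark-sql:2.4.0`
import $ivy.`sh.almond::almond-spark:0.6.0`

import org.apache.spark.sql._

// NotebookSparkSession comes from almond-spark; it ships the notebook's
// dependencies to the executors and shows progress bars in the cell output.
val spark = NotebookSparkSession.builder()
  .master("yarn") // or "local[*]" when testing locally
  .getOrCreate()
```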
Yet @dynofu seems to have successfully used a Spark distribution via ammonite-spark. I don't know how far he went, though…
You can take a look at my scripts built on top of ammonite-spark: https://github.com/dyno/ammonite_with_spark_on_emr. The spark.jars setting will use whatever is already on the EMR cluster, via ammonite-spark: https://github.com/dyno/ammonite_with_spark_on_emr/blob/master/emr.sc#L33.
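Roughly, the trick is to point spark.jars at the jars already installed on the cluster. A simplified sketch (the path is the usual EMR location, versions are placeholders, and the real script handles more details):

```scala
// Simplified sketch (not a copy of emr.sc): reuse the jars already installed
// on the cluster instead of shipping them again.
// /usr/lib/spark/jars is the usual EMR location and is an assumption here;
// versions below are placeholders.
import $ivy.`org.apache.spark::spark-sql:2.4.4`
import $ivy.`sh.almond::ammonite-spark:0.4.2`

import org.apache.spark.sql._

val clusterJars: Seq[String] =
  Option(new java.io.File("/usr/lib/spark/jars").listFiles())
    .toSeq
    .flatten
    .filter(_.getName.endsWith(".jar"))
    .map(_.toURI.toString)

val spark = AmmoniteSparkSession.builder()
  .master("yarn")
  .config("spark.jars", clusterJars.mkString(","))
  .getOrCreate()
```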
If one were to get a Spark distribution working via ammonite-spark, what more would be needed to surface the same functionality within an almond kernel?