mrjob
allow Spark master to be specified with 'spark.master'
Currently the Spark runner expects the Spark master to be passed with the `--spark-master` option. However, it also accepts arbitrary Spark configuration properties in the form `--jobconf PROP=VALUE`. Spark allows the master to be specified with the `spark.master` property, so the Spark runner should ideally understand `--jobconf spark.master=MASTER` as well.
This actually applies to all runners, so I updated the description.
Need to check what happens if we pass conflicting `--conf spark.master=...` and `--master=...` to `spark-submit`.
It looks like it's order-dependent; `spark-submit` basically treats `--master=...` as an alias for `--conf spark.master=...`.
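One plausible reading of that order-dependent aliasing is simple last-one-wins resolution over the argument list. Here's a minimal sketch (not `spark-submit`'s actual code; `resolve_master` is a hypothetical name) of what that behavior would look like:

```python
# Hypothetical sketch of last-one-wins resolution, mimicking how
# spark-submit appears to merge --master and --conf spark.master=...
def resolve_master(args):
    """Scan spark-submit-style args in order; the last setting of the
    master (via either --master or --conf spark.master=...) wins."""
    master = None
    i = 0
    while i < len(args):
        if args[i] == '--master':
            master = args[i + 1]
            i += 2
        elif args[i] == '--conf' and args[i + 1].startswith('spark.master='):
            master = args[i + 1].split('=', 1)[1]
            i += 2
        else:
            i += 1
    return master
```

Under this reading, `--master yarn --conf spark.master=local[4]` would resolve to `local[4]`, and the reverse order would resolve to `yarn`.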
I can see this potentially being an issue for some users, but for now we tell people that Spark master and deploy mode should be set explicitly (or that it's hard-coded for a particular runner).
Probably the simplest way to implement this in mrjob is the opposite of the way `spark-submit` does it. Basically, when `spark.master` is in the dictionary for the `jobconf` opt, we override the `spark_master` opt as well.
If we want to be extra tidy, we can avoid setting `--conf spark.master=...` in `spark-submit` command lines, using `--master` instead.
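That tidying might look something like this (a sketch, not mrjob's actual code; `spark_submit_args` is a hypothetical helper): pop `spark.master` out of the jobconf dict and emit it as `--master`, so it never appears as a `--conf` property.

```python
# Sketch: build spark-submit args, pulling spark.master out of jobconf
# and emitting it via --master rather than --conf spark.master=...
def spark_submit_args(spark_master, jobconf):
    jobconf = dict(jobconf)  # don't mutate the caller's dict
    # spark.master from jobconf wins over the spark_master opt
    master = jobconf.pop('spark.master', spark_master)

    args = []
    if master:
        args.extend(['--master', master])
    for prop, value in sorted(jobconf.items()):
        args.extend(['--conf', '%s=%s' % (prop, value)])
    return args
```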