spark-ec2
--spark-ec2-compressed option added.
Description
The `--spark-ec2-compressed` option lets you specify a compressed archive of spark-ec2. It is an alternative to cloning spark-ec2 from GitHub.
Accepted archive formats
.tar, .tar.gz, .tar.bz2, .tar.xz
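As a rough illustration of the format list above, a check on the archive name could look like the following. This is only a sketch: the helper name `is_accepted_archive` is hypothetical and is not taken from the patch itself.

```python
# Suffixes listed in the PR description as accepted archive formats.
ACCEPTED_SUFFIXES = (".tar", ".tar.gz", ".tar.bz2", ".tar.xz")


def is_accepted_archive(path):
    """Return True if the file name ends with one of the accepted suffixes.

    Hypothetical helper for illustration; the comparison is case-insensitive.
    """
    return path.lower().endswith(ACCEPTED_SUFFIXES)


if __name__ == "__main__":
    for name in ("spark-ec2.tar.gz", "spark-ec2.zip"):
        print(name, is_accepted_archive(name))
```

Checking the suffix only validates the name, not the file contents, so a real implementation would probably still need to handle extraction errors.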
Could you explain the motivation for this change?
I worked for a company that has a private GitLab. This feature was useful to me because I cannot access that GitLab from the outside.
It would also be nice to have an alternative to GitHub: if there is a problem with GitHub or with the repository, you could continue to deploy clusters without wasting time.
Hmm, but the `git clone` here is happening on the master machine. Is the assumption that the master machine cannot access artifacts from the public internet? In that case a lot of other things, like installing Spark or HDFS, will also fail?
It doesn't assume that the master has no access to the Internet.
It is meant for cases like these:
- GitHub service outage (https://status.github.com/messages)
- The spark-ec2 repository gets corrupted or deleted
In that case, can we simplify this and just take a URL to a tgz that can be fetched with `wget` on the master? That will simplify the code further, and even GitHub has URLs of the form https://github.com/amplab/spark-ec2/archive/branch-1.6.zip
Great idea :+1:! So we can remove `git clone` and `rsync`, and replace them with a simple `wget`?
To be more conservative, I'd make the zip file path a command-line option, and if the option is present, we can use `wget`. If not, it will still use the existing code path.
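The conservative approach suggested here could be sketched roughly as follows. This is an illustration under several assumptions, not the code from the PR: the function name `build_fetch_command`, the parameter `archive_url`, and the default repository and branch are hypothetical placeholders (the branch is borrowed from the example URL in this thread), and the extraction step assumes GNU `tar` on the master and an archive with a single top-level directory. `shlex.quote` requires Python 3.

```python
import shlex

# Assumed defaults for illustration only.
DEFAULT_REPO = "https://github.com/amplab/spark-ec2"
DEFAULT_BRANCH = "branch-1.6"


def build_fetch_command(archive_url=None, repo=DEFAULT_REPO, branch=DEFAULT_BRANCH):
    """Build the shell command to run on the master to obtain spark-ec2.

    If archive_url is given, download and unpack it with wget and tar.
    Otherwise fall back to a git clone of the given repo and branch.
    """
    if archive_url:
        # GNU tar detects the compression type on extraction;
        # --strip-components=1 drops the archive's top-level directory.
        return (
            "rm -rf spark-ec2 spark-ec2-archive && mkdir spark-ec2 && "
            "wget -q -O spark-ec2-archive {url} && "
            "tar -xf spark-ec2-archive -C spark-ec2 --strip-components=1"
        ).format(url=shlex.quote(archive_url))
    # No archive given: keep the clone-based path.
    return "rm -rf spark-ec2 && git clone {r} -b {b} spark-ec2".format(
        r=shlex.quote(repo), b=shlex.quote(branch)
    )


if __name__ == "__main__":
    print(build_fetch_command())
    print(build_fetch_command("https://example.com/spark-ec2.tar.gz"))
```

Keeping the clone path as the default means existing users see no change in behavior unless they pass the new option.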