spark-ec2
Allow to specify hadoop minor version (2.4 and 2.6 at the moment)
Thanks @jamborta - I will take a look at this soon. Meanwhile, if anybody else can try this out and give any comments, that would be great!
Added Hadoop 2.7 as it has much better S3 support.
@shivaram it would be good to add Hadoop 2.6 and Hadoop 2.7 to your spark-related-packages S3 bucket, as pulling them from www.apache.org gets slow from time to time.
@shivaram updated the code to handle the 4 distinct ranges of Spark versions.
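(As an illustration, mapping Spark version ranges to a default Hadoop version could look roughly like the sketch below; the function name and the exact range boundaries are assumptions, not the PR's actual code.)

```python
# Illustrative sketch only: the range boundaries and function name are
# assumptions, not the exact logic in this PR.
from distutils.version import LooseVersion

def default_hadoop_version(spark_version):
    """Pick a default Hadoop version for a given Spark release."""
    v = LooseVersion(spark_version)
    if v < LooseVersion("1.0.0"):
        return "1"      # oldest releases only shipped Hadoop 1 builds
    elif v < LooseVersion("1.4.0"):
        return "2"      # generic "hadoop2" builds
    elif v < LooseVersion("2.0.0"):
        return "2.4"
    else:
        return "2.7"    # Spark 2.x binaries are built against Hadoop 2.6/2.7

print(default_hadoop_version("2.0.1"))  # -> 2.7
```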
@shivaram it would also be good to include hadoop-2.7.0 and hadoop-2.6.0 in your S3 bucket; it takes about 10 minutes to pull those from apache.org (and needs to be done twice).
@jamborta - I agree it would be more convenient to have Hadoop hosted on there as well, but the last time I brought this issue up I was told that going forward only Spark packages would be hosted on S3.
The way I got around this on my own project, Flintrock, is by downloading from the (often slow) Apache mirrors by default, but allowing users to specify an alternate download location if desired (a rough sketch of that approach follows below). Dunno if we want to do the same for spark-ec2.
If Databricks or the AMPLab (I'm not sure who owns the spark-related-packages S3 bucket) hosts the various versions of Hadoop on there, it would be more convenient for everybody, but I'm not sure they want to take on that responsibility going forward. And I can understand why, since it's a cost (in time and money) and there are alternative solutions out there already.
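(For illustration, the Flintrock-style fallback mentioned above could be sketched roughly as below; the default URL, helper name, and flag are assumptions, not actual Flintrock or spark-ec2 code.)

```python
# Sketch only: default to the (often slow) Apache archive, but let the user
# point at a faster mirror or an S3 copy. URL and names are assumptions.
DEFAULT_HADOOP_URL = (
    "https://archive.apache.org/dist/hadoop/common/"
    "hadoop-2.7.0/hadoop-2.7.0.tar.gz"
)

def hadoop_download_url(user_url=None):
    """Return the user-specified download location if given, else the default."""
    return user_url if user_url else DEFAULT_HADOOP_URL

# e.g. a hypothetical --hadoop-download-source flag could feed user_url
print(hadoop_download_url())
print(hadoop_download_url("https://s3.amazonaws.com/my-bucket/hadoop-2.7.0.tar.gz"))
```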
Yeah, it's awkward that Apache / Amazon don't offer a fast way to download Hadoop on EC2. I'm talking to @pwendell and @JoshRosen to see if we can add Hadoop stuff to spark-related-packages.
Thanks to @JoshRosen I now have permission to upload to the spark-related-packages bucket. Right now the Hadoop 2.6 and Hadoop 2.7 tar.gz files should be up there. @jamborta Could you test this and let me know if they work fine?
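(A quick way to sanity-check the uploads, as a sketch; the exact object key names in the bucket are an assumption here.)

```python
# Sketch: HEAD each expected tarball in the spark-related-packages bucket.
# The object key names are assumptions; adjust to whatever was actually uploaded.
import urllib.error
import urllib.request

for name in ["hadoop-2.6.0.tar.gz", "hadoop-2.7.0.tar.gz"]:
    url = "https://s3.amazonaws.com/spark-related-packages/" + name
    req = urllib.request.Request(url, method="HEAD")  # avoid downloading the body
    try:
        with urllib.request.urlopen(req) as resp:
            print(url, resp.getcode())  # expect 200 if the tarball is up
    except urllib.error.HTTPError as e:
        print(url, "HTTP error", e.code)
```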
Not to distract from the PR, but is it now "official" that new releases of Hadoop will be uploaded to S3 going forward? Or is this just a one-off?
I would like to keep this a one-off and not make it "official", as I don't think we have enough resources to track all Hadoop releases and keep this up to date (like, say, we do for Spark). However, I think we can undertake hosting every Hadoop version that Spark release binaries are built for (right now this list is 2.3, 2.4, 2.6 and 2.7 for the 2.0.1 release). Does that sound good?
Yes, that sounds good to me. The releases of Hadoop that Spark targets are the only ones that users of spark-ec2, Flintrock, and similar tools are going to care about anyway.
@shivaram updated the S3 path in the code. All works fine.
It looks like there are some conflicts - could you resolve them?
just done
@shivaram updated the code based on your comments. validate_spark_hadoop_version in Python should capture all the checks that are needed, but I still kept the same logic in spark/init.sh. I think some of the logic can be removed from init.sh if we don't want to duplicate the checks.
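(For reference, a minimal sketch of what a check like validate_spark_hadoop_version could look like on the Python side; the supported-version table below is taken from this thread's discussion and is illustrative, not the exact table in the PR.)

```python
# Illustrative sketch of a Spark/Hadoop compatibility check; the version table
# here (2.3, 2.4, 2.6, 2.7 for Spark 2.x) is assumed from this discussion.
import sys

SUPPORTED_HADOOP_VERSIONS = ["2.3", "2.4", "2.6", "2.7"]

def validate_spark_hadoop_version(spark_version, hadoop_version):
    """Exit with an error if no prebuilt Spark package exists for this Hadoop version."""
    if hadoop_version not in SUPPORTED_HADOOP_VERSIONS:
        print("Spark %s has no prebuilt package for Hadoop %s (supported: %s)"
              % (spark_version, hadoop_version, ", ".join(SUPPORTED_HADOOP_VERSIONS)))
        sys.exit(1)

validate_spark_hadoop_version("2.0.1", "2.7")    # passes silently
# validate_spark_hadoop_version("2.0.1", "2.5")  # would print an error and exit
```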
@shivaram - Sorry to keep butting in on this PR.
If we are going to start hosting Hadoop on S3, shouldn't we use the latest maintenance releases?
In other words, 2.7.3 instead of 2.7.0, 2.6.5 instead of 2.6.0, etc...
@nchammas That's a reasonable question - I uploaded 2.7.0 as @jamborta had used that in the PR. On the one hand, I don't mind starting off with 2.7.3 right now, as it's better to have the maintenance release. On the other hand, we might have to pick the latest maintenance version as of now and live with it, as I don't think we can keep updating this when, say, 2.7.4 comes out.
@jamborta Any thoughts on this ?
I agree that we could put up the latest maintenance releases for 2.4, 2.6 and 2.7 as of now, and stick with those.
@shivaram if you upload the latest releases I can update this PR.
@jamborta 2.7.3 and 2.6.5 are now in the same S3 bucket
@shivaram code updated