spark-ec2
Allow to specify hadoop minor version (2.4 and 2.6 at the moment)
Thanks @jamborta - I will take a look at this soon. Meanwhile, if anybody else can try this out and give any comments, that would be great!
Added Hadoop 2.7 as it has much better S3 support.
@shivaram it would be good to add Hadoop 2.6 and Hadoop 2.7 to your spark-related-packages S3 bucket, as pulling them from www.apache.org gets slow from time to time.
@shivaram updated the code to handle the 4 distinct ranges of Spark versions.
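(As an illustration, mapping Spark version ranges to a default Hadoop version could look roughly like the sketch below; the function name and the exact range boundaries are assumptions, not the PR's actual code.)

```python
# Illustrative sketch only: the range boundaries and function name are
# assumptions, not the exact logic in this PR.
from distutils.version import LooseVersion

def default_hadoop_version(spark_version):
    """Pick a default Hadoop version for a given Spark release."""
    v = LooseVersion(spark_version)
    if v < LooseVersion("1.0.0"):
        return "1"      # oldest releases only shipped Hadoop 1 builds
    elif v < LooseVersion("1.4.0"):
        return "2"      # generic "hadoop2" builds
    elif v < LooseVersion("2.0.0"):
        return "2.4"
    else:
        return "2.7"    # Spark 2.x binaries are built against Hadoop 2.6/2.7

print(default_hadoop_version("2.0.1"))  # -> 2.7
```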
@shivaram it would also be good to include hadoop-2.7.0 and hadoop-2.6.0 in your S3 bucket; it takes about 10 minutes to pull those from apache.org (and needs to be done twice).
@jamborta - I agree it would be more convenient to have Hadoop hosted on there as well, but the last time I brought this issue up I was told that going forward only Spark packages would be hosted on S3.
The way I got around this on my own project, Flintrock, is by downloading from the (often slow) Apache mirrors by default, but allowing users to specify an alternate download location if desired (a rough sketch of that approach follows below). Dunno if we want to do the same for spark-ec2.
If Databricks or the AMPLab (I'm not sure who owns the spark-related-packages S3 bucket) hosts the various versions of Hadoop on there, it would be more convenient for everybody, but I'm not sure they want to take on that responsibility going forward. And I can understand why, since it's a cost (in time and money) and there are alternative solutions out there already.
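(For illustration, the Flintrock-style fallback mentioned above could be sketched roughly as below; the default URL, helper name, and flag are assumptions, not actual Flintrock or spark-ec2 code.)

```python
# Sketch only: default to the (often slow) Apache archive, but let the user
# point at a faster mirror or an S3 copy. URL and names are assumptions.
DEFAULT_HADOOP_URL = (
    "https://archive.apache.org/dist/hadoop/common/"
    "hadoop-2.7.0/hadoop-2.7.0.tar.gz"
)

def hadoop_download_url(user_url=None):
    """Return the user-specified download location if given, else the default."""
    return user_url if user_url else DEFAULT_HADOOP_URL

# e.g. a hypothetical --hadoop-download-source flag could feed user_url
print(hadoop_download_url())
print(hadoop_download_url("https://s3.amazonaws.com/my-bucket/hadoop-2.7.0.tar.gz"))
```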
Yeah, it's awkward that Apache / Amazon don't offer a fast way to download Hadoop on EC2. I'm talking to @pwendell and @JoshRosen to see if we can add Hadoop stuff to spark-related-packages.
Thanks to @JoshRosen I now have permission to upload to the spark-related-packages bucket. Right now the Hadoop 2.6 and Hadoop 2.7 tar.gz files should be up there. @jamborta Could you test this and let me know if they work fine?
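(A quick way to sanity-check the uploads, as a sketch; the exact object key names in the bucket are an assumption here.)

```python
# Sketch: HEAD each expected tarball in the spark-related-packages bucket.
# The object key names are assumptions; adjust to whatever was actually uploaded.
import urllib.error
import urllib.request

for name in ["hadoop-2.6.0.tar.gz", "hadoop-2.7.0.tar.gz"]:
    url = "https://s3.amazonaws.com/spark-related-packages/" + name
    req = urllib.request.Request(url, method="HEAD")  # avoid downloading the body
    try:
        with urllib.request.urlopen(req) as resp:
            print(url, resp.getcode())  # expect 200 if the tarball is up
    except urllib.error.HTTPError as e:
        print(url, "HTTP error", e.code)
```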
Not to distract from the PR, but is it now "official" that new releases of Hadoop will be uploaded to S3 going forward? Or is this just a one-off?
I would like to keep this a one-off and not make it "official", as I don't think we have enough resources to track all Hadoop releases and keep this up to date (like, say, we do for Spark). However, I think we can undertake hosting every Hadoop version that Spark release binaries are built for (right now this list is 2.3, 2.4, 2.6 and 2.7 for the 2.0.1 release). Does that sound good?
Yes, that sounds good to me. The releases of Hadoop that Spark targets are the only ones that users of spark-ec2, Flintrock, and similar tools are going to care about anyway.
@shivaram updated the S3 path in the code. All works fine.
It looks like there are some conflicts - could you resolve them?
just done
@shivaram updated the code based on your comments. validate_spark_hadoop_version in Python should capture all the checks that are needed, but I still kept the same logic in spark/init.sh. I think some of the logic can be removed from init.sh if we don't want to duplicate the checks.
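(For reference, a minimal sketch of what a check like validate_spark_hadoop_version could look like on the Python side; the supported-version table below is taken from this thread's discussion and is illustrative, not the exact table in the PR.)

```python
# Illustrative sketch of a Spark/Hadoop compatibility check; the version table
# here (2.3, 2.4, 2.6, 2.7 for Spark 2.x) is assumed from this discussion.
import sys

SUPPORTED_HADOOP_VERSIONS = ["2.3", "2.4", "2.6", "2.7"]

def validate_spark_hadoop_version(spark_version, hadoop_version):
    """Exit with an error if no prebuilt Spark package exists for this Hadoop version."""
    if hadoop_version not in SUPPORTED_HADOOP_VERSIONS:
        print("Spark %s has no prebuilt package for Hadoop %s (supported: %s)"
              % (spark_version, hadoop_version, ", ".join(SUPPORTED_HADOOP_VERSIONS)))
        sys.exit(1)

validate_spark_hadoop_version("2.0.1", "2.7")    # passes silently
# validate_spark_hadoop_version("2.0.1", "2.5")  # would print an error and exit
```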
@shivaram - Sorry to keep butting in on this PR.
If we are going to start hosting Hadoop on S3, shouldn't we use the latest maintenance releases?
In other words, 2.7.3 instead of 2.7.0, 2.6.5 instead of 2.6.0, etc...
@nchammas That's a reasonable question - I uploaded 2.7.0 as @jamborta had used that in the PR. On the one hand, I don't mind starting off with 2.7.3 right now, as it's better to have the maintenance release. On the other hand, we might have to pick the latest maintenance version as of now and live with it, as I don't think we can keep updating this when, say, 2.7.4 comes out.
@jamborta Any thoughts on this ?
I agree that we could put up the latest maintenance releases for 2.4, 2.6 and 2.7 as of now, and stick with those.
@shivaram if you upload the latest releases I can update this PR.
@jamborta 2.7.3 and 2.6.5 are now in the same S3 bucket
@shivaram code updated