
EMR + Mahout demo

Open coyotemarin opened this issue 11 years ago • 3 comments

EMR's 3.x AMIs include Mahout 0.8. It would be great to have a demo that uses Mahout and that anyone can run on EMR.

coyotemarin, Dec 20 '13 19:12

Basically, we want to run something like hadoop jar /home/hadoop/mahout/mahout-core-<version>-job.jar org.apache.mahout.clustering.kmeans.KMeansDriver <args>. We should be able to get the version of Mahout from the API.
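A minimal sketch of how that command line could be assembled once the Mahout version is known. The helper name `mahout_command` is hypothetical, the version is passed in by hand rather than looked up through any API, and the driver flags in the usage example are placeholders that should be checked against the Mahout version actually installed:

```python
import shlex

# Jar location and driver class as quoted in the comment above; the
# {version} slot is filled in by the caller.
MAHOUT_JAR_TEMPLATE = "/home/hadoop/mahout/mahout-core-{version}-job.jar"
KMEANS_DRIVER = "org.apache.mahout.clustering.kmeans.KMeansDriver"


def mahout_command(version, driver_args, main_class=KMEANS_DRIVER):
    """Return the `hadoop jar ...` invocation as a list of arguments.

    `version` is assumed to be supplied by the caller (for example "0.8");
    discovering it automatically is left out of this sketch.
    """
    jar = MAHOUT_JAR_TEMPLATE.format(version=version)
    return ["hadoop", "jar", jar, main_class] + list(driver_args)


if __name__ == "__main__":
    # Placeholder arguments, for illustration only.
    cmd = mahout_command("0.8", ["-i", "input/", "-o", "output/"])
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess` without shell quoting issues.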

coyotemarin, May 27 '16 21:05

Also, it looks like Mahout uses sequence files. There appear to be input formats that convert sequence files to/from text; here's an example on Stack Overflow: http://stackoverflow.com/questions/5060967/how-to-use-hadoop-streaming-with-lzo-compressed-sequence-files/6364689#6364689

coyotemarin, May 27 '16 22:05

Probably want a MahoutStep class that's basically JarStep where mrjob automatically finds the jar for you.
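One way the idea might look, as a standalone sketch rather than actual mrjob code. `MahoutStepSketch` and its methods are hypothetical; a real `MahoutStep` would presumably subclass mrjob's `JarStep`, and the `jar`, `main_class`, and `args` keys below are assumed to line up with what `JarStep` accepts. The jar path pattern is the one from the earlier comment:

```python
MAHOUT_JAR_TEMPLATE = "/home/hadoop/mahout/mahout-core-{version}-job.jar"


class MahoutStepSketch(object):
    """Hypothetical stand-in for a MahoutStep: like a jar step, except the
    caller names only the Mahout class and the jar path is worked out later."""

    def __init__(self, main_class, args=None, mahout_version=None):
        self.main_class = main_class
        self.args = list(args or [])
        # Optional override; otherwise the runner would supply the version.
        self.mahout_version = mahout_version

    def jar_path(self, detected_version=None):
        """Resolve the jar path from an explicit or detected version."""
        version = self.mahout_version or detected_version
        if not version:
            raise ValueError("Mahout version is unknown")
        return MAHOUT_JAR_TEMPLATE.format(version=version)

    def jar_step_kwargs(self, detected_version=None):
        """Return keyword arguments describing the equivalent jar step."""
        return {
            "jar": self.jar_path(detected_version),
            "main_class": self.main_class,
            "args": list(self.args),
        }


if __name__ == "__main__":
    step = MahoutStepSketch(
        "org.apache.mahout.clustering.kmeans.KMeansDriver",
        args=["-i", "input/", "-o", "output/"],  # placeholder arguments
    )
    # "0.8" stands in for whatever version the runner detects.
    print(step.jar_step_kwargs(detected_version="0.8"))
```

Deferring the jar lookup until the version is known keeps the step definition independent of any particular AMI.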

coyotemarin, May 27 '16 22:05