mrjob
EMR + Mahout demo
EMR's 3.x AMIs include Mahout 0.8. It would be great to have a demo that uses Mahout and that anyone can run on EMR.
Basically, we want to run something like:
hadoop jar /home/hadoop/mahout/mahout-core-<version>-job.jar org.apache.mahout.clustering.kmeans.KMeansDriver <args>
We should be able to get the version of Mahout from the API.
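To make the shape of that command concrete, here is a minimal sketch that assembles it as an argument list once the Mahout version is known. The function name `mahout_kmeans_command` and its `mahout_dir` parameter are made up for illustration and are not part of mrjob; how the version is actually looked up is left open.

```python
def mahout_kmeans_command(version, args, mahout_dir="/home/hadoop/mahout"):
    """Build the `hadoop jar` argv for Mahout's KMeansDriver.

    Hypothetical helper: `version` is assumed to come from elsewhere
    (for example "0.8" on EMR's 3.x AMIs).
    """
    # Jar name follows the pattern mahout-core-<version>-job.jar
    jar = "{}/mahout-core-{}-job.jar".format(mahout_dir, version)
    main_class = "org.apache.mahout.clustering.kmeans.KMeansDriver"
    return ["hadoop", "jar", jar, main_class] + list(args)


if __name__ == "__main__":
    # Placeholder driver arguments, only to show where they end up
    print(" ".join(mahout_kmeans_command("0.8", ["--input", "in", "--output", "out"])))
```

Returning a list rather than a single string avoids shell-quoting problems if the command is later handed to `subprocess`.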
Also, it looks like Mahout uses sequence files. There appear to be input formats that convert sequence files to/from text; here's an example on Stack Overflow: http://stackoverflow.com/questions/5060967/how-to-use-hadoop-streaming-with-lzo-compressed-sequence-files/6364689#6364689
Probably want a MahoutStep class that's basically JarStep where mrjob automatically finds the jar for you.
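A rough sketch of what such a step might look like, written as a standalone class so it does not depend on mrjob's actual JarStep signature. Everything here is hypothetical: the `find_jar` and `command` methods, the `mahout_dir` default, and the choice to locate the jar by globbing the directory instead of asking the API for the version.

```python
import glob
import os


class MahoutStep(object):
    """Hypothetical step: like a jar step, but it locates the Mahout job jar itself."""

    def __init__(self, main_class, args=None, mahout_dir="/home/hadoop/mahout"):
        self.main_class = main_class
        self.args = list(args or [])
        self.mahout_dir = mahout_dir

    def find_jar(self):
        # Look for mahout-core-<version>-job.jar without knowing the version
        pattern = os.path.join(self.mahout_dir, "mahout-core-*-job.jar")
        matches = sorted(glob.glob(pattern))
        if not matches:
            raise IOError("no Mahout job jar found in %s" % self.mahout_dir)
        # Assumption: if several jars match, take the lexicographically last one
        return matches[-1]

    def command(self):
        # Same shape as: hadoop jar <jar> <main class> <args>
        return ["hadoop", "jar", self.find_jar(), self.main_class] + self.args
```

On a machine without Mahout installed, `find_jar()` raises `IOError`; a real implementation would presumably resolve the jar on the cluster rather than locally.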