mrjob
Run MapReduce jobs on Hadoop or Amazon Web Services
EMR's 3.x AMIs include Mahout 0.8. Would be great to have an awesome demo that uses Mahout, that anyone can run on EMR.
Currently if mrjob can't fetch logs via SSH, it goes to S3 for them. When it does this fallback, it waits for 10 minutes for the logs to be fully...
It would be nice to be able to use some of the built-in mappers/reducers from Java for efficiency reasons (e.g. org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer). It would probably look something like this: ``` def steps(self):...
Currently there seem to be no tests of what happens when a job run by the sim runners throws an exception. We need to test: - [ ] in inline...
We should enable the ability to use a custom machine image (see #1805) on Dataproc.
```
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield (word, 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordCount.run()
```
I...
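For anyone who wants to check what the word-count job quoted above should produce without installing mrjob or starting a cluster, here is a minimal plain-Python sketch. It reuses the same mapper and reducer logic as free functions; `run_local` is a hypothetical helper invented for this illustration and is not part of mrjob, and the grouping step is only a stand-in for Hadoop's shuffle and sort, not how mrjob's runners actually work.

```python
from collections import defaultdict


def mapper(_, line):
    # Same logic as MRWordCount.mapper: emit (word, 1) for every word.
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    # Same logic as MRWordCount.reducer: sum the counts for one word.
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical helper (not an mrjob API): map, group by key, reduce."""
    # Map phase, with grouping by key standing in for shuffle and sort.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)

    # Reduce phase, visiting keys in sorted order.
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


if __name__ == '__main__':
    for word, count in run_local(["a b a", "b c"]).items():
        print(word, count)
```

Comparing this output against the real job's output on the same small input may help narrow down whether a problem lies in the job logic or in the runner setup.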
Now that Amazon bills by the second rather than the full hour, cluster pooling is not usually a good way to save money. However, it does save you from having...
This seems to be an issue specific to jobs and clusters that are currently running. Possibly we're using different values of "now", and the script is running a long time?
Currently, we fetch bootstrap logs from S3 only. This means we have to wait an extra few minutes for the cluster to terminate before we can find out why bootstrapping...
It would be great to be able to assign an input/output/internal protocol to a step more explicitly in the `Job.steps()` method. For example: ``` python def steps(self): return [MRStep(mapper=None, reducer=None, output_protocol=FooBar),...