mrjob issues

EMR + Mahout demo

3

EMR's 3.x AMIs include Mahout 0.8. Would be great to have an awesome demo that uses Mahout, that anyone can run on EMR.

coyotemarin

Feature

0.5.0: poll EMR logs for counters

6

Currently if mrjob can't fetch logs via SSH, it goes to S3 for them. When it does this fallback, it waits for 10 minutes for the logs to be fully...

andrey-cliqz

Cleanup
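
A fixed wait could be replaced by polling with a deadline, which is roughly what this issue's title suggests. The following is only a minimal sketch: `poll_until` and its parameters are hypothetical names, not part of mrjob, and the 600-second default simply mirrors the 10 minutes mentioned above.

``` python
import time


def poll_until(check, timeout=600.0, interval=30.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns something other than None.

    Hypothetical helper (not mrjob API). Returns the first non-None
    result, or None if `timeout` seconds pass without one.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if clock() >= deadline:
            # give up: the logs never showed up within the timeout
            return None
        sleep(interval)
```

Here `check` would be whatever function looks for the logs and returns parsed counters once they exist. Injecting `sleep` and `clock` keeps the helper testable without real waiting.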

allow mrjob to use Java mappers/reducers for some steps

11

It would be nice to be able to use some of the built-in mappers/reducers from Java for efficiency reasons (e.g. org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer). Probably would look something like this: ``` def steps(self):...

coyotemarin

Feature

test exception handling in sim runners

Currently there seem to be no tests of what happens when a job run by the sim runners throws an exception. We need to test: - [ ] in inline...

coyotemarin

Testing

support image_id on Dataproc

We should enable the ability to use a custom machine image (see #1805) on Dataproc.

coyotemarin

Feature

Permission Error on Windows

4

``` python
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield (word, 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordCount.run()
```

I...

HOLMEScdk

Bug

mrjob audit-emr-usage should show bootstrap savings due to cluster sharing

3

Now that Amazon bills by the second rather than the full hour, cluster pooling is not usually a good way to save money. However, it does save you from having...

coyotemarin

Feature
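
To make the billing claim concrete, here is a small arithmetic sketch comparing the two billing models for a single cluster run. The function names are hypothetical, and the sketch deliberately ignores any minimum charge or other pricing details.

``` python
import math


def billed_seconds_hourly(run_seconds):
    """Seconds billed when usage is rounded up to full hours."""
    return math.ceil(run_seconds / 3600) * 3600


def billed_seconds_per_second(run_seconds):
    """Seconds billed under per-second billing (no rounding)."""
    return run_seconds


def unused_billed_seconds(run_seconds):
    """Billed-but-idle time that full-hour billing leaves on the table."""
    return billed_seconds_hourly(run_seconds) - run_seconds
```

Under full-hour billing, a 10-minute job leaves 50 billed minutes that a pooled cluster could hand to another job. Under per-second billing there is no such leftover to reuse.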

mrjob audit-emr-usage shows negative run time

This seems to be an issue specific to jobs and clusters that are currently running. Possibly we're using different values of "now", and the script is running a long time?

coyotemarin

Bug
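
If the cause really is differing values of "now", one possible guard is to capture the current time once and reuse it for every running job or cluster. This is a sketch under that assumption only: `run_time` is a hypothetical helper, not the audit script's actual code.

``` python
from datetime import datetime, timedelta, timezone


def run_time(started, ended, now):
    """Elapsed time for a job; `ended` is None while it is still running.

    `now` is a single snapshot shared by all calls, so running jobs are
    all measured against the same instant.
    """
    end = ended if ended is not None else now
    # clamp at zero in case `started` is later than our snapshot
    return max(end - started, timedelta(0))


# take the snapshot once, before iterating over jobs and clusters
now = datetime.now(timezone.utc)
```

The clamp hides a negative value rather than explaining it, so it may be better to log such cases than to silently zero them.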

fetch bootstrap logs via SSH

1

Currently, we fetch bootstrap logs from S3 only. This means we have to wait an extra few minutes for the cluster to terminate before we can find out why bootstrapping...

coyotemarin

Feature

Ability to specify protocols for a specific step via MRStep

3

It would be great to be able to explicitly assign an input/output/internal protocol to a specific step in the `Job.steps()` method. For example: ``` python def steps(self): return [MRStep(mapper=None, reducer=None, output_protocol=FooBar),...

tarnfeld

Feature
