mrjob
allow mrjob to use Java mappers/reducers for some steps
It would be nice to be able to use some of the built-in mappers/reducers from Java for efficiency reasons (e.g. org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer).
Probably would look something like this:

```python
def steps(self):
    return [self.mr(
        mapper=...,
        reducer_java_class='org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer')]
```
(This is not the same thing as using JARs as steps; see #770 for that.)
It's not only for efficiency reasons; this would also allow reusing existing legacy or third-party Java steps without having to rewrite them in Python.
True.
Cross-referencing this with #378, which is another kind of non-Hadoop Streaming step.
AFAICT it's non-trivial to accomplish this through the EMR API, basically because the Hadoop standard mappers and reducers don't expose a usable interface right out of the box. The best approach I can think of involves writing a separate Java class for each desired mapper/reducer that wraps it for use with the Streaming API, then either bundling those classes as a JAR or compiling them on the nodes as a bootstrap action.
I'm quite possibly missing something, though.
Haven't tried it, but the Hadoop Streaming docs have an example of mixing a Python script with `aggregate` that seems pretty straightforward. It looks like keys can be whatever, and the values just have to be in some sort of recognizable numeric format; `JSONProtocol` and `ReprProtocol` would both work.
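For reference, the Hadoop Streaming aggregate example has the mapper emit records of the form `<aggregator>:<key>\t<value>`, where the aggregator name (e.g. `LongValueSum`) selects one of the built-in value aggregators. A minimal sketch of such a mapper (function name is illustrative, not part of mrjob):

```python
# Sketch: a streaming mapper emitting records in the format Hadoop's
# aggregate package expects, i.e. "<aggregator>:<key>\t<value>".
# 'LongValueSum' is one of the built-in value aggregators documented
# in the Hadoop Streaming docs.
def map_line(line):
    """Emit one 'LongValueSum:<word>\t1' record per word in the line."""
    return ['LongValueSum:%s\t1' % word for word in line.split()]

# In an actual streaming job this would be driven by reading sys.stdin
# and printing each record, then run with something like:
#   -mapper <this script> -reducer aggregate
```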
You're probably right.
Done! ^_^
Reopening this, because I don't think we ever did this (though it looks like you can accomplish the same thing using `*_cmd` with the class name as your "command").
I'd call these options `mapper_class` etc. I think at some point, we're going to want `mrjob.compat` to be able to rewrite class packages (e.g. use `mapreduce` for newer Hadoop rather than `mapred`).
Wow, this issue has been around a while! Should be pretty straightforward.
It's not clear to me if anything other than "aggregate" works here. Might not be worth it, since this is already covered by `reducer_cmd`.
Pretty sure this can be handled with `reducer_cmd()`, but it would be good to have a working example.
A description of how to use the aggregate package is here: https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html#Hadoop_Aggregate_Package
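For intuition, the `LongValueSum` aggregator from that package behaves roughly like the following pure-Python sketch (this simulates the Java reducer's behavior; the function name is illustrative, not part of any library):

```python
from collections import defaultdict

def aggregate_reduce(records):
    """Simulate Hadoop's LongValueSum aggregator: sum values per key.

    `records` are (key, value) pairs where the key looks like
    'LongValueSum:<real key>', as emitted by an aggregate-format mapper.
    """
    totals = defaultdict(int)
    for key, value in records:
        aggregator, _, real_key = key.partition(':')
        if aggregator == 'LongValueSum':
            totals[real_key] += int(value)
    return dict(totals)
```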