
allow mrjob to use Java mappers/reducers for some steps

Open coyotemarin opened this issue 14 years ago • 11 comments

It would be nice to be able to use some of the built-in mappers/reducers from Java for efficiency reasons (e.g. org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer).

It would probably look something like this:

def steps(self):
    return [self.mr(mapper=..., reducer_java_class='org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer')]

(This is not the same thing as using JARs as steps; see #770 for that.)

coyotemarin avatar Oct 23 '10 21:10 coyotemarin

Not only efficiency reasons here, but also the ability to reuse existing legacy or third party Java steps without having to rewrite them in Python.

tigra avatar Nov 05 '10 01:11 tigra

True.

coyotemarin avatar Nov 05 '10 16:11 coyotemarin

Cross-referencing this with #378, which is another kind of non-Hadoop Streaming step.

irskep avatar Mar 30 '12 22:03 irskep

AFAICT it's non-trivial to accomplish this through the EMR API, basically because the Hadoop standard mappers and reducers don't expose a usable interface right out of the box. The best approach I can think of involves writing a separate Java class for each desired mapper/reducer that wraps it for use with the Streaming API, then either bundling those classes as a JAR or compiling them on the nodes as a bootstrap action.

I'm quite possibly missing something, though.

slingamn avatar Jul 08 '12 08:07 slingamn

Haven't tried it, but the Hadoop Streaming docs have an example of mixing a Python script with the aggregate package that seems pretty straightforward. It looks like keys can be anything, and the values just have to be in a recognizable numeric format; JSONProtocol and ReprProtocol would both work.
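For reference, the Hadoop Streaming docs describe the line format the aggregate package expects from a streaming mapper: each output line is "aggregator:key<TAB>value" (e.g. LongValueSum to sum long values per key). A minimal sketch of a mapper producing that format:

```python
# Sketch of the mapper-output format the Hadoop aggregate package expects:
# "<aggregator>:<key>\t<value>". When the job's reducer is "aggregate",
# Hadoop's ValueAggregatorReducer applies the named function (here, a sum).
def aggregate_mapper_lines(words):
    """Emit one LongValueSum record per word."""
    for word in words:
        yield "LongValueSum:%s\t1" % word

lines = list(aggregate_mapper_lines(["cat", "dog", "cat"]))
```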

coyotemarin avatar Jul 09 '12 22:07 coyotemarin

You're probably right.

slingamn avatar Jul 09 '12 23:07 slingamn

Done! ^_^

irskep avatar Aug 22 '12 22:08 irskep

Reopening this, because I don't think we ever did this (though it looks like you can accomplish the same thing using *_cmd with the class name as your "command").

I'd call these options mapper_class etc. I think at some point, we're going to want mrjob.compat to be able to rewrite class packages (e.g. use the mapreduce package on newer versions of Hadoop rather than mapred).
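A package-rewriting helper of the kind described might look something like this. This is a hypothetical sketch, not part of mrjob.compat; the function name, version cutoff, and prefix mapping are all assumptions for illustration:

```python
# Hypothetical mrjob.compat-style helper that rewrites the legacy
# org.apache.hadoop.mapred.* package prefix to the newer mapreduce API
# on Hadoop 2.x+. The (2, 0) cutoff is an assumption for illustration.
def translate_class(java_class, hadoop_version):
    old_prefix = 'org.apache.hadoop.mapred.'
    new_prefix = 'org.apache.hadoop.mapreduce.'
    if hadoop_version >= (2, 0) and java_class.startswith(old_prefix):
        return new_prefix + java_class[len(old_prefix):]
    return java_class
```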

coyotemarin avatar Sep 18 '13 20:09 coyotemarin

Wow, this issue has been around a while! Should be pretty straightforward.

coyotemarin avatar Nov 15 '13 21:11 coyotemarin

It's not clear to me if anything other than "aggregate" works here. Might not be worth it, since this is already covered by reducer_cmd.

coyotemarin avatar Apr 23 '16 23:04 coyotemarin

Pretty sure this can be handled with reducer_cmd(), but it would be good to have a working example.

A description of how to use the aggregate package is here: https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html#Hadoop_Aggregate_Package
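Pending a real reducer_cmd() example, the data flow such an example would exercise can be simulated locally. This is a sketch only: in a real job the reducer would be Hadoop's aggregate package (roughly, a step with reducer_cmd set to "aggregate"), and the pure-Python reducer below just mimics what it does with LongValueSum records:

```python
from collections import defaultdict

# Local simulation of a streaming word count that delegates summing to
# Hadoop's aggregate package. The mapper emits "LongValueSum:<word>\t1"
# lines; the reducer sketch mimics ValueAggregatorReducer's handling of
# LongValueSum records by summing values per key.
def mapper(line):
    for word in line.split():
        yield "LongValueSum:%s\t1" % word

def simulated_aggregate_reducer(lines):
    totals = defaultdict(int)
    for line in lines:
        key, value = line.split("\t")
        func, _, name = key.partition(":")
        if func == "LongValueSum":
            totals[name] += int(value)
    return dict(totals)

counts = simulated_aggregate_reducer(mapper("the cat saw the dog"))
```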

coyotemarin avatar Sep 08 '18 00:09 coyotemarin