mrjob
allow mrjob to use Java mappers/reducers for some steps
It would be nice to be able to use some of the built-in mappers/reducers from Java for efficiency reasons (e.g. org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer).
Probably would look something like this:

```python
def steps(self):
    return [self.mr(
        mapper=...,
        reducer_java_class='org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorReducer')]
```
(This is not the same thing as using JARs as steps; see #770 for that.)
It's not only for efficiency reasons; this would also allow reusing existing legacy or third-party Java steps without having to rewrite them in Python.
True.
Cross-referencing this with #378, which is another kind of non-Hadoop Streaming step.
AFAICT it's non-trivial to accomplish this through the EMR API, basically because the Hadoop standard mappers and reducers don't expose a usable interface right out of the box. The best approach I can think of involves writing a separate Java class for each desired mapper/reducer that wraps it for use with the Streaming API, then either bundling those classes as a JAR or compiling them on the nodes as a bootstrap action.
I'm quite possibly missing something, though.
Haven't tried it, but the Hadoop Streaming docs have an example of mixing a Python script with `aggregate` that seems pretty straightforward. It looks like keys can be whatever, and the values just have to be in some sort of recognizable numeric format; `JSONProtocol` and `ReprProtocol` would both work.
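For reference, the Hadoop Streaming aggregate example has the mapper emit records of the form `<aggregator>:<key>\t<value>`, where the aggregator name (e.g. `LongValueSum`) selects one of the built-in value aggregators. A minimal sketch of such a mapper (function name is illustrative, not part of mrjob):

```python
# Sketch: a streaming mapper emitting records in the format Hadoop's
# aggregate package expects, i.e. "<aggregator>:<key>\t<value>".
# 'LongValueSum' is one of the built-in value aggregators documented
# in the Hadoop Streaming docs.
def map_line(line):
    """Emit one 'LongValueSum:<word>\t1' record per word in the line."""
    return ['LongValueSum:%s\t1' % word for word in line.split()]

# In an actual streaming job this would be driven by reading sys.stdin
# and printing each record, then run with something like:
#   -mapper <this script> -reducer aggregate
```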
You're probably right.
Done! ^_^
Reopening this, because I don't think we ever did this (though it looks like you can accomplish the same thing using `*_cmd` with the class name as your "command").
I'd call these options `mapper_class` etc. I think at some point, we're going to want `mrjob.compat` to be able to rewrite class packages (e.g. use `mapreduce` for newer Hadoop rather than `mapred`).
Wow, this issue has been around a while! Should be pretty straightforward.
It's not clear to me if anything other than "aggregate" works here. Might not be worth it, since this is already covered by `reducer_cmd`.
Pretty sure this can be handled with `reducer_cmd()`, but it would be good to have a working example.
A description of how to use the aggregate package is here: https://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html#Hadoop_Aggregate_Package
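For intuition, the `LongValueSum` aggregator from that package behaves roughly like the following pure-Python sketch (this simulates the Java reducer's behavior; the function name is illustrative, not part of any library):

```python
from collections import defaultdict

def aggregate_reduce(records):
    """Simulate Hadoop's LongValueSum aggregator: sum values per key.

    `records` are (key, value) pairs where the key looks like
    'LongValueSum:<real key>', as emitted by an aggregate-format mapper.
    """
    totals = defaultdict(int)
    for key, value in records:
        aggregator, _, real_key = key.partition(':')
        if aggregator == 'LongValueSum':
            totals[real_key] += int(value)
    return dict(totals)
```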