mrjob attach default hadoop formats/jobconf to protocols?

Open coyotemarin opened this issue 11 years ago • 1 comments

Protocols should be allowed to have HADOOP_*_FORMAT and JOBCONF fields, as well as hadoop_*_format() and jobconf() methods, which supply defaults if something is not already specified for that step. That way, they can be made to do specific useful things on their own (e.g. read binary data from a sequence file).

For jobconfs, we just combine them, with step-specific jobconf taking priority.
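A minimal sketch of that merge, using plain dicts rather than anything in mrjob itself. The function name `combine_jobconfs` is hypothetical, and the property values are made up for illustration.

```python
def combine_jobconfs(protocol_jobconf, step_jobconf):
    """Merge protocol-level defaults with a step's jobconf.

    The step's entries are applied last, so they override protocol defaults.
    """
    combined = dict(protocol_jobconf or {})
    combined.update(step_jobconf or {})
    return combined


# Hypothetical defaults supplied by a protocol
protocol_defaults = {
    'mapreduce.job.reduces': '1',
    'mapreduce.map.memory.mb': '1024',
}
# Hypothetical step-specific jobconf
step_jobconf = {'mapreduce.job.reduces': '4'}

merged = combine_jobconfs(protocol_defaults, step_jobconf)
```

Keys set only by the protocol survive the merge; keys set by both take the step's value.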

With input/output formats, it gets trickier. Say we're looking at the first step in the job. In decreasing level of precedence, it seems like it would make the most sense to pick input format based on:

step.hadoop_input_format
job.hadoop_input_format()
job.HADOOP_INPUT_FORMAT
input_protocol.hadoop_input_format()
input_protocol.HADOOP_INPUT_FORMAT
the default (None)

(input_protocol is either step.input_protocol, job.input_protocol(), job.INPUT_PROTOCOL or the default, RawValueProtocol)

What sucks about this is we can't just pick an input format for the step and then combine it with information about the job, because whether it takes precedence over job.HADOOP_INPUT_FORMAT depends on whether it was set explicitly, or derived from the step's input protocol.
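To make the problem concrete, here is a rough sketch of that first ordering as a plain function. Nothing here is mrjob code: `pick_by_first_ordering` and its parameter names are hypothetical, and each argument stands for one of the sources listed above.

```python
SEQ_FORMAT = 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
TEXT_FORMAT = 'org.apache.hadoop.mapred.TextInputFormat'


def pick_by_first_ordering(step_explicit=None, job_method=None, job_attr=None,
                           protocol_method=None, protocol_attr=None):
    """Return the first non-None candidate, highest precedence first."""
    for candidate in (step_explicit, job_method, job_attr,
                      protocol_method, protocol_attr):
        if candidate is not None:
            return candidate
    return None  # the default


# The step sets nothing explicitly, but its protocol suggests a format.
# The job-level attribute still wins, so the protocol's suggestion cannot be
# folded into "the step's format" before the job is consulted.
chosen = pick_by_first_ordering(job_attr=TEXT_FORMAT, protocol_attr=SEQ_FORMAT)

# An explicit step setting, by contrast, outranks the job.
explicit = pick_by_first_ordering(step_explicit=SEQ_FORMAT, job_attr=TEXT_FORMAT)
```

The two calls differ only in where the sequence-file format came from, which is why the resolver has to keep the explicit and protocol-derived values separate.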

Maybe it should look more like this:

step.hadoop_input_format
step.input_protocol.hadoop_input_format()
step.input_protocol.HADOOP_INPUT_FORMAT
job.hadoop_input_format()
job.HADOOP_INPUT_FORMAT
job.input_protocol().hadoop_input_format()
job.input_protocol().HADOOP_INPUT_FORMAT
job.INPUT_PROTOCOL.hadoop_input_format()
job.INPUT_PROTOCOL.HADOOP_INPUT_FORMAT
the default (None)

This isn't actually as complicated as it looks; we just give first priority to the step definition, and use information from the job to fill in anything that's missing.
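A sketch of that step-first lookup, assuming the proposed fields existed. The classes `SequenceFileProtocol`, `PlainProtocol` and `FakeJob` are hypothetical stand-ins, the step is modeled as a plain dict, and `job.INPUT_PROTOCOL` is assumed to be a class that gets instantiated before it is asked for a format.

```python
SEQ_FORMAT = 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
TEXT_FORMAT = 'org.apache.hadoop.mapred.TextInputFormat'


class SequenceFileProtocol:
    """Hypothetical protocol that carries a default input format."""
    HADOOP_INPUT_FORMAT = SEQ_FORMAT

    def hadoop_input_format(self):
        return None  # defer to the class attribute


class PlainProtocol:
    """Hypothetical protocol with no opinion about the input format."""
    HADOOP_INPUT_FORMAT = None

    def hadoop_input_format(self):
        return None


class FakeJob:
    """Hypothetical job that sets a job-wide input format."""
    HADOOP_INPUT_FORMAT = TEXT_FORMAT
    INPUT_PROTOCOL = PlainProtocol

    def hadoop_input_format(self):
        return None

    def input_protocol(self):
        return self.INPUT_PROTOCOL()


def format_from_protocol(protocol):
    """Ask a protocol for a format: method first, then class attribute."""
    if protocol is None:
        return None
    fmt = protocol.hadoop_input_format()
    if fmt is not None:
        return fmt
    return protocol.HADOOP_INPUT_FORMAT


def resolve_input_format(step, job):
    """Step definition first; the job only fills in what is missing."""
    candidates = (
        lambda: step.get('hadoop_input_format'),
        lambda: format_from_protocol(step.get('input_protocol')),
        lambda: job.hadoop_input_format(),
        lambda: job.HADOOP_INPUT_FORMAT,
        lambda: format_from_protocol(job.input_protocol()),
        lambda: format_from_protocol(job.INPUT_PROTOCOL()),
    )
    for candidate in candidates:
        fmt = candidate()
        if fmt is not None:
            return fmt
    return None  # the default


job = FakeJob()
from_step_protocol = resolve_input_format(
    {'input_protocol': SequenceFileProtocol()}, job)
from_job = resolve_input_format({}, job)
```

Here the step's protocol outranks the job-wide format, and a step that says nothing falls back to the job's setting.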

Nov 08 '13 19:11 coyotemarin

This is probably not worth the complication; better to use manifests to read arbitrary binary files (see #754).

Feb 22 '18 21:02 coyotemarin
