mrjob
attach default hadoop formats/jobconf to protocols?
Protocols should be allowed to have `HADOOP_*_FORMAT` and `JOBCONF` fields, as well as `hadoop_*_format()` and `jobconf()` methods, which supply defaults if something is not already specified for that step. That way, they can be made to do specific useful things on their own (e.g. read binary data from a sequence file).
For jobconfs, we just combine them, with step-specific jobconf taking priority.
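A minimal sketch of that merge, assuming jobconfs are plain dicts. The `combine_jobconfs` helper and the example property values are hypothetical, not part of mrjob:

```python
def combine_jobconfs(protocol_jobconf, step_jobconf):
    """Merge two jobconf dicts; step-specific values win on conflict."""
    combined = dict(protocol_jobconf or {})
    combined.update(step_jobconf or {})
    return combined


# Defaults a protocol might supply (illustrative values only).
protocol_defaults = {
    'mapreduce.job.reduces': '1',
    'mapreduce.input.fileinputformat.split.minsize': '134217728',
}
# What the step itself asks for.
step_jobconf = {'mapreduce.job.reduces': '4'}

merged = combine_jobconfs(protocol_defaults, step_jobconf)
```

Here `merged` keeps the protocol's split size but uses the step's reducer count, and neither input dict is modified.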
With input/output formats, it gets trickier. Say we're looking at the first step in the job. In decreasing order of precedence, it seems like it would make the most sense to pick the input format based on:
- `step.hadoop_input_format`
- `job.hadoop_input_format()`
- `job.HADOOP_INPUT_FORMAT`
- `input_protocol.hadoop_input_format()`
- `input_protocol.HADOOP_INPUT_FORMAT`
- the default (`None`)
(`input_protocol` is either `step.input_protocol`, `job.input_protocol()`, `job.INPUT_PROTOCOL` or the default, `RawValueProtocol`)
What sucks about this is we can't just pick an input format for the step and then combine it with information about the job, because whether the step's format takes precedence over `job.HADOOP_INPUT_FORMAT` depends on whether it was set explicitly or derived from the step's input protocol.
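The first ordering can be sketched roughly as follows. Everything here is a stand-in written for illustration: the `Job`, `RawValueProtocol` and `SequenceFileProtocol` classes only mimic the fields and methods named above, `first_set` and `resolve_job_first` are hypothetical helpers, and steps are modeled as simple namespaces.

```python
from types import SimpleNamespace

SEQ = 'org.apache.hadoop.mapred.SequenceFileInputFormat'
NLINE = 'org.apache.hadoop.mapred.lib.NLineInputFormat'


class RawValueProtocol:
    """Stand-in for the default protocol; it declares no input format."""
    HADOOP_INPUT_FORMAT = None

    def hadoop_input_format(self):
        return self.HADOOP_INPUT_FORMAT


class SequenceFileProtocol(RawValueProtocol):
    """Hypothetical protocol that asks for sequence-file input."""
    HADOOP_INPUT_FORMAT = SEQ


class Job:
    """Stand-in for a job class, with the fields and methods named above."""
    HADOOP_INPUT_FORMAT = None
    INPUT_PROTOCOL = RawValueProtocol

    def hadoop_input_format(self):
        return self.HADOOP_INPUT_FORMAT

    def input_protocol(self):
        return self.INPUT_PROTOCOL()


def first_set(*candidates):
    """Return the first candidate that is not None, else None."""
    return next((c for c in candidates if c is not None), None)


def resolve_job_first(step, job):
    # job.input_protocol() already falls back to job.INPUT_PROTOCOL,
    # which defaults to RawValueProtocol.
    protocol = first_set(step.input_protocol, job.input_protocol())
    return first_set(
        step.hadoop_input_format,
        job.hadoop_input_format(),
        job.HADOOP_INPUT_FORMAT,
        protocol.hadoop_input_format(),
        protocol.HADOOP_INPUT_FORMAT,
    )


class NLineJob(Job):
    HADOOP_INPUT_FORMAT = NLINE


explicit_step = SimpleNamespace(hadoop_input_format=SEQ, input_protocol=None)
derived_step = SimpleNamespace(hadoop_input_format=None,
                               input_protocol=SequenceFileProtocol())

explicit_result = resolve_job_first(explicit_step, NLineJob())
derived_result = resolve_job_first(derived_step, NLineJob())
```

Both steps "want" the sequence-file format, but only the explicit one gets it: `explicit_result` is `SEQ` while `derived_result` is `NLINE`, because the job-level setting outranks anything derived from a protocol. Collapsing each step to a single format up front would lose exactly that distinction.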
Maybe it should look more like this:
- `step.hadoop_input_format`
- `step.input_protocol.hadoop_input_format()`
- `step.input_protocol.HADOOP_INPUT_FORMAT`
- `job.hadoop_input_format()`
- `job.HADOOP_INPUT_FORMAT`
- `job.input_protocol().hadoop_input_format()`
- `job.input_protocol().HADOOP_INPUT_FORMAT`
- `job.INPUT_PROTOCOL.hadoop_input_format()`
- `job.INPUT_PROTOCOL.HADOOP_INPUT_FORMAT`
- the default (`None`)
This isn't actually as complicated as it looks; we just give first priority to the step definition, and use information from the job to fill in anything that's missing.
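A rough sketch of that step-first ordering, resolved in two phases. As before, the classes are stand-ins that only mimic the names used above, and `first_set`, `protocol_format` and `resolve_step_first` are hypothetical helpers. The sketch also assumes `INPUT_PROTOCOL` holds a class that has to be instantiated before its method is called.

```python
from types import SimpleNamespace

SEQ = 'org.apache.hadoop.mapred.SequenceFileInputFormat'
NLINE = 'org.apache.hadoop.mapred.lib.NLineInputFormat'


class RawValueProtocol:
    """Stand-in for the default protocol; it declares no input format."""
    HADOOP_INPUT_FORMAT = None

    def hadoop_input_format(self):
        return self.HADOOP_INPUT_FORMAT


class SequenceFileProtocol(RawValueProtocol):
    """Hypothetical protocol that asks for sequence-file input."""
    HADOOP_INPUT_FORMAT = SEQ


class Job:
    """Stand-in for a job class, with the fields and methods named above."""
    HADOOP_INPUT_FORMAT = None
    INPUT_PROTOCOL = RawValueProtocol

    def hadoop_input_format(self):
        return self.HADOOP_INPUT_FORMAT

    def input_protocol(self):
        return self.INPUT_PROTOCOL()


def first_set(*candidates):
    """Return the first candidate that is not None, else None."""
    return next((c for c in candidates if c is not None), None)


def protocol_format(protocol):
    """Format a protocol asks for: method first, then field."""
    if protocol is None:
        return None
    return first_set(protocol.hadoop_input_format(),
                     protocol.HADOOP_INPUT_FORMAT)


def resolve_step_first(step, job):
    # Phase 1: everything the step definition says, explicit or derived.
    step_level = first_set(step.hadoop_input_format,
                           protocol_format(step.input_protocol))
    # Phase 2: the job fills in whatever the step left unset.
    job_level = first_set(job.hadoop_input_format(),
                          job.HADOOP_INPUT_FORMAT,
                          protocol_format(job.input_protocol()),
                          protocol_format(job.INPUT_PROTOCOL()))
    return first_set(step_level, job_level)


class NLineJob(Job):
    HADOOP_INPUT_FORMAT = NLINE


derived_step = SimpleNamespace(hadoop_input_format=None,
                               input_protocol=SequenceFileProtocol())
bare_step = SimpleNamespace(hadoop_input_format=None, input_protocol=None)

derived_result = resolve_step_first(derived_step, NLineJob())
bare_result = resolve_step_first(bare_step, NLineJob())
default_result = resolve_step_first(bare_step, Job())
```

With this ordering the step's protocol wins over the job-level setting (`derived_result` is `SEQ`), a step that says nothing inherits from the job (`bare_result` is `NLINE`), and with nothing set anywhere the result is the default, `None`.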
This is probably not worth the complication; better to use manifests to read arbitrary binary files (see #754).