dumbo
new "parallel unix" backend for running jobs over multiple local processors
I wrote this backend to enable local dumbo jobs to leverage multiple processor cores.
Minimal usage example, which will run 4 mappers in parallel and then run 4 reducers:

dumbo -input [infiles] -output [outpath] -punix yes -tmpdir [tmppath] -nmappers 4 -nreducers 4
Along the way, I also added a few additional options and features.
These new options are backend-agnostic:
- inputfile: used instead of "input". Helpful when the list of input files is too long for the command-line buffer, i.e. for big jobs (see the example after this list)
- shell: the shell to use for executing commands instead of the default '/bin/sh'
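For example, a hypothetical invocation combining these options (the paths are placeholders) might look like:

dumbo -inputfile [pathlist] -output [outpath] -shell /bin/bash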
These are specific to the new parallel unix backend:
- punix: flag to enable this backend
- nmappers: the number of mappers to run at the same time
- nreducers: the number of reducers to use, which all run simultaneously
- tmpdir: local file system path to store temporary files
- permapper: the number of input files handled by each mapper process (optional). Assigning multiple files per process reduces the rate at which new processes are created, which can help reduce file system load if you are running dumbo on a clustered file system
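To make the intended behaviour a bit more concrete, here is a rough sketch of how nmappers and permapper could drive a local process pool. This is illustrative only, not the patch's actual code; run_mapper, run_mappers, and the "cat ... | python mapper.py" command are made-up placeholders:

```python
import subprocess
from multiprocessing import Pool

def run_mapper(file_group):
    # Run one mapper process over a group of input files
    # (placeholder command; the real backend builds its own pipeline).
    cmd = "cat %s | python mapper.py" % " ".join(file_group)
    subprocess.call(cmd, shell=True)

def run_mappers(input_files, nmappers=4, permapper=1):
    # Chunk the input files into groups of `permapper` files, then let a
    # pool of `nmappers` workers process the groups concurrently.
    groups = [input_files[i:i + permapper]
              for i in range(0, len(input_files), permapper)]
    pool = Pool(processes=nmappers)
    pool.map(run_mapper, groups)
    pool.close()
    pool.join()
```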
Sounds good! Will try to find some time to review soonish.
Two comments:
- Doesn't adhere to the PEP 8 style guide.
- The Hadoop backend already has "nummaptasks" and "numreducetasks" options — maybe it would be better to use the same option names for this backend, instead of "nmappers" and "nreducers"?
Thanks for the comments, Klaas, and your patience -- I'm new to this!
I will fix up the code to comply with PEP 8 and follow up here when I have addressed these issues.
Regarding the option names, I think there is a distinction to be made in this case. My intention with these options was to control the degree of multiprogramming used by the backend. In my mind, there are several reasons to set this separately from the input/output-specific nummaptasks/numreducetasks options used by the Hadoop backend; for example, you might want to split the output across 100 reduce tasks but only process 8 at a time.
Granted, as the code stands, numreducetasks is redundant with nreducers -- but given this discussion, perhaps it would be worthwhile to account for the distinction described above.
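To illustrate that distinction with a sketch (again not actual dumbo code; run_reducer is a made-up placeholder), you could partition the output into numreducetasks buckets while only nreducers of them run at any one time:

```python
from multiprocessing import Pool

def run_reducer(task_id):
    # Placeholder for launching the reducer that handles one output partition.
    return task_id

if __name__ == "__main__":
    numreducetasks = 100   # split the output across 100 reduce tasks...
    nreducers = 8          # ...but only run 8 reducer processes at a time
    pool = Pool(processes=nreducers)
    pool.map(run_reducer, range(numreducetasks))
    pool.close()
    pool.join()
```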
I think new flags with such similar names will introduce confusion.
+1 for reusing the Hadoop option names. I also think the number of processes should be capped at the number of cores (does it make sense to run more processes than there are cores?).
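For example (a sketch, not existing dumbo code; the nmappers value is a placeholder), the cap could be applied with multiprocessing.cpu_count():

```python
from multiprocessing import Pool, cpu_count

nmappers = 16  # whatever the user requested (placeholder value)

# Never start more worker processes than there are cores.
pool = Pool(processes=min(nmappers, cpu_count()))
```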