dumbo
Python module that allows one to easily write and run Hadoop programs.
Whenever a dumbo job is run, this warning appears: `11/07/07 13:23:25 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.`
Running `python setup.py install` gave the following error: Searching for typedbytes Reading https://pypi.python.org/simple/typedbytes/ No local packages or download links found for typedbytes error: Could not find suitable distribution for Requirement.parse('typedbytes') So...
Wiki and project links updated. issue #90
I tried the links from the README file and both of them seem to be dead
I tried to run with the "-fake yes" option, but the job got launched nevertheless. I was using dumbo.Job and looking at the code I don't see where...
I am using Hadoop streaming with -io typedbytes and set mapred.reduce.tasks=2, but I finally got only one output file. And if I set mapred.reduce.tasks=0, then I got many output files....
I wrote this backend to enable local dumbo jobs to leverage multiple processor cores. Minimal usage example, which will run 4 mappers in parallel and then run 4 reducers: dumbo...
To get the source path from within the mapper routine, just add **kwargs to the argument list. Here are some examples. ``` @dumbo.decor.primary def map_primary(key, value, **kwargs): key, value = value.strip().split('\t')...