dumbo
Python module that allows one to easily write and run Hadoop programs.
Does Dumbo support custom input file formats, e.g. WholeFileInputFormat.class, which treats the entire file contents as a single record? I compiled WholeFileInputFormat.java (from Hadoop: The Definitive Guide) and created a...
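For what it's worth, a job built on such an input format might look like the sketch below; the mapper/reducer bodies and the assumption that each record is (file path, full file contents) are mine, not something Dumbo confirms:

```
# wholefile.py - a hedged sketch, not a confirmed Dumbo feature. It assumes the
# compiled WholeFileInputFormat emits (file path, entire file contents) pairs.
def mapper(key, value):
    # value is assumed to hold the whole file as one record
    yield "bytes", len(value)

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)
```

Starting it would presumably also require shipping the compiled jar and naming the class, something along the lines of `dumbo start wholefile.py -input files -output out -inputformat WholeFileInputFormat -libjar wholefile.jar -hadoop $HADOOP_HOME` (the exact class name and the `-libjar` usage here are assumptions).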
By default, the memlimit should be unlimited.
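As a workaround until that changes, the limit can be raised per job by passing `-memlimit` explicitly (it presumably takes a byte count), e.g. `-memlimit 4294967296` for roughly 4 GB.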
Some jobs do not produce any output, for example jobs that upload the input data to external storage. Dumbo expects that each mapper or reducer yields some data, otherwise...
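A minimal sketch of such a job, assuming a map-only `dumbo.run(mapper)` call and with the upload helper stubbed out:

```
# Sketch of a map-only job whose useful work is a side effect (uploading input
# to external storage) and which therefore never yields a record back to Dumbo.
def upload_to_external_storage(key, value):
    # stand-in for a real client (S3, HTTP POST, ...); hypothetical helper
    pass

def mapper(key, value):
    upload_to_external_storage(key, value)
    return
    yield  # unreachable, but keeps the mapper a generator that emits nothing

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper)
```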
The streaming backend assumes that the input format is typedbytes even if the -inputformat argument is 'text': https://github.com/klbostee/dumbo/blob/release-0.21.36/dumbo/backends/streaming.py#L81 This leads to typedbytes.PairedInput being applied to all input lines: https://github.com/klbostee/dumbo/blob/release-0.21.36/dumbo/core.py#L380 Applying util.loadtext instead of typedbytes.PairedInput...
How about adding support for SequenceFiles for local runs? It seems it would just be a matter of adding a SequenceFile decoder/encoder, much like the 'code' format works today.
Sounds like all that's needed is a new backend that talks to the S3 file system and controls EMR job flows (via the boto API). Essential features: - Read input from and write...
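As a point of reference, the legacy boto API already covers the job-flow side; a rough sketch of the plumbing such a backend would wrap (all bucket names, script paths, and instance counts below are placeholders):

```
# Rough sketch of launching a Hadoop Streaming step on EMR via the legacy boto
# API. Everything named here is a placeholder, not part of Dumbo.
from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()  # credentials come from the usual boto configuration

step = StreamingStep(
    name='dumbo job',
    mapper='s3n://example-bucket/scripts/mapper.py',
    reducer='s3n://example-bucket/scripts/reducer.py',
    input='s3n://example-bucket/input/',
    output='s3n://example-bucket/output/',
)

jobflow_id = conn.run_jobflow(
    name='dumbo-on-emr',
    log_uri='s3n://example-bucket/logs/',
    steps=[step],
    num_instances=3,
)
print(jobflow_id)
```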
Hello! I am trying to run a job for our data team and we are getting errors using Dumbo. We are using the latest version of Dumbo and Cloudera. Command...
// my python job
```
def mapper(key, value):
    yield value.split(" ")[0], 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```
// my command (version...
A custom mapper's `cleanup` function is never called when MultiMapper is used.
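A minimal sketch of the setup being described, assuming a class-based mapper whose `cleanup` method yields final records and that `MultiMapper.add()` registers mappers per input-path pattern:

```
# Sketch of the reported scenario: a class-based mapper with a cleanup method,
# registered through MultiMapper. The path pattern and counting logic are
# illustrative only.
from dumbo.lib import MultiMapper

class CountingMapper:
    def __init__(self):
        self.count = 0

    def __call__(self, key, value):
        self.count += 1
        yield key, 1

    def cleanup(self):
        # expected to run once after all input; per the report above it never
        # fires when the mapper is wrapped in a MultiMapper
        yield "total", self.count

if __name__ == "__main__":
    import dumbo
    multimapper = MultiMapper()
    multimapper.add("logs", CountingMapper())  # "logs" pattern is assumed
    dumbo.run(multimapper)
```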
If I write a map function with the alternative low-level single-parameter interface, then give it to `MultiMapper`:
```
import dumbo
from dumbo.lib import MultiMapper
from dumbo.decor import primary

@primary
def...
```
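For comparison, the single-parameter style on its own looks roughly like this (a sketch assuming such a mapper receives the whole (key, value) iterator rather than one pair at a time; whether it also needs the `@primary` decoration is not shown here):

```
# Sketch of the low-level single-parameter (iterator-in, iterator-out) mapper
# style referenced above, shown standalone; whether MultiMapper accepts it is
# exactly what the report questions.
def mapper(data):
    for key, value in data:
        yield value.split(" ")[0], 1

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper)
```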