appengine-mapreduce
Multiple output files
I've tested the demo app, and when I run one of the three mapreduces, the output is split across several files in the Blobstore. Is there a way to get all of the output in a single file?
On the main page, after you run the mapreduces and follow the output links, only the content of one of those files is shown. I don't think it is correct to show only part of the output.
I've done more testing with this library and confirmed that, with enough data, the output is divided into different files (one per shard).
In the wiki, in the GoogleCloudStorageOutputWriter section, I read: "These segs live in a tmp directory and should be combined and renamed to the final location. In current impl, they are not combined.". Does that refer to what I've just described?
- The same documentation, a little above that passage, does mention that there will be one file per shard.
- In the current implementation, if `GoogleCloudStorageOutputWriter` is constructed with `_NO_DUPLICATE=True` in the writer spec, you may also get multiple files per shard; otherwise, only one file per shard should be created.
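
Since the shard files are not combined by the library, one workaround is to concatenate them yourself once the job has finished. The following is only a minimal sketch, assuming the per-shard outputs have already been copied to local files and that plain byte concatenation in shard order is acceptable for your output format. The function name `combine_shard_outputs` and the file paths are hypothetical and are not part of the appengine-mapreduce API.

```python
import shutil


def combine_shard_outputs(shard_paths, dest_path):
    """Concatenate per-shard output files into one file.

    shard_paths: paths to the per-shard files, already in the desired order
                 (hypothetical local copies of the mapreduce outputs).
    dest_path:   path of the single combined file to create.
    """
    with open(dest_path, "wb") as dest:
        for path in shard_paths:
            with open(path, "rb") as src:
                # Stream each shard into the destination without
                # loading the whole file into memory.
                shutil.copyfileobj(src, dest)
    return dest_path
```

For example, `combine_shard_outputs(["shard-0", "shard-1"], "combined")` writes the bytes of `shard-0` followed by the bytes of `shard-1` into `combined`. The same idea should carry over to reading and writing the files through whatever storage client you use, but that part is left out here.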