appengine-mapreduce
Any plan for the Go version?
We're planning to build some features based on appengine-mapreduce. Just wanted to know if there's any plan for a Go version, or if it's left to the community to contribute.
If there's a Go version coming, we might hold off until it gets released; otherwise we'll just use the Python version instead.
Thank you.
There aren't currently any plans. If someone wants to build a Go version, I would be happy to advise. It would not be as complex as the Java or Python versions, as the API it would be written against is higher level than what the Java and Python versions had to work with. Of course, using Python is going to be quicker than writing a client in Go.
We've decided to use the Python version for now. We're not quite satisfied with the performance, especially when using urlfetch for external resources, but we just don't have enough time to build a Go version of the mapreduce framework under heavy delivery pressure. We might rewrite our current code base in Go and contribute some code to the framework when we have extra staff and time in the future.
If you want higher performance immediately, the Java version is considerably faster.
@tkaitchuck Perhaps a bit unrelated, but was there any thought put forward about pulling the ShufflePipeline out in such a way that a Java module in a dedicated task queue could handle the shuffle using Java's in-memory optimizations?
The Java parts are already available; see ShufflerServlet and its setup. That should be enough, though it would also be nice to have the Python side available. Maybe @AngryBrock can comment on that.
@aozarov Whoa! That flew right under my radar!
As someone who has never launched a Java App Engine module before, is there any guide for deploying something like this and integrating it with the Python MR API?
> setup...

We should make it easier, or at least document it, but it should be something like this:

- Download the [AE Java SDK](https://cloud.google.com/appengine/downloads).
- Get the MR code.
- Install Ant.
- `cd appengine-mapreduce/java/`
- `ant compile_example`
- `cd example`
- `mkdir shuffler/WEB-INF/lib`
- `cp default/WEB-INF/lib/* shuffler/WEB-INF/lib/`
- Replace the application name with your application name in `shuffler/WEB-INF/appengine-web.xml` and optionally update the version.
- Upload with `./appengine-java-sdk/bin/appcfg.sh update shuffler` to the desired module and version.
Read the ShufflerServlet code to understand its input and output expectations, but I hope @AngryBrock will be able to contribute their work on creating the Python glue side.
There is no work planned to make this available on the Python side at this time.
In case anyone is interested, I started working on a Go implementation this week and have the basic pieces working to split and iterate over the datastore using the same task & slice execution logic. When the code is in a neater state and the API is a bit more stable I'll make the repo public and post a link here.
My motivation was the slow execution and frequent memory issues encountered when using the Python version (and I'm just not interested in running Java). Running it on Go is significantly faster and memory is no longer a problem: instead of ~20 items at a time and lots of OOM errors, it can process thousands of entities quite happily with very little memory on an F1 instance.
CaptainCodeman: It might help you to utilize the Java shuffler servlet: https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/java/src/main/java/com/google/appengine/tools/mapreduce/servlets/ShufflerServlet.java If you look at the tests, they will show how to call it. That would allow you to just write the map stage and get a full-blown MR up and running quickly. The Java shuffler is already very efficient and scales well.
Thanks for the tip @tkaitchuck, I'll definitely check that out. I was hoping to keep it end-to-end in Go but we'll see how far I get with things. My main focus initially is the datastore mapping part but I want to make it support the other pieces if I can.
A couple of questions for you while you're 'here':
There's quite a bit of code duplication in the Python version; am I right in assuming that the src/mapreduce/api/map_job code is intended to be a refactored API but isn't fully used right now? I've been working from the code in the root src/mapreduce folder, as that seems to be where most of the commit action has been.
Also, it looks like a mapreduce job can be used within the Pipeline API but doesn't actually use the Pipeline API itself to execute (so many things with the same names!). Is that right? I have just been looking at the Pipeline API, wondering if it should be used for all the task-running magic (did mapreduce exist before the Pipeline API?).
Thanks!
Yes. We have for a long time been using the Java shuffler servlet for doing large Python jobs, because the Java shuffler is so much better. (And it is meant to inter-operate across languages, as it can be deployed as-is in its own module, so your app can remain pure Go.)
Yes. You are correct about the code duplication. The other directory was a large refactoring that was never fully finished. It has a nicer API but isn't up to snuff.
Yes. MR and Pipelines interact quite a bit. Pipelines came first, but it doesn't really scale to the level of parallelism required to use it directly for MR without doing some fancy tricks; hence the need to canonicalize those tricks in MR. It does, however, work, and work well, so it is very useful for doing the high-level coordination needed to string multiple jobs together, or even just a map-shuffle-reduce.
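For anyone newer to the Python library, that high-level coordination looks roughly like the bundled word-count demo: a parent pipeline simply yields a MapreducePipeline. Below is a minimal sketch assuming Datastore input and Blobstore output; the my_module.word_count_map / my_module.word_count_reduce handler paths, the entity kind, and the shard count are all placeholders to adapt.

```python
from mapreduce import base_handler
from mapreduce import mapreduce_pipeline


class WordCountJob(base_handler.PipelineBase):
    """Parent pipeline coordinating one map-shuffle-reduce run."""

    def run(self, entity_kind):
        # "my_module.word_count_map" / "my_module.word_count_reduce" are
        # placeholder handler paths; point them at your own functions.
        output = yield mapreduce_pipeline.MapreducePipeline(
            "word_count",
            "my_module.word_count_map",
            "my_module.word_count_reduce",
            "mapreduce.input_readers.DatastoreInputReader",
            "mapreduce.output_writers.BlobstoreOutputWriter",
            mapper_params={"entity_kind": entity_kind},
            reducer_params={"mime_type": "text/plain"},
            shards=16)
        # `output` resolves to the reducer's output filenames; yield further
        # pipelines here to chain additional jobs onto this one.
```

Kicking it off is just `WordCountJob("MyKind").start()`, after which the Pipeline API handles the fan-out and bookkeeping.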
Perfect, thanks for the help!
Sorry for the lack of progress; I had to work on other things. But for anyone interested in a Go version of the mapper, I've just published what I've been working on here:
https://github.com/CaptainCodeman/datastore-mapper
My original plans for it have changed somewhat, as I'm now using BigQuery more for reporting. This means I don't really need the shuffle or reduce phases, or the pipeline, so instead of trying to create a direct Go implementation of the whole generic map-reduce framework, I've focused more on the datastore mapper.
This can be used for things like datastore schema migrations, light (counter-based) aggregations, and exporting to BigQuery, either directly or via Cloud Storage. I know the datastore admin exists for that, but it's still rather slow, and I wanted to be able to control the format and use my Go app models.
Any feedback or suggestions are welcome.
For anyone stumbling on this and looking for an interface from Python MR to the Java shuffler: I copied over the code for the delivered MapreducePipeline and substituted in this custom shuffler pipeline:
```python
import json
import logging
import os

# Assumed imports: adjust paths to match how these libraries are vendored in your app.
import cloudstorage as gcs
from google.appengine.api import app_identity, modules, taskqueue
from pipeline import Pipeline

log = logging.getLogger(__name__)


class JavaShufflePipeline(Pipeline):
    async = True

    def run(self, job_name, mapper_params, filenames, shards=None):
        # TODO: task payload will blow up if the filenames list is too long
        # (can only be like 100k)
        bucket_name = mapper_params["bucket_name"]
        queue_name = os.environ.get("HTTP_X_APPENGINE_QUEUENAME")

        # Build the shufflerServlet call.
        shuffleParams = {
            'shufflerQueue': queue_name,
            'gcsBucket': bucket_name,
            'inputFileNames': [f[len(bucket_name) + 2:] for f in filenames],
            'outputDir': '{name}/{id}/shuffle'.format(
                name=job_name, id=self.root_pipeline_id),
            'outputShards': shards or len(filenames),
            'callbackQueue': queue_name,
            'callbackModule': modules.get_current_module_name(),
            'callbackVersion': modules.get_current_version_name(),
            'callbackPath': self.get_callback_url(),
        }
        log.info('/shufflerServlet - {}'.format(shuffleParams))
        taskqueue.add(url='/shufflerServlet',
                      payload=json.dumps(shuffleParams),
                      target="wherever_you_deployed_the_shuffler",
                      queue_name="your_queue_name")

    def callback(self, job, status, output=None):
        log.info("JavaShufflePipeline.callback(job={}, status={}, output={})".format(
            job, status, output))
        if status == 'failed':
            self.abort()
            self.complete()
        else:  # status == 'done'
            filename = '/' + app_identity.get_default_gcs_bucket_name() + '/' + output
            with gcs.open(filename) as fp:
                filenames = fp.read()
            filenames = [fn for fn in filenames.split('\n') if fn]
            self.complete(filenames)
```
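The only change to the copied MapreducePipeline.run is then the line that yields the shuffler. A rough sketch of the substitution follows; the surrounding variable names come from whatever version of mapreduce_pipeline.py you copied, so verify them against your copy.

```python
# Inside the copied MapreducePipeline.run() (sketch only; map_pipeline is the
# future holding the map stage's output filenames in the copied code).
#
# Was something like:
#   shuffler_pipeline = yield ShufflePipeline(job_name, mapper_params, map_pipeline)
#
# Becomes:
shuffler_pipeline = yield JavaShufflePipeline(job_name,
                                              mapper_params,
                                              map_pipeline,
                                              shards=shards)
```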