appengine-mapreduce
Controller keeps rescheduling itself.
In Python, the controller callback (used to update the UI) continuously runs. See:
https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/handlers.py#L1414
https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/handlers.py#L1124
The problem is that this can continue even after the job has finished. An easy patch is to get more aggressive about stopping or backing off the controller. The preferred fix, however, would be to move towards the way the Java implementation works: have the controller run only once, to synchronize the end of the shards' lives, and manage the UI elsewhere.
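For illustration, the "easy patch" direction might look roughly like this (a sketch only; maybe_reschedule_controller, state, and controller_serial_id are hypothetical names, not the library's actual handler code):

```python
from google.appengine.api import taskqueue

# Hypothetical sketch of "stop or back off" -- not the library's handler code.
def maybe_reschedule_controller(state, queue_name, controller_url):
    if not state.active:
        # The job has finished: do not enqueue another controller task.
        return
    # Back off as the controller keeps running, instead of a fixed countdown.
    countdown = min(2 + state.controller_serial_id // 50, 60)
    taskqueue.add(url=controller_url,
                  queue_name=queue_name,
                  countdown=countdown,
                  params={'mapreduce_id': state.key().id_or_name()})
```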
Any activity on this issue? It appears to affect a decent proportion of users, and the only real solution is pausing and purging the queue, with work lost/incomplete and potentially many instance-hours wasted.
Examples, all of which show infinite respawning:
- http://stackoverflow.com/questions/11507167/mapreduce-jobs-running-forever
- https://code.google.com/p/googleappengine/issues/detail?id=12246
- https://groups.google.com/forum/#!searchin/google-appengine/mapreduce/google-appengine/PDnMm67BskI/5DV0lxo2_FgJ
- https://groups.google.com/forum/#!searchin/google-appengine/mapreduce/google-appengine/RQaPGQ55hsY/dC5ZV_BuhMMJ
- https://groups.google.com/forum/#!searchin/google-appengine/mapreduce/google-appengine/bcJut759svo/pVWv9ze1jNkJ
- https://groups.google.com/forum/#!searchin/google-appengine/mapreduce/google-appengine/TUf4OvITSG4/wmftKhQIQBQJ
- https://groups.google.com/forum/#!searchin/google-appengine/mapreduce/google-appengine/K8iWOwbIuHQ/CNVfuken0u0J
In all of these cases, the user sees infinite retries rather than the code aborting after some retry limit. In some cases (and I've seen this elsewhere; there may even be an open issue right now that abort doesn't work), abort is called but doesn't work. In all of them, the only solution is to pause and purge the queue, a very manual action for a potentially expensive bug.
We are also seeing this issue. Reached 3000 retries before I manually deleted it from the queue. It was solely responsible for keeping our instance alive.
It does not seem to occur with a small number of models to process (~30 / 1-2 shards), but occurs with ~5000 models / 32 shards.
From the mapreduce detail screen, it looks like one leftover shard is always "running". That shard always seems to have None as its end iterator:
30 | running | KeyRangesIterator for Single KeyRange ASC[datastore_types.Key.from_path(u'Player', 9807806152L, u'SaveGame', u'savegame', _app=u's~redacted') to None) | Unknown | 00:01:23
Well that was silly of me. Our mapreduce callback was raising an exception for a couple of models, which caused the shard task to fail (and then retry, but eventually fail permanently). The failed shard task caused the controller task to run forever.
The strange thing is that I don't seem to get any error logs, even though I call logging.error inside the mapreduce callback. Is it somehow possible that logging.error wouldn't write to the log from within the callback...?
We're also seeing this issue popping up regularly. When doing backups through Google's datastore backup service, every few days we get a hanging backup, where mapreduce is infinitely scheduling jobs (probably after some failure). The only workaround is, after canceling the backup, to pause our queue for a few seconds.
I believe I read somewhere that map/reduce functions should never throw exceptions or fail, since if they do, they'll retry, and unless you specify a retry limit, you'll see these infinite retries.
An exception caught within the map/reduce handler can certainly log somewhere that this occurred, and the relevant entity / record can be stored for a future pass. If the job is small enough, it can also be re-run, since output phase tasks should be idempotent anyways.
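For example, the defensive style I have in mind looks roughly like this (a sketch; defensive_map and do_real_work are placeholders, not library API):

```python
import logging

def do_real_work(entity):
    pass  # real per-entity processing would go here

def defensive_map(entity):
    """Map handler that never lets an exception escape to the framework."""
    try:
        do_real_work(entity)
        entity.put()
    except Exception:
        # Log and skip instead of failing the shard (which triggers retries);
        # record entity.key() somewhere if it needs a later pass.
        logging.exception('map handler failed for %s', entity.key())
```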
Is this an accurate assessment of the situation here? I'd really appreciate a comment from somebody more expert.
dt-rush: You are generally correct that if the MR fails you will see something. The controller re-scheduling does not appear to be related to the success or failure of the job, but rather to a race condition on its termination. Each shard is supposed to update the state of the job as it is finishing. Then the controller will detect that the job is no longer running and stop: https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/handlers.py#L1124
However in this case there is something that is causing it to think the job is still active. There are three possibilities:
- The job still is active, because one of the shards is frozen for some reason. (Perhaps a task was dropped, or falsely acked to the task queue)
- The job is not active but it thinks it is because there were multiple shards completing at the same time and the transaction boundary for updating the state on completion was improperly defined. (When one shard marked itself as done, it accidentally undoes another shard's update.)
- The logic just before line 1124 is blatantly wrong, and it's not checking the state correctly.
I suspect the second is the most likely cause.
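For illustration, the lost-update scenario in the second case can be reproduced with a toy simulation like this (hypothetical names, not the library's code):

```python
# Two shards finish at nearly the same time and both do a
# non-transactional read-modify-write of the shared job state.

class FakeState(object):
    """Stands in for MapreduceState; only the shard counter matters here."""
    def __init__(self, active_shards):
        self.active_shards = active_shards

STORE = FakeState(active_shards=2)

def non_transactional_get():
    # Simulates a stale, non-transactional fetch: a private snapshot.
    return FakeState(STORE.active_shards)

def blind_put(snapshot):
    # Simulates a put() outside any transaction: last write wins.
    STORE.active_shards = snapshot.active_shards

shard_a = non_transactional_get()  # sees 2
shard_b = non_transactional_get()  # also sees 2

shard_a.active_shards -= 1
blind_put(shard_a)                 # counter is now 1

shard_b.active_shards -= 1
blind_put(shard_b)                 # writes 1 again; shard A's decrement is lost

print(STORE.active_shards)         # 1 -- never reaches 0, so the job looks active forever
```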
A quick inspection of all the places where a MapreduceState entity is conjured, checking to see if its active property is modified, yields the locations below. Some analysis of each location:
- This is the _drop_gracefully implementation of ControllerCallbackHandler. It will run when the super-class base_handler.TaskQueueHandler's initialize() method is called and meets certain conditions (see the code).
- It sets the active property to False, then passes the state object to ControllerCallbackHandler._finalize_job(). The code trusts the _put_state() method to persist the MapreduceState entity, using @db.transactional(retries=5).
- This code sets the active property of the MapreduceState entity according to whether it has active_shards. Perhaps warrants an investigation into whether that can be out of sync with reality (one of the shards failing to report back?).
- This code sets the active property to False if there are no readers available. Probably not the cause of this bug, but worth looking further into. Similarly persists the entity, which has just had its active property set to False, using ControllerCallbackHandler._finalize_job().
- This is KickOffJobHandler's implementation of _drop_gracefully, which is required by base_handler.TaskQueueHandler. Once again persists the entity, which has just had its active property set to False, using ControllerCallbackHandler._finalize_job().
- This code represents the creation of a new MapreduceState entity with active set to True. It's only called from one place: StartJobHandler's _start_map() method.
mapreduce/api/map_job/map_job_control.py#L203
- This is part of the __create_and_save_state() method of the API class Job. Sets active to True. The only place this is called from seems to be Job's submit() method.
- It appears this code is called by map_job_control.Job's submit() method, and that submit() is only called from tests. So, no need to investigate that further, it seems.
Ah yes. The problem may be how mapreduce/handlers.py#L1178 relates to mapreduce/model.py#L1118
_update_state_from_shard_states assumes it has the latest info on the shard states and marks the job as finalized if and only if all the shard states are finalized. However, the shard states passed in come from find_all_by_mapreduce_state, which loads them in a non-transactional way, so they are not guaranteed to be fresh.
Inside _update_state_from_shard_states it also checks the state to make sure it has not already been set. (This is good and correct. We'll need to check to ensure all other locations also do this.) I am not sure if there are any other properties besides 'active' that should be checked, but those are at least not causing this bug.
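One possible shape of a fix, just as a sketch (finalize_if_done is a hypothetical name, not the library's API): do the final check-and-flip of the job state inside the transaction, so a run that acted on stale shard states becomes a no-op:

```python
from google.appengine.ext import db

@db.transactional(retries=5)
def finalize_if_done(mapreduce_state_key, shards_all_done):
    state = db.get(mapreduce_state_key)  # fresh, transactional read
    if not state.active:
        return False                     # someone else already finalized the job
    if not shards_all_done:
        return False                     # work still in flight; leave the state alone
    # Note: the shard states themselves live in other entity groups, so their
    # freshness would still need separate handling (e.g. a re-count).
    state.active = False
    state.put()
    return True
```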
Have there been any updates on fixing this issue? I just found that we are running into this, leaving scheduled pipeline controllers always running and using instance hours. Are there any suggestions on how to find these? The active state and other properties on the ShardState are not indexed, so short of mapreducing over the whole table, looking for old active mapreduces and then manually ending the shards that are still active, I don't know of a way to get them to stop.
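For concreteness, the brute-force walk I mean would be roughly this (a sketch, assuming MapreduceState exposes active and start_time properties):

```python
import datetime
from mapreduce import model

def find_stale_mapreduces(older_than_hours=24):
    """Walks every MapreduceState entity, since the active flag is not indexed."""
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=older_than_hours)
    stale = []
    for state in model.MapreduceState.all():
        if state.active and state.start_time < cutoff:
            stale.append(state.key().id_or_name())
    return stale
```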
Appreciate any thoughts on how to resolve, or work around.
(I'm no longer working on this project, so someone else may have more up-to-date info, but:) Barring a fix for the code actually causing this problem, the easiest way to stop these jobs is to pause the task queue and delete the messages there. (This can easily be done through the console.) If you have a dedicated queue, this can be non-disruptive (albeit requiring manual intervention).
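If the mapreduce runs on its own dedicated queue, purging it from code is another option (a sketch; the queue name is just a placeholder, and pausing the queue still has to be done from the console or queue configuration):

```python
from google.appengine.api import taskqueue

# Drops all pending tasks on the named queue, including the runaway
# controller tasks. Irreversible for anything else queued there.
taskqueue.Queue('mapreduce-queue').purge()
```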