
Controller keeps rescheduling itself.

Open tkaitchuck opened this issue 10 years ago • 12 comments

In Python, the controller callback (used to update the UI) continuously runs. See:

https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/handlers.py#L1414

https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/handlers.py#L1124

The problem is that this can continue even after the job has finished. An easy patch is to be more aggressive about stopping or backing off the controller. However, the preferred fix would be to move towards the way the Java implementation works: the controller is used only once, to synchronize the end of each shard's life, and the UI is managed elsewhere.
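For what it's worth, the "stop or back off" patch could be as small as a guard like the one below before the controller re-enqueues itself. This is only a sketch: it assumes model.MapreduceState.get_by_job_id() returns the job's state entity, and the serial-number cutoff is made up.

```python
# Sketch of a guard before the controller re-enqueues itself; not the
# actual handlers.py code. Assumes model.MapreduceState.get_by_job_id()
# returns the job's state entity; the cutoff value is made up.
from mapreduce import model


def should_reschedule_controller(mapreduce_id, serial_id, max_serial_id=10000):
    """Return True only if the controller task should schedule its successor."""
    state = model.MapreduceState.get_by_job_id(mapreduce_id)
    if state is None or not state.active:
        # The job finished, was aborted, or its state was cleaned up:
        # dropping the task here breaks the infinite rescheduling loop.
        return False
    if serial_id > max_serial_id:
        # Emergency brake so a wedged job cannot spin forever.
        return False
    return True
```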

tkaitchuck avatar Jul 15 '15 23:07 tkaitchuck

Any activity on this issue? It appears to affect a decent proportion of users, and the only real solution is pausing and purging the queue, with work lost/incomplete and potentially many instance-hours wasted.

dt-rush avatar Aug 14 '15 16:08 dt-rush

Examples, all of which show infinite respawning:

  • http://stackoverflow.com/questions/11507167/mapreduce-jobs-running-forever
  • https://code.google.com/p/googleappengine/issues/detail?id=12246
  • https://groups.google.com/forum/#!searchin/google-appengine/mapreduce/google-appengine/PDnMm67BskI/5DV0lxo2_FgJ
  • https://groups.google.com/forum/#!searchin/google-appengine/mapreduce/google-appengine/RQaPGQ55hsY/dC5ZV_BuhMMJ
  • https://groups.google.com/forum/#!searchin/google-appengine/mapreduce/google-appengine/bcJut759svo/pVWv9ze1jNkJ
  • https://groups.google.com/forum/#!searchin/google-appengine/mapreduce/google-appengine/TUf4OvITSG4/wmftKhQIQBQJ
  • https://groups.google.com/forum/#!searchin/google-appengine/mapreduce/google-appengine/K8iWOwbIuHQ/CNVfuken0u0J

In all of these cases, the user sees infinite retries rather than the code aborting after some retry limit. In some cases (and I've seen this elsewhere; there may even be an open issue right now that abort doesn't work), abort is called but has no effect. In all of them, the only solution is to pause and purge the queue, a very manual action for a potentially expensive bug.
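Since "pause and purge the queue" keeps coming up as the only workaround: pausing still has to be done from the console, but the purge half can at least be scripted with the Task Queue API. A rough sketch; the queue name is an assumption, substitute whatever dedicated queue your mapreduce jobs run on.

```python
# Rough sketch of scripting the "purge" half of the workaround, e.g.
# from a remote_api shell or an admin handler. The queue name
# 'mapreduce-queue' is an assumption; substitute your dedicated queue.
# Pausing the queue itself still has to be done from the console.
from google.appengine.api import taskqueue


def purge_mapreduce_queue(queue_name='mapreduce-queue'):
    # Drops all pending controller/worker tasks sitting in the queue.
    taskqueue.Queue(queue_name).purge()
```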

dt-rush avatar Aug 14 '15 16:08 dt-rush

We are also seeing this issue. Reached 3000 retries before I manually deleted it from the queue. It was solely responsible for keeping our instance alive.

It does not seem to occur with a small number of models to process (~30 / 1-2 shards), but occurs with ~5000 models / 32 shards.

From the mapreduce detail screen, it looks like one leftover shard is always "running". That shard always seems to have None as its end iterator:

    30  running  KeyRangesIterator for Single KeyRange ASC[datastore_types.Key.from_path(u'Player', 9807806152L, u'SaveGame', u'savegame', _app=u's~redacted') to None)  Unknown  00:01:23

ScoTay avatar Aug 18 '15 05:08 ScoTay

Well that was silly of me. Our mapreduce callback was raising an exception for a couple of models, which caused the shard task to fail (and then retry, but eventually fail permanently). The failed shard task caused the controller task to run forever.

The strange thing is that I don't seem to get any error logs, even though I call logging.error inside the mapreduce callback. Is it somehow possible that logging.error wouldn't write to the log from within the callback...?

ScoTay avatar Aug 18 '15 08:08 ScoTay

We're also seeing this issue popping up regularly. When doing backups through Google's datastore backup service, every few days we get a hanging backup, where mapreduce is infinitely scheduling jobs (probably after some failure). The only workaround is, after canceling the backup, to pause our queue for a few seconds.

remko avatar Aug 30 '15 14:08 remko

I believe I read somewhere that map/reduce functions should never throw exceptions or fail, since if they do, they'll retry, and unless you specify a retry limit, you'll see these infinite retries.

An exception caught within the map/reduce handler can certainly log somewhere that this occurred, and the relevant entity/record can be stored for a future pass. If the job is small enough, it can also be re-run, since output-phase tasks should be idempotent anyway.
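For illustration, a minimal sketch of that pattern, assuming ndb entities; FailedRecord is a hypothetical model, and the wrapped process() function stands in for your real map logic (neither is part of the mapreduce library).

```python
# Minimal sketch of the "never let the map handler raise" pattern.
# FailedRecord is a hypothetical model used to park failed records for
# a later pass; 'process' is your real map function, passed in.
import logging

from google.appengine.ext import ndb


class FailedRecord(ndb.Model):
    """Records entities whose processing failed, for a later pass."""
    entity_key = ndb.KeyProperty()
    error = ndb.TextProperty()


def make_defensive_map(process):
    """Wrap a map function so exceptions are recorded instead of raised."""
    def defensive_map(entity):
        try:
            for op in process(entity):
                yield op
        except Exception as exc:
            # Log and record the failure instead of letting the shard
            # task fail and retry (potentially forever without a limit).
            logging.exception('map failed for %s', entity.key)
            FailedRecord(entity_key=entity.key, error=str(exc)).put()
    return defensive_map
```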

Is this an accurate assessment of the situation here? I'd really appreciate a comment from somebody more expert.

dt-rush avatar Sep 01 '15 17:09 dt-rush

dt-rush: You are generally correct that if the MR fails you will see something. The controller re-scheduling does not appear to be related to the success or failure of the job, but rather to a race condition at its termination. Each shard is supposed to update the state of the job as it finishes; the controller will then detect that the job is no longer running and stop: https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/src/mapreduce/handlers.py#L1124

However, in this case something is causing it to think the job is still active. There are three possibilities:

  1. The job still is active, because one of the shards is frozen for some reason. (Perhaps a task was dropped, or falsely acked to the task queue)
  2. The job is not active, but the controller thinks it is because multiple shards completed at the same time and the transaction boundary for updating the state on completion was improperly defined. (When one shard marks itself as done, it accidentally undoes another shard's update; sketched below.)
  3. The logic just before line 1124 is blatantly wrong, and it's not checking the state correctly.

I suspect the second is the most likely cause.
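To make possibility 2 concrete, here is what that lost update would look like in miniature. This is purely illustrative, not the actual handlers.py code; 'state' stands in for the job-level state entity.

```python
# Illustration of possibility 2 only; not the actual handlers.py code.
# Two shards finishing at roughly the same time can both read
# active_shards == 2 and both write back 1, so the count never reaches
# 0 and the controller keeps seeing an "active" job.
def mark_shard_done_racy(state):
    # 'state' stands in for the job-level state entity, freshly read
    # outside of any transaction by each finishing shard's task.
    state.active_shards -= 1        # two tasks can both compute 2 - 1 = 1
    if state.active_shards == 0:
        state.active = False        # never reached in the racy interleaving
    state.put()                     # last write wins; one decrement is lost
```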

tkaitchuck avatar Sep 04 '15 18:09 tkaitchuck

A quick inspection of all the places that a MapreduceState entity is conjured, checking to see if its active property is modified, yields:

dt-rush avatar Sep 28 '15 18:09 dt-rush

Some analysis of each location:

mapreduce/handlers.py#L1084

mapreduce/handlers.py#L1178

  • This code sets the active property of the MapreduceState entity according to whether it has active_shards. Perhaps this warrants investigation into whether that can be out of sync with reality (one of the shards failing to report back?); a quick way to check is sketched after this list.

mapreduce/handlers.py#L1391

  • This code sets the active property to False if there are no readers available. Probably not the cause of this bug, but worth looking into further. It similarly persists the entity whose active property has just been set to False, using ControllerCallbackHandler._finalize_job().

mapreduce/handlers.py#L1431

mapreduce/handlers.py#L1780

  • This code represents the creation of a new MapreduceState entity with active set to True. It's only called from one place: StartJobHandler's _start_map() method.

mapreduce/api/map_job/map_job_control.py#L203

  • This is part of the __create_and_save_state() method of the API class Job. It sets active to True, and the only place it seems to be called from is Job's submit() method.
  • submit() itself appears to be called only from tests, so there seems to be no need to investigate this further.
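For the "out-of-sync with reality" question flagged above for mapreduce/handlers.py#L1178, a rough remote_api-shell check is to compare the job-level state with what the shards themselves report. This assumes MapreduceState.get_by_job_id() and ShardState.find_all_by_mapreduce_state() behave as their names suggest; double-check against mapreduce/model.py.

```python
# Rough remote_api-shell sketch for the "out-of-sync" question above:
# compare the job-level MapreduceState with what its ShardStates say.
# Assumes get_by_job_id() and find_all_by_mapreduce_state() behave as
# their names suggest; double-check against mapreduce/model.py.
from mapreduce import model


def compare_job_and_shards(mapreduce_id):
    state = model.MapreduceState.get_by_job_id(mapreduce_id)
    shards = list(model.ShardState.find_all_by_mapreduce_state(state))
    still_running = [s for s in shards if s.active]
    print('job active=%s, active_shards=%s, shards reporting active=%d' % (
        state.active, state.active_shards, len(still_running)))
```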

dt-rush avatar Sep 28 '15 18:09 dt-rush

Ah yes. The problem may be how mapreduce/handlers.py#L1178 relates to mapreduce/model.py#L1118.

_update_state_from_shard_states assumes it has the latest info on the shard states and marks the job finalized if and only if all the shard states are finalized. However, the shard states passed in come from find_all_by_mapreduce_state, which loads them in a non-transactional way, so they are not guaranteed to be fresh.

Inside _update_state_from_shard_states it also checks the state to make sure it has not already been set. (This is good and correct. We'll need to check to ensure all other locations also do this.) I am not sure whether there are any other properties besides 'active' that should be checked, but those at least are not causing this bug.
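For reference, the check-then-set guard described above looks roughly like the sketch below; this illustrates the pattern the other locations should follow, not the actual patch.

```python
# Sketch of the check-then-set guard described above, for the other
# locations that also flip MapreduceState.active. Not the actual patch;
# it only illustrates doing the check inside the same transaction that
# writes the state, so a second task cannot undo the first one's update.
from google.appengine.ext import db

from mapreduce import model


def set_inactive_once(mapreduce_id):
    def txn():
        state = model.MapreduceState.get_by_job_id(mapreduce_id)
        if not state.active:
            # Another task already finalized the job; leave its result alone.
            return False
        state.active = False
        state.put()
        return True
    return db.run_in_transaction(txn)
```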

tkaitchuck avatar Oct 02 '15 19:10 tkaitchuck

Have there been any updates on fixing this issue? I just found that we are running into this, leaving scheduled pipeline controllers always running and using instance hours. Are there any suggestions on how to find these? The active state and other properties on ShardState are not indexed, so other than mapreducing over the whole table looking for old active mapreduces and then manually ending the shards that are still active, I do not know of a way to get them to stop.

Appreciate any thoughts on how to resolve, or work around.

nickfirespotter avatar Jan 12 '16 19:01 nickfirespotter

(I'm no longer working on this project, so someone else may have more up-to-date info, but:) Barring fixing the actual code causing this problem, the easiest way to stop these jobs is to pause the task queue and delete the tasks there. (This can easily be done through the console.) If you have a dedicated queue, this can be non-disruptive (albeit requiring manual intervention).
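For the question above about finding these jobs in the first place, a rough remote_api-shell sketch is below. It assumes MapreduceState is a db.Model that can be enumerated with all() and has active and start_time properties; since the relevant properties may not be indexed, it filters in memory, which is only practical if you don't have huge numbers of old MapreduceState entities.

```python
# Rough remote_api-shell sketch for finding jobs that still claim to be
# active long after they started. Assumes MapreduceState is a db.Model
# with active and start_time properties; filtering happens in memory
# because the relevant properties may not be indexed.
import datetime

from mapreduce import model


def find_stuck_jobs(max_age_hours=24):
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=max_age_hours)
    stuck = []
    for state in model.MapreduceState.all():
        if state.active and state.start_time and state.start_time < cutoff:
            stuck.append(state)
    return stuck
```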

tkaitchuck avatar Jan 12 '16 21:01 tkaitchuck