generic-worker icon indicating copy to clipboard operation
generic-worker copied to clipboard

exit if remainingTasks is less than 0 (vs just 0)

Open aerickson opened this issue 5 years ago • 1 comments

g-w should exit if remainingTasks is less than or equal to 0.

Bitbar workers recently had an issue (due to our homegrown supersede detection) where we had a bad worker that failed ~180 tasks because g-w continued to work and didn't exit due to this. We configure g-w to exit after 1 task. We didn't use to delete the completed job counter, so remainingTasks was < 0 and g-w continued to run until it exceeded Bitbar's allowed container run time.

related:

  • fix in mozilla-bitbar-docker to delete counter file: https://github.com/bclary/mozilla-bitbar-docker/pull/28
  • wander dealing with the counter file: https://bugzilla.mozilla.org/show_bug.cgi?id=1375514
  • our supersede detection: https://bugzilla.mozilla.org/show_bug.cgi?id=1563307

aerickson avatar Oct 23 '19 22:10 aerickson

Hi Andrew,

Many thanks for the PR! We're slap bang in the middle of a migration at the moment which is top priority, so it may be a while before I get back to this, but I appreciate the PR.

I think it may be time for us to make numberOfTasksToRun a disallowed property when the engine is already limited to running a single task before exiting.

At the moment, reboots are required between tasks in the multiuser engine, but at some point we'll have support for headless tasks on multiuser that don't require a graphical interface, and then multiuser engine should be able to run multiple tasks at a time without exiting between them, so then the numberOfTasksToRun config setting is useful. But in that case it would not need to persist this information to a file, so maybe we can abandon reading/writing the tasks-resolved-count.txt file entirely.

This is a throw back to when we didn't have a notion of "engine" in generic-worker and each platform operated slightly differently. In order for numberOfTasksToRun to make sense on Windows where only a single task could run at a time, I chose to persist the total in the tasks-resolved-count.txt file, so you could say "run 100 tasks and then kill the machine" and the worker would run tasks with reboots between, but after 100 tasks had been run, it would have a different exit code to say that the worker had completed everything it needed to do:

    0      Tasks completed successfully; no more tasks to run (see config setting
           numberOfTasksToRun).
...
...
    67     A task user has been created, and the generic-worker needs to reboot in order
           to log on as the new task user. Note, the reboot happens automatically unless
           config setting disableReboots is set to true - in either code this exit code will
           be issued.

But it has caused a lot of confusion, so I think the time to rethink it a little is here.

I'd like to park this while we have our platform migration going on, so it won't be until later in November that I will have much time to come back to this, but it sounds like you have a workaround in place (deleting tasks-resolved-count.txt) so hopefully this won't block you.

Thanks again!

petemoore avatar Oct 25 '19 10:10 petemoore