procrastinate icon indicating copy to clipboard operation
procrastinate copied to clipboard

Reconsider shutdown_timeout of the worker

Open medihack opened this issue 1 year ago • 4 comments

In the v3 branch, the worker has a shutdown_time . The docs v3 currently say, If a shutdown_timeout option is specified, the worker will attempt to abort all jobs that have not been completed by that time. This has not been implemented, but the timeout cancels the asyncio task that runs all side coroutines and fetches/processes the jobs. It does not abort the running jobs.

There were several things discussed on Discord:

  • Adapt the documentation
  • Send an abort request to running jobs
  • Introduce another timeout that sends the abort request
  • Provide an additional reason for the abortion

medihack avatar Sep 06 '24 21:09 medihack

This is now being implemented for async tasks. Jobs of such a task are set to ABORTED state when the shutdown_timeout ended (because this raises a CancelledError). This is problematic for sync tasks. Jobs of those tasks are also set to ABORTED, but the thread of that task may still be running in the background and we can't do anything about it. One way we could handle this is to ignore the CancelledError for sync tasks and send an abort request to such a job (which can then be checked with should_abort). @onlyann Do you think this is a good approach?

medihack avatar Sep 29 '24 22:09 medihack

Yes, that sounds like a plan.

We could emit a request to abort all remaining jobs after the graceful timeout irrespective of the job being sync or async.

I think you are correct that async jobs may not require that step.

That said, it would still be good to have for consistency and visibility.

An additional timeout (yet to settle on a name) defines how long the worker would wait for jobs to abort.

Not sure what is the best status for jobs that become unresponsive after being requested to abort. Should they become aborted or remain as doing, effectively stalled?

onlyann avatar Oct 01 '24 10:10 onlyann

An additional timeout (yet to settle on a name) defines how long the worker would wait for jobs to abort.

What speaks against sending the abort request just after the shutdown_timeout itself ended (here and here)? Then the documentation with If a shutdown_timeout option is specified, the worker will attempt to abort all jobs that have not completed by that time. Cancelling the run_worker_async task a second time also results in the worker aborting running jobs. would be true again. We can't kill them anyway (as it looks like processes won't be part of the v3 release).

Not sure what is the best status for jobs that become unresponsive after being requested to abort. Should they become aborted or remain as doing, effectively stalled?

We can't really stop that thread, so I suggest leaving the job in the doing state and keeping the worker running. From looking at the code, this is how the worker currently already behaves (is that correct?). I can also add an additional note or warning to the documentation.

medihack avatar Oct 01 '24 21:10 medihack

I agree about keeping the status of such jobs to doing.

As for the abort request, it would be triggered right after the shutdown_timeout.

My comment is about the timespan the worker will wait for jobs to abort (from the moment it requests jobs to abort).

Sync jobs might not acknowledge the request.

Async jobs might handle asyncio.CancelledError to perform cleanup work.

As discussed, this timespan would be configurable, in the same way that shutdown_timeout is.

Let's say this is called shutdown_abort_timeout. Then, the worker would wait when requested to shutdown at most about shutdown_timeout + shutdown_abort_timeout.

This abort timeout could be optional where the default would be that the worker waits indefinitely for jobs to abort.

Any stalled job would cause the worker to be stuck waiting.

This is where having the ability to provide a timeout would allow the worker to complete shutdown, at the expense of stalled jobs still running (as you pointed out, sync jobs running in their own thread cannot be forced to stop).

With all that said, we can also omit this additional timeout until there is enough demand for it, and let the worker wait forever for jobs to abort in the interim.

onlyann avatar Oct 01 '24 22:10 onlyann

Done by #1185

ewjoachim avatar Dec 01 '24 21:12 ewjoachim