zimfarm Cancelling a task is not resilient enough

Open benoit74 opened this issue 7 months ago • 3 comments

When a user or the periodic scheduler requests a task cancellation, the request is stored in the database.

It is then the worker manager's responsibility to detect that these tasks have been flagged for cancellation, and to actually cancel them on the worker.

The limitation is that when the worker manager is down, nothing can stop the task. This happened on zimit.kiwix.org: someone tricked the size limit, one task consumed far too much disk space, and the worker manager shut itself down.

I wonder whether the task manager should itself query the API to check if it should stop itself.

Having both the task manager and the worker manager cancel tasks needs to be assessed for impact, but it should be feasible.
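To make the idea concrete, here is a minimal sketch of a task manager polling for its own cancellation. It is not the existing zimfarm code: `fetch_status`, `on_cancel`, the status names in `CANCEL_STATUSES` and the polling interval are all assumptions made for illustration. The real API call would be plugged in as `fetch_status`.

```python
import threading
from typing import Callable, Optional

# Hypothetical status values; the real API may use different names.
CANCEL_STATUSES = {"cancel_requested", "canceled"}


def watch_for_cancellation(
    fetch_status: Callable[[], Optional[str]],
    on_cancel: Callable[[], None],
    stop_event: threading.Event,
    interval: float = 60.0,
) -> bool:
    """Poll the task status until a cancellation is seen or stop_event is set.

    Returns True if on_cancel was triggered, False if the watcher was stopped.
    """
    while not stop_event.is_set():
        try:
            status = fetch_status()
        except Exception:
            # API unreachable: do not stop the task, just retry later.
            status = None
        if status in CANCEL_STATUSES:
            on_cancel()
            return True
        # Sleep, but wake up immediately if stop_event gets set.
        stop_event.wait(interval)
    return False
```

The watcher could run in a daemon thread next to the task manager's main loop. Since the worker manager may cancel the same task concurrently, `on_cancel` would have to be idempotent.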

Another concern is that if the task manager is gone, the worker manager has no idea how to kill the other containers linked to this task. Currently it is the task manager's responsibility to properly terminate all containers associated with it. I think we should take into account the scenario where the task manager is gone or unresponsive, and be able to kill all containers associated with a given task from the worker manager.
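One possible approach, sketched below, is to find the containers through a label. This assumes every container started for a task carries a label holding the task ID; the label key `task_id` and the function name are hypothetical. `client` is expected to behave like a Docker SDK for Python client, e.g. the object returned by `docker.from_env()`.

```python
from typing import Any, List

TASK_LABEL = "task_id"  # hypothetical label key, set when containers are created


def kill_task_containers(client: Any, task_id: str, grace_seconds: int = 10) -> List[str]:
    """Stop every running container labelled with the given task ID.

    Returns the names of the containers that were stopped or killed.
    """
    containers = client.containers.list(
        filters={"label": f"{TASK_LABEL}={task_id}"}
    )
    terminated = []
    for container in containers:
        try:
            # Graceful stop first: SIGTERM, then SIGKILL after the timeout.
            container.stop(timeout=grace_seconds)
        except Exception:
            try:
                # Fall back to an immediate kill if the stop request failed.
                container.kill()
            except Exception:
                # Container probably already gone; move on to the next one.
                continue
        terminated.append(container.name)
    return terminated
```

With this, the worker manager would not need the task manager to be alive: `kill_task_containers(docker.from_env(), task_id)` would cover the task manager container and any other container it launched, as long as they all carry the label.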

Jul 24 '24 09:07 benoit74
