zimfarm
Cancelling a task is not resilient enough
When a user or the periodic scheduler requests a task cancellation, the request is stored in the DB.
It is then the worker manager's responsibility to detect that these tasks have been marked for cancellation and to actually cancel them on the worker.
The limitation is that when the worker manager is down, nothing can stop the task. This was encountered on zimit.kiwix.org: someone bypassed the size limit, one task consumed far too much disk space, and the worker manager shut itself down.
I wonder whether the task manager should itself query the API to check if it should stop itself.
Having both the task manager and the worker manager cancel tasks needs to be assessed for impact, but it should be feasible.
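
To make the self-check idea concrete, here is a minimal sketch of what a polling loop in the task manager could look like. It is not the existing Zimfarm code: the status values, the `fetch_task` callable (standing in for whatever API call returns the task document) and the `on_cancel` callback are all assumptions made for illustration.

```python
import threading
from typing import Callable, Optional

# Assumption: statuses meaning "this task must stop". Adjust to the real API values.
CANCEL_STATUSES = {"cancel_requested", "canceled"}


def should_stop(task: dict) -> bool:
    """Return True if the task document says the task must be stopped."""
    return task.get("status") in CANCEL_STATUSES


def watch_for_cancellation(
    fetch_task: Callable[[], Optional[dict]],  # hypothetical: queries the API for this task
    on_cancel: Callable[[], None],  # hypothetical: stops the containers and exits
    stop_event: threading.Event,
    interval: float = 60.0,
) -> bool:
    """Poll until cancellation is requested or stop_event is set.

    Returns True if on_cancel was triggered, False if the loop was stopped.
    """
    while not stop_event.is_set():
        try:
            task = fetch_task()
        except Exception:
            # The API being unreachable must not kill the task; retry on the next tick.
            task = None
        if task is not None and should_stop(task):
            on_cancel()
            return True
        # Sleep for `interval` seconds, waking early if stop_event is set.
        stop_event.wait(interval)
    return False
```

Run in a background thread of the task manager, this would let a task stop itself even when the worker manager is down. Since the worker manager may cancel the same task at the same time, `on_cancel` would need to be idempotent.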
Another concern is that if the task manager is gone, the worker manager has no way of knowing how to kill the other containers linked to this task. Currently it is the task manager's responsibility to properly terminate all containers associated with it. I think we should take into account the scenario where the task manager is gone or unresponsive, and be able to kill all containers associated with a given task from the worker manager.
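
One possible approach, sketched below, is to have the worker manager find containers by label rather than rely on the task manager. This assumes every container started for a task carries a label identifying that task; the label name `task_id` is hypothetical here. The `client` argument is expected to behave like a Docker SDK for Python client (`containers.list(all=..., filters=...)`, then `stop()` and `remove()` on each container).

```python
from typing import Any, List

# Assumption: all containers of a task carry this label, set to the task's ID.
TASK_LABEL = "task_id"


def kill_task_containers(client: Any, task_id: str, timeout: int = 10) -> List[str]:
    """Stop and remove every container labelled with the given task ID.

    Returns the names of the containers that were successfully removed.
    """
    containers = client.containers.list(
        all=True,  # include containers that already exited
        filters={"label": f"{TASK_LABEL}={task_id}"},
    )
    removed = []
    for container in containers:
        try:
            container.stop(timeout=timeout)  # graceful stop, then kill after timeout
            container.remove(force=True)
        except Exception:
            # Keep going: one stuck container should not protect the others.
            continue
        removed.append(container.name)
    return removed
```

With the real SDK this would be called as `kill_task_containers(docker.from_env(), task_id)`. Because the lookup is by label, it also covers the task manager container itself, as long as that container is labelled the same way.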