QCFractal
Parsl Worker Shutdown Task Failure
Describe the bug
If a Parsl worker is killed for any reason, we report this back as a task failure when the task should likely be restarted instead. This is especially harsh on interruptible (preemptible) queues.
raise ManagerLost(manager, self._ready_manager_queue[manager]['hostname'])
parsl.executors.high_throughput.interchange.ManagerLost: Task failure due to loss of Manager b'55fe9b2bb4f4' on host ca024
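As a stopgap on the QCFractal side, something like the following sketch could distinguish a lost manager from a genuine task failure when collecting results. This is hypothetical, not existing QCFractal code: `collect_result` and the `requeue` callback are made-up names, and the `ManagerLost` import path matches the traceback above (newer Parsl releases may relocate it).

```python
from parsl.executors.high_throughput.interchange import ManagerLost


def collect_result(future, task, requeue):
    """Hypothetical result collector: treat a lost manager as
    'retry the task', not 'task failed'."""
    try:
        return future.result()
    except ManagerLost:
        # The worker process/node went away (preemption, OOM, node
        # loss); this says nothing about the task itself, so put it
        # back on the pending queue instead of marking it failed.
        # A real implementation would also cap requeue attempts.
        requeue(task)
```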
To Reproduce
Kill a Parsl worker while it is in the middle of evaluating a task.
Expected behavior
We do want to know when a task repeatedly brings down a manager; however, we typically protect against this by running everything in separate processes, which makes it hard for a task to bring down a worker. At this point I consider the spurious failure label a greater issue than a task that repeatedly brings down a worker.
We could consider an option where a resource is marked as a low-priority (preemptible) queue, which expects workers to be continuously killed and tasks to always be restarted.
@yadudoc @benclifford Should this be a Parsl issue or something we handle on our end?
One more poke: @yadudoc @benclifford
@dgasmith Could you clarify whether you are worried about retrying/restarting tasks or workers?
If you want tasks to restart, you can set the retries option (https://parsl.readthedocs.io/en/stable/userguide/exceptions.html#retries), which retries a task N times before raising an error.
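For example, `retries` is set on the Parsl `Config`; a minimal sketch (the executor choice here is just a placeholder):

```python
import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

# Retry each failed task up to 3 times before the failure
# propagates to the caller as an exception.
config = Config(
    executors=[HighThroughputExecutor()],
    retries=3,
)
parsl.load(config)
```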
It may not be smart to enable worker auto-restarts because we currently cannot identify why a worker failed (the task killed the worker, OOM, failure to reach the network, etc.). In many of these cases the safer option is for the worker to fail, triggering a job failure that comes back to Parsl, rather than risk burning cluster allocation.
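One middle ground, assuming a Parsl release that exposes the `retry_handler` option on `Config`, is to make retries free only when the failure was a lost manager, so preempted workers never consume the task's retry budget while genuine task failures still fail fast. A hedged sketch (the `ManagerLost` import path matches the traceback above; newer releases move it to `parsl.executors.high_throughput.errors`):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.executors.high_throughput.interchange import ManagerLost


def retry_on_manager_lost(exception, task_record):
    # Parsl fails a task once its accumulated retry cost exceeds
    # `retries`. Charging zero for ManagerLost means a preempted
    # worker never exhausts the budget; any other exception
    # consumes one retry as usual.
    if isinstance(exception, ManagerLost):
        return 0
    return 1


config = Config(
    executors=[HighThroughputExecutor()],
    retries=2,
    retry_handler=retry_on_manager_lost,
)
```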