QCFractal
Parsl Worker Shutdown Task Failure
Describe the bug
If a Parsl worker is killed for any reason, we report this back as a task failure when the task should likely be restarted instead. This is especially harsh on interruptible (preemptible) queues.
raise ManagerLost(manager, self._ready_manager_queue[manager]['hostname'])
parsl.executors.high_throughput.interchange.ManagerLost: Task failure due to loss of Manager b'55fe9b2bb4f4' on host ca024
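As a stopgap on the QCFractal side, something like the following sketch could distinguish a lost manager from a genuine task failure when collecting results. This is hypothetical, not existing QCFractal code: `collect_result` and the `requeue` callback are made-up names, and the `ManagerLost` import path matches the traceback above (newer Parsl releases may relocate it).

```python
from parsl.executors.high_throughput.interchange import ManagerLost


def collect_result(future, task, requeue):
    """Hypothetical result collector: treat a lost manager as
    'retry the task', not 'task failed'."""
    try:
        return future.result()
    except ManagerLost:
        # The worker process/node went away (preemption, OOM, node
        # loss); this says nothing about the task itself, so put it
        # back on the pending queue instead of marking it failed.
        # A real implementation would also cap requeue attempts.
        requeue(task)
```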
To Reproduce
Kill a Parsl worker while it is in the middle of evaluating a task.
Expected behavior
We do want to know when a task repeatedly brings down a manager; however, we typically protect against this by running everything in separate processes, which makes it hard for a task to bring down a worker. At this point I consider the spurious failure label a greater issue than a task that repeatedly brings down a worker.
We could consider an option where a resource is marked as a low-priority (preemptible) queue, which expects workers to be continuously killed and tasks to always be restarted.
@yadudoc @benclifford Should this be a Parsl issue or something we handle on our end?
One more poke: @yadudoc @benclifford
@dgasmith Could you clarify whether you are worried about retrying/restarting tasks or workers?
If you want tasks to restart, you can set the retries option (https://parsl.readthedocs.io/en/stable/userguide/exceptions.html#retries), which retries a task N times before raising an error.
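For example, `retries` is set on the Parsl `Config`; a minimal sketch (the executor choice here is just a placeholder):

```python
import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

# Retry each failed task up to 3 times before the failure
# propagates to the caller as an exception.
config = Config(
    executors=[HighThroughputExecutor()],
    retries=3,
)
parsl.load(config)
```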
It may not be smart to enable worker auto-restarts because we currently cannot identify why a worker failed (the task killed the worker, OOM, failure to reach the network, etc.). In many of these cases the safer option is for the worker to fail, triggering a job failure that comes back to Parsl, rather than risk burning cluster allocation.
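One middle ground, assuming a Parsl release that exposes the `retry_handler` option on `Config`, is to make retries free only when the failure was a lost manager, so preempted workers never consume the task's retry budget while genuine task failures still fail fast. A hedged sketch (the `ManagerLost` import path matches the traceback above; newer releases move it to `parsl.executors.high_throughput.errors`):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.executors.high_throughput.interchange import ManagerLost


def retry_on_manager_lost(exception, task_record):
    # Parsl fails a task once its accumulated retry cost exceeds
    # `retries`. Charging zero for ManagerLost means a preempted
    # worker never exhausts the budget; any other exception
    # consumes one retry as usual.
    if isinstance(exception, ManagerLost):
        return 0
    return 1


config = Config(
    executors=[HighThroughputExecutor()],
    retries=2,
    retry_handler=retry_on_manager_lost,
)
```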