
`dispatch.run` is not resilient to worker loss

Open hendrikmakait opened this issue 1 year ago • 0 comments

`dispatch.run` uses worker restrictions to pin each task to the worker it should run on. If one of those workers is removed (or possibly restarted), the pinned task transitions to the `no-worker` state and remains there indefinitely (see https://github.com/dask/distributed/issues/7346). As far as I can see, no mechanism is implemented to prevent this.
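To illustrate the failure mode, here is a minimal toy model in plain Python. It is not the `distributed` scheduler and does not call `dispatch.run`; `ToyScheduler`, `Task`, and the worker addresses are hypothetical names made up for this sketch. It also assumes that a replacement worker joins under a new address, which is what leaves a strictly pinned task with no valid worker.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Task:
    key: str
    allowed_workers: Set[str]  # worker restrictions: addresses the task may run on
    state: str = "waiting"
    worker: Optional[str] = None


class ToyScheduler:
    """Hypothetical, heavily simplified stand-in for a scheduler."""

    def __init__(self, workers):
        self.workers = set(workers)
        self.tasks = {}

    def submit(self, key, allowed_workers):
        task = Task(key, set(allowed_workers))
        self.tasks[key] = task
        self._schedule(task)
        return task

    def _schedule(self, task):
        # Only workers that are both alive and allowed by the restriction qualify.
        candidates = sorted(task.allowed_workers & self.workers)
        if candidates:
            task.state, task.worker = "processing", candidates[0]
        else:
            task.state, task.worker = "no-worker", None

    def remove_worker(self, address):
        self.workers.discard(address)
        for task in self.tasks.values():
            if task.worker == address:
                self._schedule(task)

    def add_worker(self, address):
        self.workers.add(address)
        for task in self.tasks.values():
            if task.state == "no-worker":
                self._schedule(task)


scheduler = ToyScheduler(["tcp://a:1", "tcp://b:1"])
rank0 = scheduler.submit("rank-0", ["tcp://a:1"])
rank1 = scheduler.submit("rank-1", ["tcp://b:1"])

scheduler.remove_worker("tcp://b:1")
# Assumption: the replacement worker comes up under a different address.
scheduler.add_worker("tcp://c:1")

print(rank0.state, rank1.state)  # processing no-worker
```

In this model `rank-1` never leaves `no-worker` because nothing updates its restriction to include the new address; recovering would require something outside the task to notice the loss and resubmit or re-pin it.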

To circumvent this, dask-pytorch-ddp would probably also benefit from https://github.com/dask/distributed/issues/8624.

hendrikmakait, Jun 11 '24 12:06