dask-pytorch-ddp
`dispatch.run` is not resilient to worker loss
`dispatch.run` uses worker restrictions to pin each task to the worker it should be executed on. If a worker is removed (or possibly restarted), its task transitions to the `no-worker` state and remains there indefinitely (see https://github.com/dask/distributed/issues/7346). As far as I can see, no mechanism is implemented to prevent this.
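For illustration, here is a minimal sketch of the pinning pattern and of one way a caller could notice a lost worker. It is not the actual `dispatch.run` code: `submit_pinned` and `lost_workers` are hypothetical helper names, and the `(rank, world_size)` call signature of the training function is an assumption. The only things it relies on from `distributed.Client` are `submit(..., workers=..., allow_other_workers=False)` and `nthreads()`.

```python
def submit_pinned(client, fn, worker_addresses):
    """Submit one task per worker, restricted to exactly that worker.

    Hypothetical helper, not part of dask-pytorch-ddp. `client` is expected
    to behave like a `distributed.Client`.
    """
    world_size = len(worker_addresses)
    pinned = {}
    for rank, address in enumerate(worker_addresses):
        pinned[address] = client.submit(
            fn,
            rank,
            world_size,
            workers=[address],          # worker restriction: only this worker
            allow_other_workers=False,  # never fall back to another worker
            pure=False,                 # each rank is a distinct task
        )
    return pinned


def lost_workers(client, pinned):
    """Return the pinned worker addresses the scheduler no longer knows.

    A task restricted to one of these addresses cannot be scheduled, so the
    caller may want to cancel the remaining futures instead of waiting.
    """
    alive = set(client.nthreads())  # keys are worker addresses
    return set(pinned) - alive
```

With a real client, a caller could poll `lost_workers(client, pinned)` while waiting on the futures and call `client.cancel(list(pinned.values()))` when the result is non-empty. This only detects the problem; it does not reschedule the lost rank.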
To circumvent this, dask-pytorch-ddp would probably also benefit from https://github.com/dask/distributed/issues/8624.