Kevin Paul
> Yes, I can ignore the errors for now. I'm working on instructions for scientists using our local HPC resources to farm out work to the cluster using dask-mpi, and...
Sigh. @lgarrison: I've tracked down the errors to something beyond Dask-MPI. You can test whether you are seeing the same thing as me, but I'm seeing `CommClosedError`s during client shutdowns...
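For context, the shutdown pattern where I'm seeing the tracebacks looks roughly like this (a minimal sketch, not the exact reproduction; the scheduler address is a placeholder and this is plain `dask.distributed`, no Dask-MPI involved):

```python
# Minimal sketch of the shutdown sequence in question; assumes a scheduler is
# already running at the placeholder address below.
from dask.distributed import Client

client = Client("tcp://127.0.0.1:8786")     # connect to an existing scheduler
client.submit(lambda x: x + 1, 1).result()  # do a trivial bit of work
client.shutdown()                           # tear down workers and scheduler
```

On the versions where I see the problem, the errors show up during that final teardown step.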
I'm opening an upstream issue right now. First I'm seeing if I can figure out which Dask version introduced the regression; then I'll submit the issue and report it here.
There is definitely some strangeness produced by Python's `async` functions here. When tests are run in `dask-mpi`, for example, and an `async` error occurs at shutdown, the process still ends...
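To illustrate the kind of `async` behavior I mean, here is a standalone asyncio example (not Dask code): an exception raised in a task that nothing ever awaits is only logged, and the process still exits normally with status 0.

```python
# Standalone illustration: a failing fire-and-forget task does not change the
# process exit status; the exception is only logged ("Task exception was never
# retrieved") by the event loop's exception handler.
import asyncio

async def failing_cleanup():
    raise RuntimeError("error during shutdown")

async def main():
    asyncio.create_task(failing_cleanup())  # fire-and-forget, never awaited
    await asyncio.sleep(0.1)                # give the task time to run and fail

asyncio.run(main())
print("process reached the end normally")    # still prints; exit code is 0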
Ok. The Dask Distributed issue has been created (dask/distributed#7192). I'm not sure how much more I want to work on Dask-MPI until I hear back about that issue, lest I...
> Interesting, thanks! I was also able to reproduce this without dask-mpi following your instructions. I confirm that `client.retire_workers()` avoids the error, but when I add it to my reproduction...
(My thinking is that with a large number of workers, the `retire_workers` call could take quite a while to complete.)
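For reference, the workaround as I understand it is just to retire the workers explicitly before shutting down, something like the following (the scheduler address is a placeholder):

```python
# Sketch of the workaround: ask the scheduler to retire the workers before
# tearing everything down, so their connections are closed cleanly first.
from dask.distributed import Client

client = Client("tcp://127.0.0.1:8786")  # placeholder scheduler address
# ... do work ...
client.retire_workers()                  # gracefully retire the workers first
client.shutdown()                        # then shut down the scheduler
```

With many workers, that extra retirement round trip is the part I suspect could be slow.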
I do not know how to initialize an MPI environment without starting a new process. Every MPI implementation is different, and so every `mpirun`/`mpiexec` does something different when executed. Its...
@lgarrison: Yes. Dask-MPI is just a convenience function. As you point out, MPI is _not_ used for client-scheduler-worker communication. MPI is only used to communicate the scheduler address to the...
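Roughly, the idea looks like this (a sketch of the concept, not Dask-MPI's actual code; the address is a placeholder, and it assumes `mpi4py` running under `mpirun`/`mpiexec`):

```python
# Conceptual sketch: MPI is used once, to share the scheduler's address, and
# all subsequent scheduler/worker/client traffic is ordinary Dask networking.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Placeholder: rank 0 would start the scheduler and record its address here.
    scheduler_address = "tcp://node001:8786"
else:
    scheduler_address = None

# The only MPI communication: broadcast the address from rank 0 to every rank.
scheduler_address = comm.bcast(scheduler_address, root=0)

# The remaining ranks would then start workers (and one rank the client)
# pointing at scheduler_address, using plain Dask TCP from here on.
```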
I think it's possible, but we would need some OpenMPI or MPICH developers to chime in. Maybe an issue here: https://github.com/open-mpi/ompi?