Kevin Paul

107 comments by Kevin Paul

> Yes, I can ignore the errors for now. I'm working on instructions for scientists using our local HPC resources to farm out work to the cluster using dask-mpi, and...

Sigh. @lgarrison: I've tracked down the errors to something beyond Dask-MPI. You can test whether you are seeing the same thing as me; I'm seeing `CommClosedError`s during client shutdowns...
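If you want to try this yourself, something like the following exercises the same shutdown path without Dask-MPI at all. This is a rough sketch, not the exact reproducer (the worker counts and the trivial workload are placeholders); the point is just to connect a `Client`, do a little work, and watch the logs while `client.shutdown()` tears everything down.

```python
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # Small local cluster; sizes here are arbitrary placeholders.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    # Trivial work so the workers have open comms with the scheduler.
    futures = client.map(lambda x: x + 1, range(100))
    print(sum(client.gather(futures)))

    # The CommClosedError tracebacks (if you see them) appear while the
    # client, scheduler, and workers tear down.
    client.shutdown()
```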

I'm opening an upstream issue right now. I'm seeing if I can figure out which Dask version introduced the regression. Then I'll submit the issue and report it here.

There is definitely some strangeness produced by Python's `async` functions here. When tests are run in `dask-mpi`, for example, and an `async` error occurs at shutdown, the process still ends...
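As a toy illustration of why this can slip past a test harness (plain asyncio, nothing Dask-specific): an exception raised in a task that nothing ever awaits is only logged when the event loop tears down, and the interpreter still exits with status 0.

```python
import asyncio

async def doomed():
    raise RuntimeError("error during teardown")

async def main():
    # Fire-and-forget: nothing ever awaits this task, so its exception is
    # only reported ("Task exception was never retrieved") when the task is
    # garbage-collected as the loop shuts down.
    asyncio.ensure_future(doomed())
    await asyncio.sleep(0.1)

asyncio.run(main())
# A traceback is printed to stderr, but the exit status is still 0, so a
# harness that only checks the return code reports success.
```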

Ok. The Dask Distributed issue has been created (dask/distributed#7192). I'm not sure how much more I want to work on Dask-MPI until I hear back about that issue, lest I...

> Interesting, thanks! I was also able to reproduce this without dask-mpi following your instructions. I confirm that `client.retire_workers()` avoids the error, but when I add it to my reproduction...

(My thinking is that with a large number of workers, the `retire_workers` call could take quite a while to complete.)
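For reference, the workaround I have in mind looks roughly like this. It's only a sketch, and `graceful_close` is a hypothetical helper, not anything in Dask-MPI:

```python
from dask.distributed import Client

def graceful_close(client: Client) -> None:
    # Ask the scheduler to retire workers before the hard teardown; this is
    # what avoided the CommClosedError noise in our testing, but with a
    # large number of workers it may take a while to return.
    client.retire_workers()
    # Then shut down the scheduler/cluster as usual.
    client.shutdown()
```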

I do not know how to initialize an MPI environment without starting a new process. Every MPI implementation is different, and so every `mpirun`/`mpiexec` does something different when executed. Its...

@lgarrison: Yes. Dask-MPI is just a convenience function. As you point out, MPI is _not_ used for client-scheduler-worker communication. MPI is only used to communicate the scheduler address to the...
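To make that concrete, the role MPI plays is roughly the following. This is an illustration with mpi4py, not Dask-MPI's actual code, and the address string is a placeholder (in reality it comes from the scheduler started on rank 0):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Placeholder: the real code reads this from the running scheduler.
    scheduler_address = "tcp://10.0.0.1:8786"
else:
    scheduler_address = None

# The only MPI communication involved: sharing the address string.
scheduler_address = comm.bcast(scheduler_address, root=0)
print(f"rank {rank} will connect to the scheduler at {scheduler_address}")
```

After that broadcast, all client-scheduler-worker traffic goes over Dask's normal TCP comms, not MPI.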

I think it becomes possible, but we would need some OpenMPI or MPICH developers to chime in. Maybe an issue here: https://github.com/open-mpi/ompi?