Add `Ireduce!` and `Iallreduce!`
Added basic nonblocking reductions, along with some tests.
Hmm... only the CUDA tests fail. I feared this was a Julia-side issue, but no: `MPI_Iallreduce` is simply not CUDA-aware in OpenMPI.
This is a known issue (https://github.com/open-mpi/ompi/issues/9845), and it also affects `MPI_Ireduce` (https://github.com/open-mpi/ompi/issues/12045).
I was also able to reproduce it in C.
It is very surprising to me that the ROCm support apparently covers all non-blocking ops, but not CUDA.
What would be the best course of action? Merge anyway and let users stumble into an unhelpful segfault? Or would a warning (emitted when OpenMPI is in use and CUDA is loaded) be enough?
Yeah, we don't currently have a good mechanism to declare which operations can and cannot take GPU memory. It seems even worse for OpenMPI, since the set of supported operations depends on whether or not UCX is used.
We certainly need to branch in the tests, but I don't think we have prior art for this.
@simonbyrne any ideas?
Unfortunately it is probably implementation (and configuration) dependent, so I don't think we can provide a complete solution. My best suggestion would be to make it so the test suite can soft fail and report which operations are supported.
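One possible shape for that (just a sketch: `gpu_op_supported` and the probe string are made-up names, and the exact `Iallreduce!` signature is whatever this PR ends up defining) is to probe each operation in a throwaway subprocess, so a segfault inside the MPI library only fails the probe rather than the whole test run:

```julia
using Test

# Hypothetical helper: run a small probe script in a separate Julia process and
# report whether it completed successfully. A crash (e.g. a segfault in a
# non-CUDA-aware MPI) just makes the probe return false.
function gpu_op_supported(probe_code::AbstractString)
    cmd = `$(Base.julia_cmd()) --project -e $probe_code`
    return success(run(ignorestatus(cmd)))
end

# Assumes an in-place MPI.Iallreduce!(buf, op, comm) analogous to MPI.Allreduce!.
const IALLREDUCE_CUDA_PROBE = """
using MPI, CUDA
MPI.Init()
buf = CUDA.ones(Float64, 4)
req = MPI.Iallreduce!(buf, +, MPI.COMM_WORLD)
MPI.Wait(req)
"""

if gpu_op_supported(IALLREDUCE_CUDA_PROBE)
    @testset "Iallreduce! with CUDA buffers" begin
        # ... the real CUDA tests go here ...
    end
else
    @info "Skipping Iallreduce! CUDA tests: unsupported by this MPI build"
end
```

The subprocess indirection matters because the failure mode here is a crash rather than a Julia exception, so a plain try/catch inside the test process would not survive it.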
If you want something easy that does work, the simplest option is to spawn the regular blocking operation on a separate thread:

```julia
task = Threads.@spawn MPI.Allreduce(sendbuf, op, comm)
# other work
wait(task)
```
If your other work also involves MPI operations, you will additionally need to call `MPI.Init(threadlevel=:multiple)`.
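For completeness, a minimal self-contained sketch of that pattern (the buffer contents, the `+` op, and the launch flags are only illustrative):

```julia
# Run with Julia threads enabled, e.g.: mpiexec -n 4 julia -t 2 script.jl
using MPI

# :multiple is only needed if the "other work" below also makes MPI calls.
MPI.Init(threadlevel=:multiple)
comm = MPI.COMM_WORLD

sendbuf = fill(Float64(MPI.Comm_rank(comm)), 4)

# The blocking Allreduce runs on another thread, mimicking a nonblocking call.
task = Threads.@spawn MPI.Allreduce(sendbuf, +, comm)

# ... other work runs here, concurrently with the reduction ...

result = fetch(task)   # fetch returns the reduced array; wait(task) only blocks
@show result
```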