Explicitly set device id for each MPI call
This PR abstracts the device guard mechanic from the cuda/hip module and makes it available in other modules. The `Executor` gets a new abstract method `get_scoped_device_id` (name not finalized), which returns an object with the same semantics as the older cuda/hip device guards: during the lifetime of the returned object, the device id is set to the one stored in the executor. If that concept does not apply (reference/omp/dpcpp?), the object has no effect.
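For illustration, a minimal sketch of the intended semantics (the method name is explicitly not finalized, so the call below is illustrative rather than the final API):

```cpp
// Minimal sketch: the guard object activates the executor's device id for its
// lifetime and restores the previous device id on destruction.
auto exec = gko::CudaExecutor::create(1, gko::OmpExecutor::create());
{
    auto guard = exec->get_scoped_device_id();  // device 1 becomes active
    // ... raw CUDA/HIP or MPI calls that rely on the current device id ...
}  // previous device id restored here
```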
In the mpi wrapper, the `communicator` now stores an executor. As a result, it is no longer possible to implicitly convert from `MPI_Comm` to `communicator`. The `window` and `request` classes also store an executor. Using the stored executor, all MPI calls are guarded to use the correct device id, if necessary.
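A hypothetical sketch of this interface as described here (the constructor shape is my assumption, and later comments in this thread switch to passing the executor per call instead):

```cpp
// Hypothetical: the communicator stores the executor, so it has to be
// constructed explicitly instead of being implicitly converted from MPI_Comm.
auto exec = gko::CudaExecutor::create(0, gko::OmpExecutor::create());
gko::mpi::communicator comm(MPI_COMM_WORLD, exec);  // explicit construction
// gko::mpi::communicator comm = MPI_COMM_WORLD;    // no longer possible
```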
Note: Currently this is based on distributed-solvers, but I would rebase it onto distributed-develop after it has been reviewed, to push forward merging the distributed stuff into develop.
format!
format!
format!
As a small comment, or to begin a discussion: I think we might need to explain the Ginkgo `communicator` a bit better now, because our wrapper deviates from MPI in terms of logic, even though it makes perfect sense from an implementation point of view. We are now tying two independent things together: the MPI communicator, which is a list of processes (and a shared global view), and the local device that the user wants the current process to execute on.
There's also the question of whether we expect these executors to be unique when a device ID is involved within the communicators (process ranks must be unique within a communicator), and maybe a list of other things that might become confusing when tying these two concepts together.
I don't think we need to impose anything on the device IDs. Most (?) GPUs allow you to run multiple MPI processes on the same GPU, but of course with less efficiency. So as long as the device id is valid, the program will run correctly. To run efficiently, the user has to take care of setting the device id correctly. I hope the `map_rank_to_device_id` can help with that.
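A sketch of how that could look from the user's side; the namespace and exact signature of `map_rank_to_device_id` are assumptions based on this discussion:

```cpp
#include <ginkgo/ginkgo.hpp>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    // assumed: map_rank_to_device_id derives a valid device id from the
    // node-local rank, so ranks sharing a node spread over the local GPUs
    const auto num_devices = gko::CudaExecutor::get_num_devices();
    const auto device_id =
        gko::mpi::map_rank_to_device_id(MPI_COMM_WORLD, num_devices);
    auto exec =
        gko::CudaExecutor::create(device_id, gko::OmpExecutor::create());
    // ... set up distributed objects on exec ...
    MPI_Finalize();
}
```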
Now that I think of it, we could also add the executor to each communicator method. Then the communicator class would again be just a wrapper, without combining the executor.
@upsj I'm also leaning towards using the executor as a function argument. This will require quite some rework...
> I don't think we need to impose anything on the device IDs. Most (?) GPUs allow you to run multiple MPI processes on the same GPU, but of course with less efficiency. So as long as the device id is valid, the program will run correctly. To run efficiently, the user has to take care of setting the device id correctly. I hope the `map_rank_to_device_id` can help with that.
I generally agree, I was mostly raising this point as an example of something which isn't obvious, from a conceptual point of view, when looking at this new combined communicator.
While that would require some rework, I think that having the executor as an argument to the communicator functions would feel more natural. E.g. when sending or receiving data, or synchronizing, it feels natural to me that you might need an object representing the memory location/processor being used.
I've now switched to passing an executor to each MPI call that uses a buffer. For now, I didn't add the executor (and thus the device id guard) to synchronization MPI calls (barrier, wait, ...). At least for the barrier that shouldn't matter, but I still have to test whether wait works correctly.
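A sketch of the resulting call style, under the assumption that buffer-based wrapper calls take the owning executor as their first argument (names and signatures here are illustrative, not the finalized interface):

```cpp
// device_id e.g. obtained via map_rank_to_device_id as sketched above
auto exec = gko::CudaExecutor::create(device_id, gko::OmpExecutor::create());
gko::mpi::communicator comm(MPI_COMM_WORLD);

// buffers live in the executor's memory space
gko::array<double> send_buf(exec, 100);
gko::array<double> recv_buf(exec, 100);
if (comm.rank() == 0) {
    // the executor argument lets the wrapper set the correct device id
    // before issuing the underlying MPI call
    comm.send(exec, send_buf.get_const_data(), 100, /* dest */ 1, /* tag */ 0);
} else if (comm.rank() == 1) {
    comm.recv(exec, recv_buf.get_data(), 100, /* source */ 0, /* tag */ 0);
}
```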
I've dismissed the reviews because of the change.
Error: The following files need to be formatted:
core/base/mpi.cpp
core/base/scoped_device_id.cpp
core/distributed/vector.cpp
cuda/test/base/scoped_device_id.cu
hip/test/base/scoped_device_id.hip.cpp
include/ginkgo/core/base/combination.hpp
include/ginkgo/core/base/composition.hpp
include/ginkgo/core/base/scoped_device_id.hpp
include/ginkgo/core/matrix/sellp.hpp
include/ginkgo/core/preconditioner/isai.hpp
include/ginkgo/core/preconditioner/jacobi.hpp
include/ginkgo/ginkgo.hpp
test/mpi/distributed/matrix.cpp
test/mpi/solver/solver.cpp
You can find a formatting patch under Artifacts here or run `format!` if you have write access to Ginkgo.
Note: This PR changes the Ginkgo ABI:
Functions changes summary: 8 Removed, 624 Changed (42832 filtered out), 1196 Added functions
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable
For details check the full ABI diff under Artifacts here
@pratikvn Regarding the performance: I tested this quite a while ago on Crusher and didn't notice any notable difference. I've rerun a single benchmark, and it still shows the same results, so I would say that this has no negative performance impact.
format!
rebase!
format!
format!
Note: This PR changes the Ginkgo ABI:
Functions changes summary: 8 Removed, 624 Changed (42832 filtered out), 1184 Added functions
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable
For details check the full ABI diff under Artifacts here