
Explicitly set device id for each MPI call

Open MarcelKoch opened this issue 2 years ago • 12 comments

This PR abstracts the device guard mechanism from the cuda/hip module and makes it available in other modules. The Executor gets a new abstract method get_scoped_device_id (name not finalized), which returns an object with the same semantics as the old cuda/hip device guards. This means that during the lifetime of the returned object, the device id is set to the one stored in the executor. If that concept does not apply (reference/omp/dpcpp?), the object has no effect.
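To illustrate the semantics, here is a minimal sketch of the RAII idea only (not Ginkgo's actual class, and using CUDA as an example backend):

```cpp
#include <cuda_runtime.h>

// Minimal sketch of the guard semantics: activate the given device on
// construction and restore the previously active device on destruction.
class scoped_device_id {
public:
    explicit scoped_device_id(int device_id)
    {
        cudaGetDevice(&original_device_id_);
        cudaSetDevice(device_id);
    }

    ~scoped_device_id() { cudaSetDevice(original_device_id_); }

    scoped_device_id(const scoped_device_id&) = delete;
    scoped_device_id& operator=(const scoped_device_id&) = delete;

private:
    int original_device_id_{};
};
```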

In the mpi wrapper, the communicator now stores an executor. As a result, it is no longer possible to implicitly convert from MPI_Comm to communicator. The window and request classes also store an executor. Using the stored executor, all MPI calls are guarded to use the correct device id, if necessary.
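A rough sketch of how the guarded wrapper could look (simplified and hypothetical; the real class stores a full executor rather than just a device id, and the guard class is the one sketched above):

```cpp
#include <mpi.h>

// Simplified illustration: the communicator keeps the device id taken from
// its executor and activates it around every MPI call that touches a buffer.
class communicator {
public:
    // explicit constructor: no implicit conversion from MPI_Comm anymore
    explicit communicator(MPI_Comm comm, int device_id)
        : comm_(comm), device_id_(device_id)
    {}

    void send(const double* buffer, int count, int dest, int tag) const
    {
        scoped_device_id guard(device_id_);  // guard from the sketch above
        MPI_Send(buffer, count, MPI_DOUBLE, dest, tag, comm_);
    }

private:
    MPI_Comm comm_;
    int device_id_;
};
```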

Note: Currently this is based on distributed-solvers, but I would rebase it onto distributed-develop after it has been reviewed, to push forward merging the distributed stuff into develop.

MarcelKoch avatar Jun 14 '22 15:06 MarcelKoch

format!

MarcelKoch avatar Jun 14 '22 15:06 MarcelKoch

format!

MarcelKoch avatar Jul 19 '22 08:07 MarcelKoch

format!

MarcelKoch avatar Aug 05 '22 12:08 MarcelKoch

As a small comment, or to begin a discussion: I think we might need to explain the Ginkgo communicator a bit better now, because our wrapper deviates from MPI in terms of logic, even though it makes perfect sense from an implementation point of view. We are now tying two independent things together: the MPI communicator, which is a list of processes (and a shared global view), and the local device that the user wants the current process to execute on.

There's also the question of whether we expect these executors to be unique when a device ID is involved within a communicator (process ranks must be unique within a communicator), and maybe a list of other things that might become confusing when tying these two concepts together.

tcojean avatar Aug 24 '22 12:08 tcojean

> There's also the question of whether we expect these executors to be unique when a device ID is involved within a communicator (process ranks must be unique within a communicator), and maybe a list of other things that might become confusing when tying these two concepts together.

I don't think we need to impose anything on the device IDs. Most (?) GPUs allow you to run multiple MPI processes on the same GPU, but of course with less efficiency. So as long as the device id is valid, the program will run correctly. To run efficiently, the user has to take care of setting the device id correctly. I hope map_rank_to_device_id can help with that.
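For context, such a mapping could be implemented roughly as follows (a sketch only, not necessarily how map_rank_to_device_id is actually implemented): split the communicator by shared-memory node and assign devices round-robin by node-local rank.

```cpp
#include <mpi.h>

// Sketch: map each rank to a device via its node-local rank modulo the
// number of devices on the node; several ranks may end up sharing a device.
int map_rank_to_device_id(MPI_Comm comm, int num_devices)
{
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    int local_rank = 0;
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_free(&node_comm);
    return local_rank % num_devices;
}
```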

MarcelKoch avatar Aug 24 '22 15:08 MarcelKoch

Now that I think of it, we could also add the executor to each communicator method. Then the communicator class would again be just a wrapper, without being combined with the executor.

MarcelKoch avatar Aug 24 '22 15:08 MarcelKoch

@upsj I'm also leaning towards using the executor as a function argument. This will require quite some rework...

MarcelKoch avatar Aug 24 '22 19:08 MarcelKoch

> I don't think we need to impose anything on the device IDs. Most (?) GPUs allow you to run multiple MPI processes on the same GPU, but of course with less efficiency. So as long as the device id is valid, the program will run correctly. To run efficiently, the user has to take care of setting the device id correctly. I hope map_rank_to_device_id can help with that.

I generally agree; I was mostly raising this point as an example of something that isn't obvious when looking at this new combined communicator from a conceptual point of view.

While that would require some rework, I think that having the executor as an argument to the communicator functions would feel more natural. E.g., when sending or receiving data or synchronizing, it feels natural to me that you might need an object representing the memory location/processor being used.

tcojean avatar Aug 25 '22 09:08 tcojean

I've now switched to passing an executor to each MPI call that uses a buffer. For now, I didn't add the executor (and thus the device id guard) to synchronization MPI calls (barrier, wait, ...). At least for the barrier that shouldn't matter, but I have to test if wait works correctly.
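To make the change concrete, a hypothetical per-call variant could look like this (illustrative names and signatures only; the guard is the one sketched in the PR description):

```cpp
#include <mpi.h>

// Sketch: the caller passes the device id (taken from the executor) only to
// calls that hand a buffer to MPI; synchronization calls stay unguarded.
void send(int device_id, const double* buffer, int count, int dest, int tag,
          MPI_Comm comm)
{
    scoped_device_id guard(device_id);  // guard sketched earlier
    MPI_Send(buffer, count, MPI_DOUBLE, dest, tag, comm);
}

void barrier(MPI_Comm comm) { MPI_Barrier(comm); }  // no buffer, no guard
```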

I've dismissed the reviews because of the change.

MarcelKoch avatar Aug 25 '22 14:08 MarcelKoch

Error: The following files need to be formatted:

core/base/mpi.cpp
core/base/scoped_device_id.cpp
core/distributed/vector.cpp
cuda/test/base/scoped_device_id.cu
hip/test/base/scoped_device_id.hip.cpp
include/ginkgo/core/base/combination.hpp
include/ginkgo/core/base/composition.hpp
include/ginkgo/core/base/scoped_device_id.hpp
include/ginkgo/core/matrix/sellp.hpp
include/ginkgo/core/preconditioner/isai.hpp
include/ginkgo/core/preconditioner/jacobi.hpp
include/ginkgo/ginkgo.hpp
test/mpi/distributed/matrix.cpp
test/mpi/solver/solver.cpp

You can find a formatting patch under Artifacts here or run format! if you have write access to Ginkgo

ginkgo-bot avatar Sep 20 '22 14:09 ginkgo-bot

Note: This PR changes the Ginkgo ABI:

Functions changes summary: 8 Removed, 624 Changed (42832 filtered out), 1196 Added functions
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable

For details check the full ABI diff under Artifacts here

ginkgo-bot avatar Sep 20 '22 15:09 ginkgo-bot

@pratikvn Regarding the performance, I tested this quite a while ago on crusher and didn't see any notable difference. I've rerun a single benchmark, and it still shows the same results, so I would say this has no negative performance impact.

MarcelKoch avatar Sep 26 '22 13:09 MarcelKoch

format!

MarcelKoch avatar Sep 28 '22 15:09 MarcelKoch

rebase!

MarcelKoch avatar Sep 28 '22 15:09 MarcelKoch

format!

MarcelKoch avatar Sep 30 '22 08:09 MarcelKoch

format!

MarcelKoch avatar Oct 04 '22 13:10 MarcelKoch

Note: This PR changes the Ginkgo ABI:

Functions changes summary: 8 Removed, 624 Changed (42832 filtered out), 1184 Added functions
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable

For details check the full ABI diff under Artifacts here

ginkgo-bot avatar Oct 04 '22 14:10 ginkgo-bot