Is having more than one GPU initialized UB in standard CUDA-aware MPI?
To be precise: is the following code example, which maps two GPUs to a single MPI rank, valid, or is it undefined behavior in standard CUDA-aware MPI?
In particular, note the part
int otherRank = myrank == 0 ? 1 : 0;
cudaSetDevice(otherRank);
cudaSetDevice(myrank);
where I set a CUDA device that I do not use in that rank before correctly setting the device that I want to map to the MPI rank for MPI invocations. According to e.g. these docs https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#when-do-i-need-to-select-a-cuda-device, nothing is specified that tells me this is invalid.
Here is the full example:
#include <cuda_runtime.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
  int myrank;
  float *val_device, *val_host;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

  /* Touch the other rank's GPU first, then select the device that is
     actually meant to be mapped to this rank. */
  int otherRank = myrank == 0 ? 1 : 0;
  cudaSetDevice(otherRank);
  cudaSetDevice(myrank);

  int num = 1000000;
  val_host = (float *)malloc(sizeof(float) * num);
  cudaMalloc((void **)&val_device, sizeof(float) * num);

  for (int i = 0; i < 1; i++) {
    *val_host = -1.0;
    if (myrank != 0) {
      if (i == 0)
        printf("%s %d %s %f\n", "I am rank", myrank,
               "and my initial value is:", *val_host);
    }
    if (myrank == 0) {
      *val_host = 42.0;
      cudaMemcpy(val_device, val_host, sizeof(float), cudaMemcpyHostToDevice);
      if (i == 0)
        printf("%s %d %s %f\n", "I am rank", myrank,
               "and will broadcast value:", *val_host);
    }
    /* Broadcast directly from/to device memory (CUDA-aware MPI). */
    MPI_Bcast(val_device, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
    if (myrank != 0) {
      cudaMemcpy(val_host, val_device, sizeof(float), cudaMemcpyDeviceToHost);
      if (i == 0)
        printf("%s %d %s %f\n", "I am rank", myrank,
               "and received broadcasted value:", *val_host);
    }
  }

  cudaFree(val_device);
  free(val_host);
  MPI_Finalize();
  return 0;
}
Then execute the above using two MPI ranks on a node with at least two GPUs available.
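As an aside, the reproducer deliberately omits all error checking. When investigating behavior like this, one might wrap the CUDA runtime calls in a small checking macro along the following lines (a sketch only, not part of the original code; the macro name is arbitrary and it reuses the headers already included above):

/* Sketch of a hypothetical checking macro: report the failing CUDA call
   and abort all ranks. */
#define CUDA_CHECK(call)                                            \
  do {                                                              \
    cudaError_t err_ = (call);                                      \
    if (err_ != cudaSuccess) {                                      \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
              cudaGetErrorString(err_), __FILE__, __LINE__);        \
      MPI_Abort(MPI_COMM_WORLD, 1);                                 \
    }                                                               \
  } while (0)

/* Usage, e.g.: CUDA_CHECK(cudaSetDevice(myrank)); */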
It seems that the above example leads to data leaks for all CUDA-aware MPI implementations (Cray MPI, Open MPI/MPICH built with UCX or OpenFabrics (I think, but am not certain)), and this is not specific to e.g. MPI_Bcast; using MPI_Send etc. has the same effect.
I think it might be implied by a few things that I have read that standard MPI does not support mapping multiple GPUs to a single rank, and that things like MPI endpoints or NCCL (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html#example-3-multiple-devices-per-thread) might be used in such a case; a rough sketch of that NCCL pattern is included below.
However, I cannot find anything explicitly stating this, so it would be good if someone could confirm it.
If not, then the above reproducer can serve as a bug report for this case.
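For reference, here is a rough sketch of the multiple-devices-per-rank pattern from the linked NCCL example. The choice of devsPerRank = 2, the buffer handling, and the omission of error checking are all illustrative; the NCCL calls themselves (ncclGetUniqueId, ncclCommInitRank, ncclBroadcast inside ncclGroupStart/ncclGroupEnd) follow the documented grouped API:

#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>

int main(int argc, char *argv[]) {
  int myrank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  enum { devsPerRank = 2 };  /* illustrative: two GPUs managed per MPI rank */

  /* All ranks need the same NCCL unique id; distribute it via MPI. */
  ncclUniqueId id;
  if (myrank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comms[devsPerRank];
  cudaStream_t streams[devsPerRank];
  float *buf[devsPerRank];

  /* One NCCL communicator per local device; the global NCCL rank is
     (MPI rank * devsPerRank + local device index). */
  ncclGroupStart();
  for (int d = 0; d < devsPerRank; d++) {
    cudaSetDevice(d);
    cudaStreamCreate(&streams[d]);
    cudaMalloc((void **)&buf[d], sizeof(float));
    ncclCommInitRank(&comms[d], nranks * devsPerRank, id,
                     myrank * devsPerRank + d);
  }
  ncclGroupEnd();

  /* Broadcast from global NCCL rank 0 to every device of every MPI rank. */
  ncclGroupStart();
  for (int d = 0; d < devsPerRank; d++) {
    ncclBroadcast(buf[d], buf[d], 1, ncclFloat, 0, comms[d], streams[d]);
  }
  ncclGroupEnd();

  for (int d = 0; d < devsPerRank; d++) {
    cudaSetDevice(d);
    cudaStreamSynchronize(streams[d]);
    ncclCommDestroy(comms[d]);
    cudaFree(buf[d]);
    cudaStreamDestroy(streams[d]);
  }

  MPI_Finalize();
  return 0;
}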
Thanks
You are correct: most MPI libraries are optimized for one GPU per process. Open MPI might work with multiple GPUs, depending on what exactly you are doing, but it will certainly not be optimal (temporary memory or even the MPI Op might be located on the wrong GPU).
However, standard MPI has nothing to do with GPU support; this is a quality-of-implementation property of the MPI libraries (and not of the MPI standard).
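For illustration, the usual one-GPU-per-process setup amounts to selecting the device from the node-local rank before any MPI traffic touches CUDA buffers. A minimal sketch (the shared-memory communicator split is just one common way to obtain the local rank):

#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);

  /* Ranks on the same node share an MPI_COMM_TYPE_SHARED communicator,
     so their rank within it enumerates them per node. */
  MPI_Comm local_comm;
  int local_rank;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &local_comm);
  MPI_Comm_rank(local_comm, &local_rank);

  /* Bind exactly one device to this rank before any MPI call that
     uses CUDA buffers. */
  int ndevices;
  cudaGetDeviceCount(&ndevices);
  cudaSetDevice(local_rank % ndevices);

  /* ... CUDA-aware MPI communication on buffers allocated on this device ... */

  MPI_Comm_free(&local_comm);
  MPI_Finalize();
  return 0;
}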