ompi
Collective CUDA operations have a dangerous pointer operation that could corrupt memory
Background information
While reviewing another PR, I noticed that several routines (mca_coll_cuda_scan, mca_coll_cuda_reduce, mca_coll_cuda_allreduce, mca_coll_cuda_reduce_scatter_block, and mca_coll_cuda_exscan) have similar blocks of code:
    if ((MPI_IN_PLACE != sbuf) && (opal_cuda_check_bufs((char *)sbuf, NULL))) {
        sbuf1 = (char*)malloc(bufsize);
        if (NULL == sbuf1) {
            return OMPI_ERR_OUT_OF_RESOURCE;
        }
        opal_cuda_memcpy_sync(sbuf1, sbuf, bufsize);
        sbuf2 = sbuf; /* save away original buffer */
        sbuf = sbuf1 - gap;
    }
If the gap value is > 0, the memory before the malloc'ed buffer will be overwritten.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
The current main branch, as of Sept. 1, 2022
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
Details of the problem
(Above)
I don't think this is the case. The gap at the beginning of the datatype is never accessed; this usage is a trick to avoid allocating the entire extent of the datatype and instead allocate only the true extent.
Well, the 'gap' variable is not used if it's a host memory buffer, but it does modify the buffer pointers in the GPU memory case. So this seems to be an uncommon case (a GPU memory buffer) that probably still has a problem; the code path (gap > 0) just hasn't been exercised yet.
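To make the disagreement concrete, here is a minimal sketch in plain C of the offset arithmetic behind the "gap" trick. It is not Open MPI code: layout_t, block_offset, and all_accesses_in_bounds are hypothetical names. It assumes the datatype's data starts at a true lower bound of true_lb bytes past the buffer pointer, and that the temporary block holds only true_extent bytes. It works on integer offsets rather than real pointers, so it never forms an out-of-range pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a datatype layout: the data occupies bytes
 * [true_lb, true_lb + true_extent) relative to the buffer pointer. */
typedef struct {
    ptrdiff_t true_lb;      /* unused bytes before the first data byte */
    size_t    true_extent;  /* bytes that actually hold data */
} layout_t;

/* Offset of the i-th data byte relative to the start of the malloc'ed
 * block, when the working pointer is set to (block - gap). */
static ptrdiff_t block_offset(const layout_t *l, ptrdiff_t gap, size_t i)
{
    ptrdiff_t shifted_base = -gap;   /* models: sbuf = sbuf1 - gap */
    return shifted_base + l->true_lb + (ptrdiff_t)i;
}

/* Returns 1 if every data byte lands inside a block of true_extent bytes,
 * 0 if any access would fall before or after the block. */
static int all_accesses_in_bounds(const layout_t *l, ptrdiff_t gap)
{
    for (size_t i = 0; i < l->true_extent; i++) {
        ptrdiff_t off = block_offset(l, gap, i);
        if (off < 0 || (size_t)off >= l->true_extent) {
            return 0;
        }
    }
    return 1;
}
```

Under these assumptions, a layout with true_lb = 16 and gap = 16 keeps every access inside the block, which is the behavior the first comment describes. If the consumer of the shifted pointer instead starts at offset 0 (true_lb = 0 while gap = 16), the first access lands 16 bytes before the block, which is the corruption the issue is worried about. Whether the GPU code path always pairs the shifted pointer with a matching lower bound is the open question in this thread.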