
CUDA synchronization

simonbyrne opened this issue 3 years ago • 12 comments

The failure on #439 is due to CUDA synchronization issues. From what I understand, the changes in https://github.com/JuliaGPU/CUDA.jl/pull/395 mean that streams are no longer globally synchronized. Since Open MPI operates its own streams, operations can potentially overlap (https://github.com/open-mpi/ompi/issues/7733).

It seems like we need to call CUDA.synchronize() (or perhaps CUDA.synchronize(CuStreamPerThread())?) before calling MPI?
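
Roughly, the pattern would be something like this before each MPI call (untested sketch; buf, dest, tag and comm are placeholders):

using MPI, CUDA

# ... kernel launches or copies that fill `buf` are queued on CUDA.jl's streams ...
CUDA.synchronize()             # block until all queued GPU work has finished
MPI.Send(buf, dest, tag, comm) # Open MPI then operates on `buf` with its own streams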

@vchuravy @maleadt do you have any suggestions/thoughts?

simonbyrne avatar Nov 17 '20 05:11 simonbyrne

CUDA.synchronize() synchronizes the context, so that should do the job. CUDA.synchronize(CuStreamPerThread()) only synchronizes the per-thread one, and I assume MPI doesn't use that stream. Can't you configure MPI to use the CuStreamPerThread stream?

maleadt avatar Nov 17 '20 09:11 maleadt

Yeah, we need to synchronize the CUDA stream on which the async HtoD memcpy is launched, so I think the right thing to do is to CUDA.synchronize(CuStreamPerThread()) before MPI ops. After staring at the Open MPI code, I think they are doing it right, since they block CPU progress on the Gather until their CUDA events have finished.

vchuravy avatar Nov 17 '20 14:11 vchuravy

CUDA.synchronize(CuStreamPerThread())

Don't use the per-thread stream specifically, just use the global one. It automatically 'translates' to the per-thread one, because we use per-thread-specific API calls. You only want to use the per-thread stream object when interfacing with binary libraries, like CUBLAS, that do not use the per-thread-specific API versions.

maleadt avatar Nov 17 '20 15:11 maleadt

Would that still work if we were calling MPI on multiple threads?

simonbyrne avatar Nov 17 '20 16:11 simonbyrne

Would that still work if we were calling MPI on multiple threads?

Depends on which operations you want to make sure have happened. If it's only about operations on that thread, then yes. If not, you need a global synchronization: either use synchronize(), which synchronizes the entire context, or (IIUC) call synchronize(CuStreamLegacy()), which performs old-style synchronization of the global stream and will also synchronize all other streams (those not created with an explicit NONBLOCK flag).
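
Concretely, the two global options would look like this (sketch, using the same names as above):

# Option 1: synchronize the entire context (all streams, on all threads)
synchronize()

# Option 2: old-style synchronization of the global (legacy default) stream;
# this also waits on all other streams not created with the NONBLOCK flag
synchronize(CuStreamLegacy())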

maleadt avatar Nov 17 '20 17:11 maleadt

What I was thinking of was something where each thread creates its own CuArray, executes a kernel, and then passes it to MPI:

using MPI, CUDA

# N, kernel, dest, tag and comm are assumed to be defined elsewhere
Threads.@threads for i in 1:Threads.nthreads()
    data = CuArray{Float32}(undef, N)
    @cuda blocks=1 threads=64 kernel(data, N)
    synchronize(CuDefaultStream()) # what should this be?
    MPI.Isend(data, dest, tag, comm)
end

simonbyrne avatar Nov 17 '20 18:11 simonbyrne

CuDefaultStream(). All other operations use it too, and in the presence of threads it will, behind the scenes, translate to a per-thread stream.
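
So your sketch would become (with N, kernel, dest, tag and comm defined as before):

Threads.@threads for i in 1:Threads.nthreads()
    data = CuArray{Float32}(undef, N)
    @cuda blocks=1 threads=64 kernel(data, N)
    synchronize(CuDefaultStream()) # translated to the per-thread stream on each thread
    MPI.Isend(data, dest, tag, comm)
end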

maleadt avatar Nov 17 '20 20:11 maleadt

Thanks. So I guess the question then is should we call this by default whenever a CuArray is passed as a buffer? Or leave it to the user?

simonbyrne avatar Nov 17 '20 21:11 simonbyrne

Related comments in: https://github.com/JuliaParallel/MPI.jl/pull/266

I considered this when I contributed code for CUDA-aware MPI, not knowing about the default global synchronization in CUDA.jl. One of the core contributors for this package was against making stream synchronization a default.

kose-y avatar May 25 '21 16:05 kose-y

I think the decision was that we should document it, since there wasn't a clear way to make this work consistently with minimal overhead.

simonbyrne avatar May 25 '21 17:05 simonbyrne

Just as a quick update: the right thing to do now is CUDA.synchronize(), or, to be explicit, CUDA.synchronize(CUDA.stream()).
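
For example (untested sketch against current CUDA.jl and MPI.jl; data, dest, tag and comm as in the earlier example):

fill!(data, 1f0)                       # queue GPU work on the task-local stream
CUDA.synchronize()                     # equivalently: CUDA.synchronize(CUDA.stream())
req = MPI.Isend(data, dest, tag, comm) # safe: the GPU work on `data` has completed
MPI.Wait!(req)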

vchuravy avatar Apr 16 '22 13:04 vchuravy

I've fixed the tests; we probably still need a more prominent documentation example.

simonbyrne avatar May 31 '22 05:05 simonbyrne