Question about how OMPI handles OFI HMEM and completion semantics
Hey, guys.
A question came up about how the OMPI OFI MTL and BTL ensure that CUDA/HMEM buffers are completely in sync at the end of a data transfer. Looking at the code, the BTL requests FI_DELIVERY_COMPLETE, which is a straightforward answer, but the MTL does not. How does OMPI handle the risk that a transfer could be complete from the CPU's point of view but not yet flushed to the GPU's memory?
@open-mpi/ofi This seems like a question for you guys, as well as those discussing the accelerator framework.
I'm not sure FI_DELIVERY_COMPLETE is important to the completion semantics for HMEM buffers. FI_DELIVERY_COMPLETE is requested by the BTL because of one-sided completion semantics, not the HMEM bits.
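For context, a minimal, hedged sketch of how a libfabric consumer asks for delivery-complete semantics in its fi_getinfo() hints, roughly along the lines of what the OFI BTL does for one-sided operations; this is an illustration, not a copy of the actual btl/ofi code.

```c
#include <rdma/fabric.h>

/* Illustration only: request delivery-complete semantics plus RMA and
 * accelerator (HMEM) buffer support in the fi_getinfo() hints. */
static struct fi_info *make_hints(void)
{
    struct fi_info *hints = fi_allocinfo();
    if (NULL == hints) {
        return NULL;
    }

    /* Completions are not reported until the data has been delivered to the
     * target memory, not merely accepted by a NIC along the way. */
    hints->tx_attr->op_flags |= FI_DELIVERY_COMPLETE;
    hints->caps |= FI_RMA | FI_HMEM;

    return hints;
}
```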
The short answer is that peer-to-peer transactions and network completions are a mess, and the details really depend on the accelerator in use. For NVIDIA GPUs, the answer is that someone has to set the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute on registered memory. Right now, the assumption is that the Libfabric provider will do that (which mirrors what UCX does), which likely makes sense from a completion-semantics point of view. Without setting that attribute, the provider can't really make any statements about data delivery. But I'm sure we could also make an argument that OMPI should be setting the attribute (it does not today).
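For anyone following along, setting that attribute looks roughly like the sketch below, whichever layer ends up doing it; enable_sync_memops is just a hypothetical helper, not an existing OMPI or provider function.

```c
#include <assert.h>
#include <cuda.h>

/* Hypothetical helper: force synchronous memory operations on a device
 * allocation before it is exposed to GPUDirect RDMA. */
static void enable_sync_memops(void *device_ptr)
{
    int flag = 1;

    /* With CU_POINTER_ATTRIBUTE_SYNC_MEMOPS set, asynchronous memcpy/memset
     * operations on this allocation become synchronous, so a network
     * completion cannot race with an in-flight memop. */
    CUresult rc = cuPointerSetAttribute(&flag,
                                        CU_POINTER_ATTRIBUTE_SYNC_MEMOPS,
                                        (CUdeviceptr) device_ptr);
    assert(CUDA_SUCCESS == rc);
}
```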
@bwbarrett
According to @jdinan in this discussion, https://github.com/aws/aws-ofi-nccl/pull/152, it seems that both CU_POINTER_ATTRIBUTE_SYNC_MEMOPS and a flush are needed to ensure data consistency.
I think it is reasonable to expect libfabric to set CU_POINTER_ATTRIBUTE_SYNC_MEMOPS (given that UCX sets it).
Should Open MPI issue the flush?
> Should Open MPI issue the flush?
So, that's kind of the question I'm struggling with. For PSM3, we've been assuming we were doing sufficient work to maintain consistency, and Libfabric has no mechanism for issuing a flush operation. But then we encountered NCCL, which tries to force a 0-length RDMA read to the localhost via fi_read(), and that doesn't work over PSM3, at least not the way NCCL is opening libfabric.
I've been told, internally, that it is the provider's job to do this zero-length read but, in that case, why is NCCL trying to do it itself?
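For reference, the NCCL-style flushing read is roughly the following hedged sketch; the endpoint, completion queue, registered host bounce buffer, and the address/key for the GPU buffer are all assumed to have been set up elsewhere, and providers may or may not support a zero-length read like this.

```c
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_rma.h>

/* Hypothetical sketch of a NCCL-style flushing read: a 0-length fi_read()
 * from the GPU buffer back to a registered host bounce buffer, issued only
 * for the ordering side effect of the PCIe read response. */
static int flushing_read(struct fid_ep *ep, struct fid_cq *cq,
                         void *host_bounce, void *host_desc,
                         fi_addr_t self_addr, uint64_t gpu_buf_addr,
                         uint64_t gpu_buf_key)
{
    struct fi_cq_entry entry;
    ssize_t rc;

    /* Post the zero-length RDMA read targeting our own GPU buffer. */
    do {
        rc = fi_read(ep, host_bounce, 0, host_desc, self_addr,
                     gpu_buf_addr, gpu_buf_key, NULL);
    } while (-FI_EAGAIN == rc);
    if (0 != rc) {
        return (int) rc;
    }

    /* Wait for the read completion before telling anyone the data is
     * visible to the GPU. */
    do {
        rc = fi_cq_read(cq, &entry, 1);
    } while (-FI_EAGAIN == rc);

    return rc < 0 ? (int) rc : 0;
}
```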
That comment does not match what UCX does, nor the CUDA documentation.
> That comment does not match what UCX does, nor the CUDA documentation.
Which part?
The need for a flushing read. NCCL needs a flushing read because it doesn't follow https://docs.nvidia.com/cuda/gpudirect-rdma/index.html. I believe we expect MPI applications to follow the host CUDA call rules of that doc, and Libfabric/MPI to set the SYNC_MEMOPS attribute on receive buffers. I did not find a flushing read in the UCX source code, but I'd be happy for someone to point it out to me.
As Brian said, MPI applications follow the GDR rules above, so CUDA deals with the GPU's memory-model gotchas for you. NCCL bends some of these rules -- notably, it has a kernel already running on the GPU at the time data is written to device memory via GPUDirect RDMA. NVIDIA GPUs don't reflect PCIe write-after-write ordering to the SM (WAW ordering is obeyed for PCIe accesses to GPU memory). The PCIe read tells the GPU to make the kernel's view of memory consistent with the PCIe view before an NCCL thread on the CPU sets a flag (a PCIe write) telling the kernel that it's safe to read the data. Yes, this is a weird side effect of a PCIe read. We added a new API, cudaDeviceFlushGPUDirectRDMAWrites, that applications can use to get this consistency. However, that API is blocking, while an RDMA read can be nonblocking (as in NCCL's iFlush), and a nonblocking flush can help hide the overhead of the memory fence.
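A hedged sketch of the blocking variant, assuming CUDA 11.3+ and a platform that supports GPUDirect RDMA write flushing; signal_ready() is a hypothetical placeholder for however the CPU tells the already-running kernel that the data is safe to read.

```c
#include <cuda_runtime.h>

/* Hypothetical placeholder for the CPU-side "data is ready" signal. */
extern void signal_ready(void);

static void flush_then_signal(void)
{
    /* Blocking fence: make GPUDirect RDMA writes to this device visible to
     * work running on it before we raise the flag. */
    cudaDeviceFlushGPUDirectRDMAWrites(
        cudaFlushGPUDirectRDMAWritesTargetCurrentDevice,
        cudaFlushGPUDirectRDMAWritesToOwner);

    signal_ready();
}
```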
CU_POINTER_ATTRIBUTE_SYNC_MEMOPS is an entirely different problem that comes from CUDA optimizations that make memcpy/memset asynchronous. That optimization assumes all accesses come through CUDA and can be ordered with respect to async memory operations. GPUDirect RDMA violates that assumption, so in a sequence like (1) cudaMemset(buf, ...) (2) MPI_Recv(buf, ...) there is a race between the memset and the receive. Forcing memops to be synchronous fixes this race.
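Concretely, the race looks something like the hypothetical sketch below (error handling omitted); setting CU_POINTER_ATTRIBUTE_SYNC_MEMOPS on the allocation, as in the earlier sketch, removes it.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical receive path into a device buffer.  If the allocation does
 * not have CU_POINTER_ATTRIBUTE_SYNC_MEMOPS set, the asynchronous
 * cudaMemset() can still be in flight when the GPUDirect RDMA write behind
 * MPI_Recv() lands, and the memset may clobber the received data. */
static void recv_into_device_buffer(void *d_buf, int count, int src,
                                    MPI_Comm comm)
{
    cudaMemset(d_buf, 0, (size_t) count);            /* (1) async memop    */
    MPI_Recv(d_buf, count, MPI_BYTE, src, 0, comm,   /* (2) GPUDirect RDMA */
             MPI_STATUS_IGNORE);                     /*     write          */
}
```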
@jdinan - thanks. You've made the whole thing so much clearer for me. Have you looked at https://github.com/aws/aws-ofi-nccl/pull/152? The reason I ask is that the NCCL maintainers are claiming problems with cudaDeviceFlushGPUDirectRDMAWrites on some network stacks.
@mwheinz Yes, I'm involved in this work. There's an issue in the libfabric stack that we haven't been able to identify. We aren't able to reproduce the cudaDeviceFlushGPUDirectRDMAWrites hang with the InfiniBand NCCL network module. Would be glad to work with you on debugging this, as we don't have a good setup to reproduce it in house.