Mystery with memory registration
A CPU memory registration (MR) that happens between two GPU MR operations becomes about 20 times slower. When target_list is registered together with the CPU send/recv buffers, the registration takes about 60 us; when it is registered between GPU buffer registrations, it takes about 1.4 ms. target_list is always mapped to CPU memory, so the difference is surprisingly large.
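For reference, here is a minimal sketch of how the measurement can be reproduced, assuming a GPUDirect RDMA setup where ibv_reg_mr() accepts a cudaMalloc'd pointer (i.e., the nv_peer_mem module is loaded); device selection, buffer sizes, and error handling are simplified placeholders:

```c
/* Sketch: time a CPU ibv_reg_mr() sandwiched between two GPU
 * registrations, mirroring the GPU/CPU/GPU pattern described above.
 * Assumes nv_peer_mem is loaded; error handling is abbreviated. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

static struct ibv_mr *timed_reg(struct ibv_pd *pd, void *buf,
                                size_t len, const char *label)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (!mr) { perror("ibv_reg_mr"); exit(1); }
    printf("%s: %.1f us\n", label,
           (t1.tv_sec - t0.tv_sec) * 1e6 +
           (t1.tv_nsec - t0.tv_nsec) / 1e3);
    return mr;
}

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t len = 1 << 20;                  /* 1 MiB, placeholder size */
    void *gpu1, *gpu2, *cpu = malloc(len); /* cpu plays the role of target_list */
    cudaMalloc(&gpu1, len);
    cudaMalloc(&gpu2, len);

    struct ibv_mr *m1 = timed_reg(pd, gpu1, len, "GPU MR #1");
    struct ibv_mr *m2 = timed_reg(pd, cpu,  len, "CPU MR (middle)");
    struct ibv_mr *m3 = timed_reg(pd, gpu2, len, "GPU MR #2");

    ibv_dereg_mr(m1); ibv_dereg_mr(m2); ibv_dereg_mr(m3);
    return 0;
}
```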
With the registration cache, this is not an issue for our system, but I am documenting it for others who might see something similar. I am also curious why this happens, though few resources exist for investigating it further.
Even with the registration cache, the timing shows that the MR of target_list takes about 1.4 ms, and that becomes the major overhead for connection setup. :(
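For anyone hitting the same thing, this is roughly the kind of registration cache meant above; a minimal sketch where mr_cache_entry, cached_reg_mr(), and the fixed linear-scan table are all illustrative, not the actual gpunet code:

```c
/* Sketch: a tiny registration cache that reuses an existing MR
 * when the same (address, length) is registered again, so the
 * ~1.4 ms ibv_reg_mr() cost is paid only once per buffer. */
#include <stddef.h>
#include <infiniband/verbs.h>

#define CACHE_SLOTS 64

struct mr_cache_entry {
    void *addr;
    size_t len;
    struct ibv_mr *mr;
};

static struct mr_cache_entry cache[CACHE_SLOTS];
static int cache_used;

struct ibv_mr *cached_reg_mr(struct ibv_pd *pd, void *addr, size_t len)
{
    for (int i = 0; i < cache_used; i++)
        if (cache[i].addr == addr && cache[i].len == len)
            return cache[i].mr;            /* hit: skip ibv_reg_mr() */

    struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr && cache_used < CACHE_SLOTS) {  /* miss: remember the new MR */
        cache[cache_used].addr = addr;
        cache[cache_used].len = len;
        cache[cache_used].mr = mr;
        cache_used++;
    }
    return mr;
}
```

A real cache also has to invalidate entries when a buffer is freed or its pages remapped, which is where most of the complexity lives.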
Hi, GPUDirect is very effective but limited.
First, data transfer using GPUDirect from a GPU to a NIC is much slower than from a NIC to a GPU. NVIDIA recognizes this problem, but it does not seem to be completely fixed:
http://devblogs.nvidia.com/parallelforall/benchmarking-gpudirect-rdma-on-modern-server-platforms/
https://github.com/NVIDIA/gdrcopy
Therefore, some systems use GPUDirect only in the NIC-to-GPU direction (e.g., MVAPICH [Potluri et al., ICPP '13]).
Furthermore, the data transfer mechanism of GPUDirect differs from that of cudaMemcpyAsync() (and cudaMemcpy() as well): GPUDirect uses PCIe peer-to-peer transactions, which the general copy mechanisms do not. http://docs.nvidia.com/cuda/gpudirect-rdma/
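To make the distinction concrete, here is a sketch of the two paths, assuming nv_peer_mem is loaded for the direct registration; register_paths() and the omitted send posting are illustrative only:

```c
/* Sketch: two ways to move GPU data to the NIC.
 * (a) staged: copy GPU -> pinned host buffer with the GPU's copy
 *     engine, then RDMA out of host memory;
 * (b) GPUDirect: register the device pointer itself and let the
 *     NIC read GPU memory over PCIe peer-to-peer. */
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

void register_paths(struct ibv_pd *pd, size_t len)
{
    void *dev_buf, *host_buf;
    cudaMalloc(&dev_buf, len);
    cudaHostAlloc(&host_buf, len, cudaHostAllocDefault); /* pinned */

    /* (a) staged path: the MR covers host memory only */
    struct ibv_mr *host_mr =
        ibv_reg_mr(pd, host_buf, len, IBV_ACCESS_LOCAL_WRITE);
    cudaMemcpyAsync(host_buf, dev_buf, len, cudaMemcpyDeviceToHost, 0);
    cudaStreamSynchronize(0);
    /* ...post a send/RDMA write from host_mr... */

    /* (b) GPUDirect path: the MR covers device memory directly */
    struct ibv_mr *dev_mr =
        ibv_reg_mr(pd, dev_buf, len, IBV_ACCESS_LOCAL_WRITE);
    /* ...post a send/RDMA write from dev_mr; the NIC DMAs
       straight out of GPU memory via PCIe peer-to-peer... */
    (void)host_mr; (void)dev_mr;
}
```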
I think these limitations caused your problem.