
Mystery with memory registration

Open · sangmank opened this issue on May 6, 2015 · 2 comments

A CPU memory registration (MR) performed in the middle of two GPU MR operations is about 20 times slower than normal. When target_list is registered together with the CPU send/recv buffers, the registration takes about 60 us; when it sits between GPU buffer registrations, it takes about 1.4 ms. target_list always resides in CPU memory, so a difference this large is surprising.
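For concreteness, here is a minimal sketch of the experiment (my reconstruction, not code from this issue), assuming an open verbs protection domain `pd` and a GPUDirect RDMA setup in which `ibv_reg_mr()` accepts `cudaMalloc`'d pointers; buffer names and sizes are illustrative:

```c
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

void time_cpu_mr_between_gpu_mrs(struct ibv_pd *pd, size_t len)
{
    int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
    void *gpu_a, *gpu_b;
    void *target_list = malloc(len);        /* always host memory */
    cudaMalloc(&gpu_a, len);
    cudaMalloc(&gpu_b, len);

    struct ibv_mr *mr_a = ibv_reg_mr(pd, gpu_a, len, access);
    double t0 = now_us();
    struct ibv_mr *mr_t = ibv_reg_mr(pd, target_list, len, access);
    double t1 = now_us();                   /* ~1.4 ms in this position */
    struct ibv_mr *mr_b = ibv_reg_mr(pd, gpu_b, len, access);

    printf("CPU MR between two GPU MRs: %.1f us\n", t1 - t0);

    ibv_dereg_mr(mr_a); ibv_dereg_mr(mr_t); ibv_dereg_mr(mr_b);
    cudaFree(gpu_a); cudaFree(gpu_b); free(target_list);
}
```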

With a registration cache this would not be an issue for our system, but I am documenting it for other people who might see something similar. I am also curious why this happens, but there is not much material available for investigating it further.
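For reference, the registration-cache idea is essentially a lookup that reuses an existing MR instead of calling `ibv_reg_mr()` again. A minimal sketch with illustrative names (a real cache, such as the pin-down caches in MPI stacks, also handles partial overlaps and invalidation on free):

```c
#include <infiniband/verbs.h>
#include <stddef.h>

#define CACHE_SLOTS 64

struct mr_cache_entry { void *addr; size_t len; struct ibv_mr *mr; };
static struct mr_cache_entry cache[CACHE_SLOTS];

/* Return a cached MR covering (addr, len), registering on a miss. */
struct ibv_mr *cached_reg_mr(struct ibv_pd *pd, void *addr, size_t len,
                             int access)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].mr && cache[i].addr == addr && cache[i].len >= len)
            return cache[i].mr;              /* hit: skip ibv_reg_mr() */

    struct ibv_mr *mr = ibv_reg_mr(pd, addr, len, access);
    for (int i = 0; mr && i < CACHE_SLOTS; i++)
        if (!cache[i].mr) {                  /* first free slot */
            cache[i] = (struct mr_cache_entry){ addr, len, mr };
            break;
        }
    return mr;
}
```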

sangmank commented on May 6, 2015, 19:05

Even with the registration cache, the timing shows that the MR of target_list still takes about 1.4 ms. That becomes the major overhead of connection setup. :(
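One possible mitigation (my suggestion, not something proposed in the thread) is to pay the slow registration once at initialization, before any connection is accepted, so the ~1.4 ms cost is off the connection-setup path:

```c
#include <infiniband/verbs.h>
#include <stddef.h>

static struct ibv_mr *target_list_mr;   /* reused by every connection */

/* Call once at startup, before accepting connections, so the slow
 * registration of target_list happens outside connection setup. */
void warm_target_list_mr(struct ibv_pd *pd, void *target_list, size_t len)
{
    target_list_mr = ibv_reg_mr(pd, target_list, len,
                                IBV_ACCESS_LOCAL_WRITE |
                                IBV_ACCESS_REMOTE_WRITE);
}
```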

sangmank commented on May 6, 2015, 19:05

Hi, GPUDirect is very effective but has limitations.

First, data transfer using GPUDirect from a GPU to a NIC is much slower than from a NIC to a GPU. NVIDIA has acknowledged this problem, but it does not seem to have been fixed completely:
http://devblogs.nvidia.com/parallelforall/benchmarking-gpudirect-rdma-on-modern-server-platforms/
https://github.com/NVIDIA/gdrcopy
For this reason, some systems use GPUDirect only in the NIC-to-GPU direction (e.g., MVAPICH [Potluri et al., ICPP '13]).
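For completeness, the linked gdrcopy library works around part of this asymmetry by mapping GPU memory into a CPU-visible BAR1 pointer so the CPU can drive small copies itself. A rough sketch of its usual call sequence (error handling and GPU-page alignment omitted; see the gdrcopy README for the authoritative API):

```c
#include <gdrapi.h>
#include <cuda_runtime.h>
#include <stdint.h>

/* Copy a small region from GPU memory to a host buffer through a BAR1
 * mapping.  d_ptr must come from cudaMalloc() and, per the gdrcopy
 * docs, be aligned to the GPU page size. */
void gdr_read_small(void *d_ptr, void *h_buf, size_t size)
{
    gdr_t g = gdr_open();                 /* talks to the gdrdrv module */
    gdr_mh_t mh;
    gdr_pin_buffer(g, (unsigned long)(uintptr_t)d_ptr, size, 0, 0, &mh);

    void *bar_ptr;
    gdr_map(g, mh, &bar_ptr, size);       /* CPU mapping of GPU memory */
    gdr_copy_from_mapping(mh, h_buf, bar_ptr, size);

    gdr_unmap(g, mh, bar_ptr, size);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
}
```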

Furthermore, the data-transfer mechanism of GPUDirect differs from that of cudaMemcpyAsync() (and, of course, cudaMemcpy()). GPUDirect uses a PCIe facility (peer-to-peer transactions) that the general copy mechanisms do not use:
http://docs.nvidia.com/cuda/gpudirect-rdma/
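To illustrate the contrast: the conventional (non-GPUDirect) send path stages data through a pinned host bounce buffer using the GPU's copy engines, and the NIC then reads from host memory, whereas GPUDirect lets the NIC read GPU memory over PCIe directly. A minimal sketch of the staged path, with illustrative buffer and stream names:

```c
#include <cuda_runtime.h>

/* Stage device data into a pinned host buffer (allocated with
 * cudaMallocHost()) so a NIC can send it from host memory; this is
 * the copy-engine path that GPUDirect bypasses. */
void staged_send_prepare(const void *d_src, void *h_pinned, size_t size,
                         cudaStream_t stream)
{
    /* GPU copy engine: device -> pinned host memory */
    cudaMemcpyAsync(h_pinned, d_src, size, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  /* h_pinned is now ready for the NIC */
    /* ... post the RDMA send from h_pinned here ... */
}
```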

I think these limitations caused your problem.

rmatsumiya commented on Dec 28, 2015, 01:12