
Assertion "(check_gdr_support(dev)) == (true)" failed

Open schloegl opened this issue 4 years ago • 1 comments

I'm trying to install nvshmem on a machine with several GeForce RTX 3090 GPUs running Debian 10.10, with nvidia-cuda-toolkit from backports (CUDA 11.2.2). As indicated by the software requirements of nvshmem [1], I've installed MLNX_OFED (v5.3-1.0.0.1, from source), nv_peer_mem (v1.1-0), GDRcopy (v2.3), and nvshmem (v2.2.1).

The machine has an HDR (200 Gb/s) InfiniBand card:

# lspci |grep -i mell
c2:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

The installation of all packages went fine, but all GDRcopy tests fail with this kind of error:

$ apiperf 
...
selecting device 0
Assertion "(check_gdr_support(dev)) == (true)" failed at apiperf.cpp:262

# copybw
...
selecting device 0
testing size: 131072
rounded size: 131072
Assertion "(check_gdr_support(dev)) == (true)" failed at copybw.cpp:253

When running these tests, syslog reports this kind of error:

Jul 7 13:48:56 myhost kernel: [169336.898388] gdrdrv:__gdrdrv_pin_buffer:nvidia_p2p_get_pages(va=7ff512200000 len=65536 p2p_token=0 va_space=0) failed [ret = -22]
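For readers decoding the log line above: kernel functions conventionally return negated errno values, so `ret = -22` corresponds to EINVAL ("Invalid argument"), meaning `nvidia_p2p_get_pages` rejected the pin request. A minimal Python sketch of that lookup follows; `decode_ret` is a hypothetical helper, and it assumes the `[ret = -N]` suffix format shown in the log line.

```python
import errno
import os
import re

# Message part of the syslog line quoted above.
LOG_LINE = (
    "gdrdrv:__gdrdrv_pin_buffer:nvidia_p2p_get_pages("
    "va=7ff512200000 len=65536 p2p_token=0 va_space=0) failed [ret = -22]"
)


def decode_ret(line):
    """Return (code, errno name, message) for a '[ret = -N]' suffix, or None."""
    match = re.search(r"\[ret = (-?\d+)\]", line)
    if match is None:
        return None
    code = int(match.group(1))
    # Assumption: the value is a negated errno, as is conventional in the kernel.
    name = errno.errorcode.get(abs(code), "UNKNOWN")
    return code, name, os.strerror(abs(code))


print(decode_ret(LOG_LINE))
```

The errno name only says that the request was considered invalid; it does not by itself say why the driver refused it.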

Using nvshmem also fails. Trying to run the nvshmemHelloWorld.cu example fails with

src/comm/transports/ibrc/ibrc.cpp: NULL value get_device_list failed

I've also tried to install nvshmem without support for GDRcopy, but that does not change anything; GDRcopy seems to be a strict requirement.

I've also followed the FAQ [2], and confirm that nv_peer_mem and gdrdrv are loaded:


#  lsmod | grep nv_peer_mem
nv_peer_mem            16384  0
ib_core               352256  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              34119680  198 nvidia_uvm,nv_peer_mem,gdrdrv,nvidia_modeset

#  lsmod | grep gdr
gdrdrv                 24576  0
nvidia              34119680  198 nvidia_uvm,nv_peer_mem,gdrdrv,nvidia_modeset
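The same check can be scripted. The sketch below is one possible way to do it, assuming `lsmod`-style text in which the first field of each line is the module name; `loaded_modules` and `missing_modules` are hypothetical helper names, and the sample text is the `lsmod` output quoted above.

```python
# lsmod output as quoted above (module name, size, use count, users).
SAMPLE_LSMOD = """\
nv_peer_mem            16384  0
ib_core               352256  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
gdrdrv                 24576  0
nvidia              34119680  198 nvidia_uvm,nv_peer_mem,gdrdrv,nvidia_modeset
"""


def loaded_modules(lsmod_text):
    """Return the set of module names found in lsmod-style text."""
    return {
        line.split()[0]
        for line in lsmod_text.splitlines()
        if line.strip()
    }


def missing_modules(required, lsmod_text):
    """Return the required module names that are not loaded, sorted."""
    return sorted(set(required) - loaded_modules(lsmod_text))


print(missing_modules(["nv_peer_mem", "gdrdrv"], SAMPLE_LSMOD))
```

On a live system the text could come from `subprocess.run(["lsmod"], capture_output=True, text=True).stdout`. An empty result only shows that the modules are loaded, not that the GPU supports GPUDirect RDMA.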

Do you have any suggestions as to what could be wrong and how to solve this?

[1] https://docs.nvidia.com/hpc-sdk/nvshmem/install-guide/index.html
[2] https://docs.nvidia.com/hpc-sdk/nvshmem/api/docs/faq.html

schloegl · Jul 07 '21 11:07

Hi @schloegl,

GDRCopy relies on the GPUDirect RDMA technology. It needs a Quadro- or Tesla-class GPU. In other words, GeForce is not supported.
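As a rough pre-flight check, one could list the GPU names and flag any GeForce boards before running the tests. The sketch below is only a name-based heuristic and an assumption on my part; it is not how `check_gdr_support` decides. `gpu_names`, `is_geforce`, and `unsupported_gpus` are hypothetical helpers, and the two sample names are illustrative strings.

```python
import subprocess


def gpu_names():
    """Ask nvidia-smi for the product name of each GPU, one per line."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def is_geforce(name):
    """Heuristic (assumption): treat any name containing 'GeForce' as unsupported."""
    return "geforce" in name.lower()


def unsupported_gpus(names):
    """Return the names that the heuristic flags as GeForce-class."""
    return [name for name in names if is_geforce(name)]


# Illustrative names; on a real machine use unsupported_gpus(gpu_names()).
print(unsupported_gpus(["NVIDIA GeForce RTX 3090", "Tesla V100-PCIE-16GB"]))
```

A non-empty list suggests the GDRCopy tests are likely to fail on those devices with the assertion shown in this issue.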

I guess from the context that you want GDRCopy because of NVSHMEM. I suggest that you contact them to see if you can install NVSHMEM without GDRCopy or disable it. You can mention that you are using GeForce and they should immediately understand the problem.

pakmarkthub · Jul 08 '21 00:07