Assertion "(check_gdr_support(dev)) == (true)" failed
I'm trying to install nvshmem on a machine with several GeForce RTX 3090 GPUs, running Debian 10.10 with nvidia-cuda-toolkit from backports (CUDA 11.2.2). As indicated by the software requirements of nvshmem [1], I've installed MLNX_OFED (v5.3-1.0.0.1, from source), nv_peer_mem (v1.1-0), GDRcopy (v2.3), and nvshmem (v2.2.1).
The machine has an HDR (200 Gb/s) InfiniBand card:
# lspci |grep -i mell
c2:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
The installation of all packages went fine, but all of the GDRcopy tests fail with this kind of error:
$ apiperf
...
selecting device 0
Assertion "(check_gdr_support(dev)) == (true)" failed at apiperf.cpp:262
# copybw
...
selecting device 0
testing size: 131072
rounded size: 131072
Assertion "(check_gdr_support(dev)) == (true)" failed at copybw.cpp:253
When running these tests, syslog reports this kind of error:
Jul 7 13:48:56 myhost kernel: [169336.898388] gdrdrv:__gdrdrv_pin_buffer:nvidia_p2p_get_pages(va=7ff512200000 len=65536 p2p_token=0 va_space=0) failed [ret = -22]
Using nvshmem also fails. Trying to run the nvshmemHelloWorld.cu example fails with:
src/comm/transports/ibrc/ibrc.cpp: NULL value get_device_list failed
I've also tried to install nvshmem without GDRcopy support, but that does not change anything; GDRcopy seems to be a strict requirement.
I've also followed the FAQ [2], and I can confirm that nv_peer_mem and gdrdrv are loaded:
# lsmod | grep nv_peer_mem
nv_peer_mem 16384 0
ib_core 352256 9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia 34119680 198 nvidia_uvm,nv_peer_mem,gdrdrv,nvidia_modeset
# lsmod | grep gdr
gdrdrv 24576 0
nvidia 34119680 198 nvidia_uvm,nv_peer_mem,gdrdrv,nvidia_modeset
Do you have any suggestions as to what could be wrong and how to solve this?
[1] https://docs.nvidia.com/hpc-sdk/nvshmem/install-guide/index.html
[2] https://docs.nvidia.com/hpc-sdk/nvshmem/api/docs/faq.html
Hi @schloegl,
GDRCopy relies on the GPUDirect RDMA technology. It needs a Quadro- or Tesla-class GPU. In other words, GeForce is not supported.
I guess from the context that you want GDRCopy because of NVSHMEM. I suggest that you contact the NVSHMEM team to see whether you can install NVSHMEM without GDRCopy or disable it. If you mention that you are using GeForce, they should immediately understand the problem.
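For anyone hitting the same assertion, a quick way to see which class of GPU is installed is sketched below. It lists the product names reported by nvidia-smi and flags any that contain "GeForce". The helper name classify_gpu and the name-matching rule are assumptions made for illustration; this is only a heuristic based on the product name, not an official capability check, and it does not query GPUDirect RDMA support directly.

```shell
#!/bin/sh
# Hypothetical helper (not part of GDRCopy): classify a GPU by product name.
# Assumption: a name containing "GeForce" means GPUDirect RDMA is unavailable;
# anything else still needs to be checked against NVIDIA's documentation.
classify_gpu() {
    case "$1" in
        *GeForce*) echo "unsupported" ;;
        *)         echo "check-docs" ;;
    esac
}

# Print one line per installed GPU, but only if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name --format=csv,noheader |
    while IFS= read -r name; do
        printf '%s: %s\n' "$name" "$(classify_gpu "$name")"
    done
fi
```

On the machine described above, every RTX 3090 would be expected to come out as "unsupported", which is consistent with the check_gdr_support assertion failing in the GDRCopy tests.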