Failure to access mapped memory from the CPU side (data_validation tests fail)
Hi there, I am running the sanity test and I got this error in the data_validation test:
word content expected
13 a5a5a5a5 3f4c7e6a
14 a5a5a5a5 3f4c1e6a
15 a5a5a5a5 3f4cde6a
16 a5a5a5a5 3f4d5e6a
17 a5a5a5a5 3f4e5e6a
18 a5a5a5a5 3f485e6a
19 a5a5a5a5 3f445e6a
20 a5a5a5a5 3f5c5e6a
21 a5a5a5a5 3f6c5e6a
22 a5a5a5a5 3f0c5e6a
I debugged this myself and found that in the function init_hbuf_walking_bit, buf_ptr cannot be written: its content is always 0xffffffff, and the corresponding GPU memory is never changed (0xa5a5a5a5 in this case). There are no other errors during the whole process; all function calls return success.
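For reference, here is my own condensed sketch of the path I believe the test exercises, based on the public gdrapi.h flow (illustration only, not the actual sanity.cpp code; error handling omitted, and I assume cuMemAlloc returns a GPU-page-aligned buffer):

/* Condensed sketch of the gdrcopy mapping flow (build flags may differ):
 *   gcc sketch.c -lgdrapi -lcuda */
#include <stdio.h>
#include <stdint.h>
#include <cuda.h>
#include <gdrapi.h>

#define SIZE (64 * 1024) /* one 64 KiB GPU page */

int main(void)
{
    CUdevice dev; CUcontext ctx; CUdeviceptr d_ptr;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&d_ptr, SIZE);

    gdr_t g = gdr_open();                      /* talks to /dev/gdrdrv */
    gdr_mh_t mh;
    gdr_pin_buffer(g, d_ptr, SIZE, 0, 0, &mh); /* nvidia_p2p_get_pages under the hood */

    void *map_ptr = NULL;
    gdr_map(g, mh, &map_ptr, SIZE);            /* BAR1 space mapped into user space */

    uint32_t pattern = 0x3f4c5e6a;
    /* CPU-side write through the mapping; this is the step that fails for me */
    gdr_copy_to_mapping(mh, map_ptr, &pattern, sizeof(pattern));

    /* Read back through the normal CUDA path to validate */
    uint32_t check = 0;
    cuMemcpyDtoH(&check, d_ptr, sizeof(check));
    printf("wrote %08x, read back %08x\n", pattern, check);

    gdr_unmap(g, mh, map_ptr, SIZE);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cuMemFree(d_ptr);
    cuCtxDestroy(ctx);
    return 0;
}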
Here are the settings on my server (bare-metal machine): OS: Ubuntu 18.04.6 LTS, kernel: 4.15.0-213-generic, GPU: Tesla T4, driver:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:21:00.0 Off | 0 |
| N/A 29C P8 14W / 70W | 4MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
$ ofed_info -s
MLNX_OFED_LINUX-5.4-3.4.0.0:
module:
$ lsmod|grep nv
nvidia_peermem 16384 0
nvidia_uvm 1216512 6
ib_core 311296 10 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_iser,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia_drm 57344 0
nvidia_modeset 1241088 1 nvidia_drm
nvidia 56418304 49 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
drm_kms_helper 172032 2 mgag200,nvidia_drm
drm 401408 6 drm_kms_helper,nvidia,mgag200,nvidia_drm,ttm
Hi @cxinyic,
- What is your CPU?
- How do you connect GPU to CPU (directly to root complex or via a PCIe switch)?
- You seem to have a NIC. Have you tried GPUDirect RDMA with your NIC to see if you observe any data corruption?
Hi,
- AMD EPYC 7313 16-Core Processor
$ nvidia-smi topo -m
GPU0 NIC0 CPU Affinity NUMA Affinity
GPU0 X NODE 0-15,32-47 0
NIC0 NODE X
- Yes. I have a ConnectX-5 NIC. My ultimate goal is to enable GPUDirect with the RDMA NIC so that I can access the GPU memory of a remote server directly through the NIC. But I have not found any concrete examples of how to use GPUDirect RDMA, so I decided to first test whether it is possible to map GPU memory to the CPU, which is why I tried gdrcopy. If possible, could you point me to some examples of how to use GPUDirect RDMA with the NIC? I have two servers connected by ConnectX-5 NICs.
- Does GPUDirect support my current environment settings?
GPUDirect requires all components in the path to work correctly. May I ask you to check the following?
- What is the GDRCopy version you are using? If it's not v2.4, please upgrade to that.
- Which flavor of the NVIDIA driver are you using, the open-source or the proprietary one? You can just post the output of modinfo nvidia here.
- Please try gdrcopy_pplat. Does it run to completion without errors? It should take just a few seconds to finish; otherwise, it is likely hung.
- Please try GPUDirect RDMA with your CX5 NIC. You can use https://github.com/linux-rdma/perftest; follow the "GPUDirect usage:" instructions in its README. Because the nvidia_peermem module is loaded on your system, you don't need to use DMABUF. The sketch below shows the core idea on the application side.
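For illustration only, here is a rough sketch of that core idea: with nvidia_peermem loaded, ibv_reg_mr accepts a GPU virtual address from cuMemAlloc directly (no QP setup, no error handling; perftest does the full version of this):

/* Sketch: registering GPU memory for RDMA via nvidia_peermem.
 * Build flags may differ: gcc sketch.c -libverbs -lcuda */
#include <stdio.h>
#include <stdint.h>
#include <cuda.h>
#include <infiniband/verbs.h>

int main(void)
{
    CUdevice dev; CUcontext ctx; CUdeviceptr gpu_buf;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&gpu_buf, 1 << 20); /* 1 MiB GPU buffer */

    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    struct ibv_context *verbs = ibv_open_device(list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(verbs);

    /* With nvidia_peermem loaded, the kernel resolves this GPU virtual
     * address through the peer-memory interface. */
    struct ibv_mr *mr = ibv_reg_mr(pd, (void *)(uintptr_t)gpu_buf, 1 << 20,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("mr lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    /* ...QP setup and RDMA reads/writes as in any verbs program... */

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(verbs);
    ibv_free_device_list(list);
    cuMemFree(gpu_buf);
    cuCtxDestroy(ctx);
    return 0;
}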
Hi, thanks a lot for the fast response!
- It is v2.4.
$ modinfo nvidia
filename: /lib/modules/4.15.0-213-generic/updates/dkms/nvidia.ko
firmware: nvidia/525.105.17/gsp_tu10x.bin
firmware: nvidia/525.105.17/gsp_ad10x.bin
alias: char-major-195-*
version: 525.105.17
supported: external
license: NVIDIA
srcversion: 98F82D76E0EF3952EEE57A7
alias: pci:v000010DEd*sv*sd*bc06sc80i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: drm
retpoline: Y
name: nvidia
vermagic: 4.15.0-213-generic SMP mod_unload modversions
- Yes, I have tried this before; it executes successfully. Here is the output:
GPU id:0; name: Tesla T4; Bus id: 0000:21:00
selecting device 0
Allocated GPU memory at 0x7f9c5f000000
device ptr: 0x7f9c5f000000
gpu alloc fn: cuMemAlloc
map_d_ptr: 0x7f9c87be3000
info.va: 7f9c5f000000
info.mapped_size: 4
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer: 0x7f9c87be3000
CPU does gdr_copy_to_mapping and GPU writes back via cuMemHostAlloc'd buffer.
Running 1000 iterations with data size 4 bytes.
Round-trip latency per iteration is 1.47825 us
unmapping buffer
unpinning buffer
closing gdrdrv
- Thanks a lot for this. I just tried perftest with CUDA. It works with payload sizes from 2 to 8 bytes. Here are the results:
$ ib_read_lat -d mlx5_0 --use_cuda=0 -a -F 10.1.1.9
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 21:00
Picking device No. 0
[pid = 1331, dev = 0] device name = [Tesla T4]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 16777216 bytes GPU buffer
allocated GPU buffer address at 00007f2603000000 pointer=0x7f2603000000
---------------------------------------------------------------------------------------
RDMA_Read Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0032 PSN 0x63f15d OUT 0x10 RKey 0x183de7 VAddr 0x007f2603800000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:01:01:08
remote address: LID 0000 QPN 0x0032 PSN 0xd24e5c OUT 0x10 RKey 0x183ce6 VAddr 0x007f538f800000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:01:01:09
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000 4.93 10.25 4.98 5.03 0.04 5.75 10.25
4 1000 4.96 10.70 5.01 5.01 0.00 5.08 10.70
8 1000 5.12 8.91 5.18 5.18 0.00 5.24 8.91
Completion with error at client
Failed status 11: wr_id 0 syndrom 0x89
scnt=1, ccnt=1
I tried regular perftest (CPU, not CUDA) and it works with all message sizes. Do you know what might cause this?
- In perftest, is the data transferred (1) directly from remote GPU memory to local CPU memory, (2) from remote GPU memory to local GPU memory, or (3) from the remote GPU to the local GPU and then to the local CPU?
- Besides that, GPUDirect RDMA requires an OFED version newer than 4.9, but tcpdump cannot capture RoCE packets when the Mellanox driver is newer than 4.9, and ibdump only supports InfiniBand, so it is hard for me to debug. Do you know of a good substitute?
Based on gdrcopy_pplat, small data seems to work fine. Please run GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 gdrcopy_sanity -v -t data_validation_cumemalloc. That will provide more clues about where the failure is triggered.
Some perftest applications do not work well with CUDA or may require additional environment variables or parameters. Please run ib_write_bw instead. You may need to adapt the commands below to your environment.
Server process: ib_write_bw -d mlx5_0 --use_cuda=0 -a -F
Client process: ib_write_bw -d mlx5_0 --use_cuda=0 -a -F 10.1.1.9
- Besides that, GPUDirect RDMA requires an OFED version newer than 4.9, but tcpdump cannot capture RoCE packets when the Mellanox driver is newer than 4.9, and ibdump only supports InfiniBand, so it is hard for me to debug. Do you know of a good substitute?
For network-related questions, I suggest that you ask in the NVIDIA forum or file a bug here.
- I ran it again. Here is the result:
$ ./gdrcopy_sanity -v -t data_validation_cumemalloc
&&&& RUNNING data_validation_cumemalloc
buffer size: 327680
Allocated GPU memory at 0x7f04f1000000
DBG: sse4_1=1 avx=1 sse=1 sse2=1
DBG: mapping_type=1
off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
word content expected
0 a5a5a5a5 3f4c5e6b
1 a5a5a5a5 3f4c5e68
2 a5a5a5a5 3f4c5e6e
3 a5a5a5a5 3f4c5e62
4 a5a5a5a5 3f4c5e7a
5 a5a5a5a5 3f4c5e4a
6 a5a5a5a5 3f4c5e2a
7 a5a5a5a5 3f4c5eea
8 a5a5a5a5 3f4c5f6a
9 a5a5a5a5 3f4c5c6a
check error: 81920 different dwords out of 81920
Assertion "(compare_buf(init_buf, copy_buf, size)) == (0)" failed at sanity.cpp:519
&&&& FAILED data_validation_cumemalloc
Total: 1, Passed: 0, Failed: 1, Waived: 0
List of failed tests:
data_validation_cumemalloc
Error: Encountered an error or a test failure with status=1
- Yes. I also ran the bw one. Here is the result:
$ ib_write_bw -d mlx5_0 --use_cuda=0 -a -F 10.1.1.9
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 21:00
Picking device No. 0
[pid = 3848, dev = 0] device name = [Tesla T4]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 16777216 bytes GPU buffer
allocated GPU buffer address at 00007fe90f000000 pointer=0x7fe90f000000
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0033 PSN 0xc0a6c2 RKey 0x183de8 VAddr 0x007fe90f800000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:01:01:08
remote address: LID 0000 QPN 0x0033 PSN 0x6bce7a RKey 0x183ce7 VAddr 0x007fa57f800000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:01:01:09
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
2 5000 5.68 4.96 2.598011
4 5000 17.60 16.14 4.230363
8 5000 32.40 30.13 3.949843
Completion with error at client
Failed status 4: wr_id 0 syndrom 0x51
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
- Sure, I will ask in the forum then. Thanks a lot!
There seems to be an issue when the size is large. gdrcopy_pplat does a 4-byte ping-pong between the CPU and GPU, and it worked fine. On the other hand, gdrcopy_sanity does data validation on a 320 KiB buffer and failed. ib_write_bw also failed beyond 8 bytes. If you want to narrow down where it starts failing, a size sweep like the sketch below might help. Do you see anything in dmesg?
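A hypothetical sweep using the public gdrapi calls, for illustration only (untested on your setup; error handling omitted, and cuMemAlloc is assumed GPU-page aligned):

/* Hypothetical helper: find the size at which the mapping starts failing. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <cuda.h>
#include <gdrapi.h>

#define MAX_SIZE (256 * 1024)

int main(void)
{
    CUdevice dev; CUcontext ctx; CUdeviceptr d_ptr;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&d_ptr, MAX_SIZE);

    gdr_t g = gdr_open();
    gdr_mh_t mh;
    gdr_pin_buffer(g, d_ptr, MAX_SIZE, 0, 0, &mh);
    void *map_ptr = NULL;
    gdr_map(g, mh, &map_ptr, MAX_SIZE);

    uint8_t *src = malloc(MAX_SIZE), *dst = malloc(MAX_SIZE);
    for (size_t size = 4; size <= MAX_SIZE; size *= 2) {
        memset(src, 0x5a, size);
        memset(dst, 0x00, size);
        gdr_copy_to_mapping(mh, map_ptr, src, size); /* CPU -> BAR1 mapping */
        cuMemcpyDtoH(dst, d_ptr, size);              /* read back via CUDA */
        printf("size %7zu: %s\n", size,
               memcmp(src, dst, size) == 0 ? "OK" : "MISMATCH");
    }

    free(src); free(dst);
    gdr_unmap(g, mh, map_ptr, MAX_SIZE);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cuMemFree(d_ptr);
    cuCtxDestroy(ctx);
    return 0;
}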
Hi, I checked dmesg:
[114826.509844] gdrdrv:gdrdrv_open:minor=0 filep=0xffff8aed944e4f00
[114826.509848] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008daff)
[114826.509858] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc028da01)
[114826.509859] gdrdrv:__gdrdrv_pin_buffer:invoking nvidia_p2p_get_pages(va=0x7f04f1000000 len=327680 p2p_tok=0 va_tok=0)
[114826.509920] gdrdrv:__gdrdrv_pin_buffer:page table entries: 5
[114826.509921] gdrdrv:__gdrdrv_pin_buffer:page[0]=0x000002bf40360000
[114826.509921] gdrdrv:__gdrdrv_pin_buffer:page[1]=0x000002bf40370000
[114826.509921] gdrdrv:__gdrdrv_pin_buffer:page[2]=0x000002bf40380000
[114826.509922] gdrdrv:__gdrdrv_pin_buffer:page[3]=0x000002bf40390000
[114826.509922] gdrdrv:__gdrdrv_pin_buffer:page[4]=0x000002bf403a0000
[114826.509923] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da05)
[114826.509924] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[114826.509929] gdrdrv:gdrdrv_mmap:mmap filp=0xffff8aed944e4f00 vma=0xffff8aee1dcb9380 vm_file=0xffff8aed944e4f00 start=0x7f04ff7af000 size=327680 off=0x0
[114826.509929] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[114826.509930] gdrdrv:gdrdrv_mmap:overwriting vma->vm_private_data=0000000000000000 with mr=ffff8aee022fce40
[114826.509930] gdrdrv:gdrdrv_mmap:range start with p=0 vaddr=7f04ff7af000 page_paddr=2bf40360000
[114826.509931] gdrdrv:gdrdrv_mmap:mapping p=5 entries=5 offset=0 len=327680 vaddr=7f04ff7af000 paddr=2bf40360000
[114826.509932] gdrdrv:gdrdrv_remap_gpu_mem:mmaping phys mem addr=0x2bf40360000 size=327680 at user virt addr=0x7f04ff7af000
[114826.509940] gdrdrv:gdrdrv_mmap:mr vma=0xffff8aee1dcb9380 mapping=0xffff8aee02ffb538
[114826.509942] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da05)
[114826.509942] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[114826.509946] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da05)
[114826.509946] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[114826.511845] gdrdrv:gdrdrv_vma_close:closing vma=0xffff8aee1dcb9380 vm_file=0xffff8aed944e4f00 vm_private_data=0xffff8aee022fce40 mr=0xffff8aee022fce40 mr->vma=0xffff8aee1dcb9380
[114826.511864] gdrdrv:gdrdrv_release:closing
[114826.511865] gdrdrv:gdrdrv_release:freeing MR=0xffff8aee022fce40
[114826.511865] gdrdrv:gdr_free_mr_unlocked:invoking nvidia_p2p_put_pages(va=0x7f04f1000000 p2p_tok=0 va_tok=0)
Are there any parameters I should set that might be relevant to the size?
GPUDirect does not work properly on your system. Unfortunately, there is no clue that can help us identify the root cause. @drossetti Any suggestions?
@cxinyic Is IOMMU enabled? Can you turn it off or set it to passthrough? Then, please try again.
Hi, I already set it to passthrough, since I am using RDMA with an AMD CPU; this was previously required for RDMA.
$ dmesg|grep iommu
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-213-generic root=UUID=4bbb008a-ee68-11ed-bcfa-c45ab19d55ba ro maybe-ubiquity iommu=pt
[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-213-generic root=UUID=4bbb008a-ee68-11ed-bcfa-c45ab19d55ba ro maybe-ubiquity iommu=pt
Hi there, I did some further tests based on the observation that GPUDirect RDMA only works for sizes <= 8 bytes with both perftest and gdrcopy. I ran the data_validation sanity test with different sizes and found that check 1 (MMIO CPU initialization + read back via cuMemcpy D->H) passes as long as I add a usleep(0.01); inside init_hbuf_walking_bit(uint32_t *h_buf, size_t size). However, check 2 only passes with size <= 8, and check 3 does not pass even with size <= 8.
Here is my modified function; it passes check 1 at any size:
// Modified init_hbuf_walking_bit from gdrcopy's sanity.cpp.
void init_hbuf_walking_bit(uint32_t *h_buf, size_t size) {
    uint32_t base_value = 0x3F4C5E6A; // 0xa55ad33d;
    unsigned w;
    ASSERT_NEQ(h_buf, (void *)0);
    ASSERT_EQ(size % 4, 0U);
    // OUT << "filling mem with walking bit " << endl;
    for (w = 0; w < size / sizeof(uint32_t); ++w) {
        h_buf[w] = base_value ^ (1 << (w % 32));
        // Note: usleep() takes an integer microsecond count, so 0.01
        // truncates to usleep(0); the syscall itself appears to provide
        // the delay that makes check 1 pass.
        usleep(0.01);
    }
}
Does that imply there is no coherence guarantee when simply writing into the mapped memory, and that I need to wait for some time (usleep()) to make sure the data has been written? Are there any functions I can call to make sure the data is written?
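In case it helps others: my current guess is that since the mapping is write-combined (info.wc_mapping: 1 in the earlier output), plain stores can sit in the CPU's write-combining buffers for a while, and on x86 the documented way to flush them is a store fence rather than a sleep. I do not know whether this is the actual root cause here, but this is the variant I would try (a sketch assuming an x86-64 CPU and GCC/Clang):

/* Sketch: flush write-combining buffers with a store fence instead of
 * sleeping. _mm_sfence() is the SSE store fence from immintrin.h. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

void init_hbuf_walking_bit_fenced(uint32_t *h_buf, size_t size)
{
    const uint32_t base_value = 0x3F4C5E6A;
    assert(h_buf != NULL);
    assert(size % sizeof(uint32_t) == 0);
    for (size_t w = 0; w < size / sizeof(uint32_t); ++w)
        h_buf[w] = base_value ^ (1u << (w % 32));
    _mm_sfence(); /* make all prior stores (including WC ones) globally visible */
}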
Hi @cxinyic,
Sorry, I missed your last comment. I don't recommend using GPUDirect RDMA if it is not fully functional; you can easily run into issues. One problem is silent data corruption, which is difficult to debug in many applications.
If you want to continue with debugging the GPUDirect RDMA issue, I suggest that you file a bug and formally ask for support.
Hi @pakmarkthub,
Thanks so much for your advice. Yes, I found that GPUDirect RDMA can work, but I have not checked whether the data is corrupted. I will continue debugging this. If you have any other ideas, please tell me.