Failure to access mapped memory from the CPU side (data_validation tests fail)
Hi there, I am running the sanity test and I got this error in the data_validation test:
word content expected
13 a5a5a5a5 3f4c7e6a
14 a5a5a5a5 3f4c1e6a
15 a5a5a5a5 3f4cde6a
16 a5a5a5a5 3f4d5e6a
17 a5a5a5a5 3f4e5e6a
18 a5a5a5a5 3f485e6a
19 a5a5a5a5 3f445e6a
20 a5a5a5a5 3f5c5e6a
21 a5a5a5a5 3f6c5e6a
22 a5a5a5a5 3f0c5e6a
I debugged this myself and found that in the function init_hbuf_walking_bit, buf_ptr cannot be written: its content is always 0xffffffff, and the corresponding GPU memory is never changed (0xa5a5a5a5 in this case). There are no other errors during the whole process; all function calls return success.
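For reference, here is my own condensed sketch of the path I believe the test exercises, based on the public gdrapi.h flow (illustration only, not the actual sanity.cpp code; error handling omitted, and I assume cuMemAlloc returns a GPU-page-aligned buffer):

/* Condensed sketch of the gdrcopy mapping flow (build flags may differ):
 *   gcc sketch.c -lgdrapi -lcuda */
#include <stdio.h>
#include <stdint.h>
#include <cuda.h>
#include <gdrapi.h>

#define SIZE (64 * 1024) /* one 64 KiB GPU page */

int main(void)
{
    CUdevice dev; CUcontext ctx; CUdeviceptr d_ptr;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&d_ptr, SIZE);

    gdr_t g = gdr_open();                      /* talks to /dev/gdrdrv */
    gdr_mh_t mh;
    gdr_pin_buffer(g, d_ptr, SIZE, 0, 0, &mh); /* nvidia_p2p_get_pages under the hood */

    void *map_ptr = NULL;
    gdr_map(g, mh, &map_ptr, SIZE);            /* BAR1 space mapped into user space */

    uint32_t pattern = 0x3f4c5e6a;
    /* CPU-side write through the mapping; this is the step that fails for me */
    gdr_copy_to_mapping(mh, map_ptr, &pattern, sizeof(pattern));

    /* Read back through the normal CUDA path to validate */
    uint32_t check = 0;
    cuMemcpyDtoH(&check, d_ptr, sizeof(check));
    printf("wrote %08x, read back %08x\n", pattern, check);

    gdr_unmap(g, mh, map_ptr, SIZE);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cuMemFree(d_ptr);
    cuCtxDestroy(ctx);
    return 0;
}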
Here are the settings on my server (bare-metal machine): OS: Ubuntu 18.04.6 LTS, kernel: 4.15.0-213-generic, GPU: Tesla T4, driver:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:21:00.0 Off | 0 |
| N/A 29C P8 14W / 70W | 4MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
$ ofed_info -s
MLNX_OFED_LINUX-5.4-3.4.0.0:
module:
$ lsmod|grep nv
nvidia_peermem 16384 0
nvidia_uvm 1216512 6
ib_core 311296 10 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_iser,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia_drm 57344 0
nvidia_modeset 1241088 1 nvidia_drm
nvidia 56418304 49 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
drm_kms_helper 172032 2 mgag200,nvidia_drm
drm 401408 6 drm_kms_helper,nvidia,mgag200,nvidia_drm,ttm
Hi @cxinyic,
- What is your CPU?
- How do you connect GPU to CPU (directly to root complex or via a PCIe switch)?
- You seem to have a NIC. Have you tried GPUDirect RDMA with your NIC to see if you observe any data corruption?
Hi,
- AMD EPYC 7313 16-Core Processor
$ nvidia-smi topo -m
GPU0 NIC0 CPU Affinity NUMA Affinity
GPU0 X NODE 0-15,32-47 0
NIC0 NODE X
- Yes. I have a ConnectX-5 NIC. My ultimate goal is to enable GPUDirect with the RDMA NIC so that I can access the GPU memory of a remote server directly through the NIC. But I have not found any concrete examples of how to use GPUDirect RDMA, so I decided to first test whether it is possible to map GPU memory to the CPU, which is why I tried gdrcopy. If possible, could you point me to some examples of how to use GPUDirect RDMA with the NIC? I have two servers connected by ConnectX-5 NICs.
- Does GPUDirect support my current environment settings?
GPUDirect requires all components in the path to work correctly. May I ask you to check the following?
- What is the GDRCopy version you are using? If it's not v2.4, please upgrade to that.
- Which flavor of the NVIDIA driver are you using, the open-source or the proprietary one? You can just post the output of modinfo nvidia here.
- Please try gdrcopy_pplat. Does it run to completion without errors? It should take just a few seconds to finish; otherwise, it is likely hung.
- Please try GPUDirect RDMA with your CX5 NIC. You can use https://github.com/linux-rdma/perftest; follow the "GPUDirect usage:" instructions in its README. Because the nvidia_peermem module is loaded on your system, you don't need to use DMABUF. The sketch below shows the core idea on the application side.
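For illustration only, here is a rough sketch of that core idea: with nvidia_peermem loaded, ibv_reg_mr accepts a GPU virtual address from cuMemAlloc directly (no QP setup, no error handling; perftest does the full version of this):

/* Sketch: registering GPU memory for RDMA via nvidia_peermem.
 * Build flags may differ: gcc sketch.c -libverbs -lcuda */
#include <stdio.h>
#include <stdint.h>
#include <cuda.h>
#include <infiniband/verbs.h>

int main(void)
{
    CUdevice dev; CUcontext ctx; CUdeviceptr gpu_buf;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&gpu_buf, 1 << 20); /* 1 MiB GPU buffer */

    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    struct ibv_context *verbs = ibv_open_device(list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(verbs);

    /* With nvidia_peermem loaded, the kernel resolves this GPU virtual
     * address through the peer-memory interface. */
    struct ibv_mr *mr = ibv_reg_mr(pd, (void *)(uintptr_t)gpu_buf, 1 << 20,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("mr lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    /* ...QP setup and RDMA reads/writes as in any verbs program... */

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(verbs);
    ibv_free_device_list(list);
    cuMemFree(gpu_buf);
    cuCtxDestroy(ctx);
    return 0;
}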
Hi, thanks a lot for the fast response!
- It is v2.4.
$ modinfo nvidia
filename: /lib/modules/4.15.0-213-generic/updates/dkms/nvidia.ko
firmware: nvidia/525.105.17/gsp_tu10x.bin
firmware: nvidia/525.105.17/gsp_ad10x.bin
alias: char-major-195-*
version: 525.105.17
supported: external
license: NVIDIA
srcversion: 98F82D76E0EF3952EEE57A7
alias: pci:v000010DEd*sv*sd*bc06sc80i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: drm
retpoline: Y
name: nvidia
vermagic: 4.15.0-213-generic SMP mod_unload modversions
- Yes, I have tried this before; it executes successfully. Here is the output:
GPU id:0; name: Tesla T4; Bus id: 0000:21:00
selecting device 0
Allocated GPU memory at 0x7f9c5f000000
device ptr: 0x7f9c5f000000
gpu alloc fn: cuMemAlloc
map_d_ptr: 0x7f9c87be3000
info.va: 7f9c5f000000
info.mapped_size: 4
info.page_size: 65536
info.mapped: 1
info.wc_mapping: 1
page offset: 0
user-space pointer: 0x7f9c87be3000
CPU does gdr_copy_to_mapping and GPU writes back via cuMemHostAlloc'd buffer.
Running 1000 iterations with data size 4 bytes.
Round-trip latency per iteration is 1.47825 us
unmapping buffer
unpinning buffer
closing gdrdrv
- Thanks a lot for this. I just tried perftest with CUDA. It works with payload sizes from 2 to 8 bytes. Here are the results:
$ ib_read_lat -d mlx5_0 --use_cuda=0 -a -F 10.1.1.9
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 21:00
Picking device No. 0
[pid = 1331, dev = 0] device name = [Tesla T4]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 16777216 bytes GPU buffer
allocated GPU buffer address at 00007f2603000000 pointer=0x7f2603000000
---------------------------------------------------------------------------------------
RDMA_Read Latency Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0032 PSN 0x63f15d OUT 0x10 RKey 0x183de7 VAddr 0x007f2603800000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:01:01:08
remote address: LID 0000 QPN 0x0032 PSN 0xd24e5c OUT 0x10 RKey 0x183ce6 VAddr 0x007f538f800000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:01:01:09
---------------------------------------------------------------------------------------
#bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
2 1000 4.93 10.25 4.98 5.03 0.04 5.75 10.25
4 1000 4.96 10.70 5.01 5.01 0.00 5.08 10.70
8 1000 5.12 8.91 5.18 5.18 0.00 5.24 8.91
Completion with error at client
Failed status 11: wr_id 0 syndrom 0x89
scnt=1, ccnt=1
I tried regular perftest (CPU, not CUDA) and it works with all message sizes. Do you know what might cause this?
- In perftest, is the data transferred (1) directly from remote GPU memory to local CPU memory, (2) from remote GPU memory to local GPU memory, or (3) from the remote GPU to the local GPU and then to the local CPU?
- Besides that, GPUDirect RDMA requires an OFED version newer than 4.9, but tcpdump cannot capture RoCE packets when the Mellanox driver is newer than 4.9, and ibdump only supports InfiniBand, so it is hard for me to debug. Do you know of a good substitute?
Based on gdrcopy_pplat, small data seems to work fine. Please run GDRCOPY_ENABLE_LOGGING=1 GDRCOPY_LOG_LEVEL=1 gdrcopy_sanity -v -t data_validation_cumemalloc. That will provide more clues about where the failure is triggered.
Some perftest applications do not work well with CUDA or may require additional environment variables or parameters. Please run ib_write_bw instead. You may need to adapt the commands below to your environment.
Server process: ib_write_bw -d mlx5_0 --use_cuda=0 -a -F
Client process: ib_write_bw -d mlx5_0 --use_cuda=0 -a -F 10.1.1.9
- Besides that, GPUDirect RDMA requires an OFED version newer than 4.9, but tcpdump cannot capture RoCE packets when the Mellanox driver is newer than 4.9, and ibdump only supports InfiniBand, so it is hard for me to debug. Do you know of a good substitute?
For network-related questions, I suggest that you ask in the NVIDIA forum or file a bug here.
- I ran it again. Here is the result:
$ ./gdrcopy_sanity -v -t data_validation_cumemalloc
&&&& RUNNING data_validation_cumemalloc
buffer size: 327680
Allocated GPU memory at 0x7f04f1000000
DBG: sse4_1=1 avx=1 sse=1 sse2=1
DBG: mapping_type=1
off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
word content expected
0 a5a5a5a5 3f4c5e6b
1 a5a5a5a5 3f4c5e68
2 a5a5a5a5 3f4c5e6e
3 a5a5a5a5 3f4c5e62
4 a5a5a5a5 3f4c5e7a
5 a5a5a5a5 3f4c5e4a
6 a5a5a5a5 3f4c5e2a
7 a5a5a5a5 3f4c5eea
8 a5a5a5a5 3f4c5f6a
9 a5a5a5a5 3f4c5c6a
check error: 81920 different dwords out of 81920
Assertion "(compare_buf(init_buf, copy_buf, size)) == (0)" failed at sanity.cpp:519
&&&& FAILED data_validation_cumemalloc
Total: 1, Passed: 0, Failed: 1, Waived: 0
List of failed tests:
data_validation_cumemalloc
Error: Encountered an error or a test failure with status=1
- Yes. I also ran the bw one. Here is the result:
$ ib_write_bw -d mlx5_0 --use_cuda=0 -a -F 10.1.1.9
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 21:00
Picking device No. 0
[pid = 3848, dev = 0] device name = [Tesla T4]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 16777216 bytes GPU buffer
allocated GPU buffer address at 00007fe90f000000 pointer=0x7fe90f000000
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0033 PSN 0xc0a6c2 RKey 0x183de8 VAddr 0x007fe90f800000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:01:01:08
remote address: LID 0000 QPN 0x0033 PSN 0x6bce7a RKey 0x183ce7 VAddr 0x007fa57f800000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:01:01:09
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
2 5000 5.68 4.96 2.598011
4 5000 17.60 16.14 4.230363
8 5000 32.40 30.13 3.949843
Completion with error at client
Failed status 4: wr_id 0 syndrom 0x51
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
- Sure, I will ask in the forum then. Thanks a lot!
There seems to be an issue when the size is large. gdrcopy_pplat does a 4-byte ping-pong between the CPU and GPU, and it worked fine. On the other hand, gdrcopy_sanity does data validation on a 320 KiB buffer and failed. ib_write_bw also failed beyond 8 bytes. If you want to narrow down where it starts failing, a size sweep like the sketch below might help. Do you see anything in dmesg?
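A hypothetical sweep using the public gdrapi calls, for illustration only (untested on your setup; error handling omitted, and cuMemAlloc is assumed GPU-page aligned):

/* Hypothetical helper: find the size at which the mapping starts failing. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <cuda.h>
#include <gdrapi.h>

#define MAX_SIZE (256 * 1024)

int main(void)
{
    CUdevice dev; CUcontext ctx; CUdeviceptr d_ptr;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&d_ptr, MAX_SIZE);

    gdr_t g = gdr_open();
    gdr_mh_t mh;
    gdr_pin_buffer(g, d_ptr, MAX_SIZE, 0, 0, &mh);
    void *map_ptr = NULL;
    gdr_map(g, mh, &map_ptr, MAX_SIZE);

    uint8_t *src = malloc(MAX_SIZE), *dst = malloc(MAX_SIZE);
    for (size_t size = 4; size <= MAX_SIZE; size *= 2) {
        memset(src, 0x5a, size);
        memset(dst, 0x00, size);
        gdr_copy_to_mapping(mh, map_ptr, src, size); /* CPU -> BAR1 mapping */
        cuMemcpyDtoH(dst, d_ptr, size);              /* read back via CUDA */
        printf("size %7zu: %s\n", size,
               memcmp(src, dst, size) == 0 ? "OK" : "MISMATCH");
    }

    free(src); free(dst);
    gdr_unmap(g, mh, map_ptr, MAX_SIZE);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cuMemFree(d_ptr);
    cuCtxDestroy(ctx);
    return 0;
}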
Hi, I checked dmesg:
[114826.509844] gdrdrv:gdrdrv_open:minor=0 filep=0xffff8aed944e4f00
[114826.509848] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008daff)
[114826.509858] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc028da01)
[114826.509859] gdrdrv:__gdrdrv_pin_buffer:invoking nvidia_p2p_get_pages(va=0x7f04f1000000 len=327680 p2p_tok=0 va_tok=0)
[114826.509920] gdrdrv:__gdrdrv_pin_buffer:page table entries: 5
[114826.509921] gdrdrv:__gdrdrv_pin_buffer:page[0]=0x000002bf40360000
[114826.509921] gdrdrv:__gdrdrv_pin_buffer:page[1]=0x000002bf40370000
[114826.509921] gdrdrv:__gdrdrv_pin_buffer:page[2]=0x000002bf40380000
[114826.509922] gdrdrv:__gdrdrv_pin_buffer:page[3]=0x000002bf40390000
[114826.509922] gdrdrv:__gdrdrv_pin_buffer:page[4]=0x000002bf403a0000
[114826.509923] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da05)
[114826.509924] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[114826.509929] gdrdrv:gdrdrv_mmap:mmap filp=0xffff8aed944e4f00 vma=0xffff8aee1dcb9380 vm_file=0xffff8aed944e4f00 start=0x7f04ff7af000 size=327680 off=0x0
[114826.509929] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[114826.509930] gdrdrv:gdrdrv_mmap:overwriting vma->vm_private_data=0000000000000000 with mr=ffff8aee022fce40
[114826.509930] gdrdrv:gdrdrv_mmap:range start with p=0 vaddr=7f04ff7af000 page_paddr=2bf40360000
[114826.509931] gdrdrv:gdrdrv_mmap:mapping p=5 entries=5 offset=0 len=327680 vaddr=7f04ff7af000 paddr=2bf40360000
[114826.509932] gdrdrv:gdrdrv_remap_gpu_mem:mmaping phys mem addr=0x2bf40360000 size=327680 at user virt addr=0x7f04ff7af000
[114826.509940] gdrdrv:gdrdrv_mmap:mr vma=0xffff8aee1dcb9380 mapping=0xffff8aee02ffb538
[114826.509942] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da05)
[114826.509942] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[114826.509946] gdrdrv:gdrdrv_ioctl:ioctl called (cmd 0xc008da05)
[114826.509946] gdrdrv:gdr_mr_from_handle_unlocked:mr->handle=0x0 handle=0x0
[114826.511845] gdrdrv:gdrdrv_vma_close:closing vma=0xffff8aee1dcb9380 vm_file=0xffff8aed944e4f00 vm_private_data=0xffff8aee022fce40 mr=0xffff8aee022fce40 mr->vma=0xffff8aee1dcb9380
[114826.511864] gdrdrv:gdrdrv_release:closing
[114826.511865] gdrdrv:gdrdrv_release:freeing MR=0xffff8aee022fce40
[114826.511865] gdrdrv:gdr_free_mr_unlocked:invoking nvidia_p2p_put_pages(va=0x7f04f1000000 p2p_tok=0 va_tok=0)
Are there any parameters I should set that might be relevant to the size?
GPUDirect does not work properly on your system. Unfortunately, there is no clue that can help us identify the root cause. @drossetti Any suggestions?
@cxinyic Is IOMMU enabled? Can you turn it off or set it to passthrough? Then, please try again.
Hi, I already set it to passthrough, since I am using RDMA with an AMD CPU; this was previously required for RDMA.
$ dmesg|grep iommu
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-213-generic root=UUID=4bbb008a-ee68-11ed-bcfa-c45ab19d55ba ro maybe-ubiquity iommu=pt
[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-213-generic root=UUID=4bbb008a-ee68-11ed-bcfa-c45ab19d55ba ro maybe-ubiquity iommu=pt
Hi there, I did some further tests based on the observation that GPUDirect RDMA only works for sizes <= 8 bytes with both perftest and gdrcopy. I ran the data_validation sanity test with different sizes and found that check 1 (MMIO CPU initialization + read back via cuMemcpy D->H) passes as long as I add a usleep(0.01); inside init_hbuf_walking_bit(uint32_t *h_buf, size_t size). However, check 2 only passes with size <= 8, and check 3 does not pass even with size <= 8.
Here is my modified function; it passes check 1 at any size:
// Modified init_hbuf_walking_bit from gdrcopy's sanity.cpp.
void init_hbuf_walking_bit(uint32_t *h_buf, size_t size) {
    uint32_t base_value = 0x3F4C5E6A; // 0xa55ad33d;
    unsigned w;
    ASSERT_NEQ(h_buf, (void *)0);
    ASSERT_EQ(size % 4, 0U);
    // OUT << "filling mem with walking bit " << endl;
    for (w = 0; w < size / sizeof(uint32_t); ++w) {
        h_buf[w] = base_value ^ (1 << (w % 32));
        // Note: usleep() takes an integer microsecond count, so 0.01
        // truncates to usleep(0); the syscall itself appears to provide
        // the delay that makes check 1 pass.
        usleep(0.01);
    }
}
Does that imply there is no coherence guarantee when simply writing into the mapped memory, and that I need to wait for some time (usleep()) to make sure the data has been written? Are there any functions I can call to make sure the data is written?
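In case it helps others: my current guess is that since the mapping is write-combined (info.wc_mapping: 1 in the earlier output), plain stores can sit in the CPU's write-combining buffers for a while, and on x86 the documented way to flush them is a store fence rather than a sleep. I do not know whether this is the actual root cause here, but this is the variant I would try (a sketch assuming an x86-64 CPU and GCC/Clang):

/* Sketch: flush write-combining buffers with a store fence instead of
 * sleeping. _mm_sfence() is the SSE store fence from immintrin.h. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

void init_hbuf_walking_bit_fenced(uint32_t *h_buf, size_t size)
{
    const uint32_t base_value = 0x3F4C5E6A;
    assert(h_buf != NULL);
    assert(size % sizeof(uint32_t) == 0);
    for (size_t w = 0; w < size / sizeof(uint32_t); ++w)
        h_buf[w] = base_value ^ (1u << (w % 32));
    _mm_sfence(); /* make all prior stores (including WC ones) globally visible */
}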
Hi @cxinyic,
Sorry, I missed your last comment. I don't recommend using GPUDirect RDMA if it is not fully functional; you can easily run into issues. One problem is silent data corruption, which is difficult to debug in many applications.
If you want to continue with debugging the GPUDirect RDMA issue, I suggest that you file a bug and formally ask for support.
Hi @pakmarkthub,
Thanks so much for your advice. Yes, I found that GPUDirect RDMA can work, but I have not checked whether the data is corrupted. I will continue debugging this. If you have any other ideas, please tell me.