
Failed run sanity test in docker container.

myanzhang opened this issue 3 years ago · 4 comments

Docker container environment: CentOS 7.2, CUDA 11.0, GPU A100-SXM4-40GB. When I run the sanity test, I hit 2 failures, as shown in this screenshot: [WeCom screenshot]

But the copybw / copylat / apiperf tests are all OK! [WeCom screenshots]

Do you have any suggestions for tracking down the cause? Thanks a lot for your help!

myanzhang avatar Mar 09 '22 04:03 myanzhang

Hi @myanzhang ,

Can you run sanity -v and post the output?

pakmarkthub avatar Mar 09 '22 04:03 pakmarkthub

```
[root@ts-6ab12923e4f84b41a1dec977fcf2a978-launcher ~/gdrcopy/tests]# ./sanity -v
Running suite(s): Sanity
&&&& RUNNING basic_cumemalloc
buffer size: 327680
&&&& PASSED basic_cumemalloc
&&&& RUNNING basic_with_tokens
buffer size: 327680
&&&& PASSED basic_with_tokens
&&&& RUNNING basic_unaligned_mapping
First allocation: d_fa=0x7f1263200000, size=4
Second allocation: d_A=0x7f1263220200, size=65540, GPU-page-boundary 0x7f1263220000
d_A is unaligned
Try mapping d_A as is. Mapping d_A failed as expected.
Align d_A and try mapping it again.
Pin and map aligned address: d_aligned_A=0x7f1263230000, offset=65024, size=516
&&&& PASSED basic_unaligned_mapping
&&&& RUNNING basic_child_thread_pins_buffer_cumemalloc
spawning single child thread
pinning
Assertion "(gdr_pin_buffer(pt->g, pt->d_buf, pt->size, 0, 0, &pt->mh)) == (0)" failed at sanity.cpp:1751
&&&& FAILED basic_child_thread_pins_buffer_cumemalloc
&&&& RUNNING basic_vmmalloc
buffer size: 327680
&&&& PASSED basic_vmmalloc
&&&& RUNNING basic_child_thread_pins_buffer_vmmalloc
spawning single child thread
pinning
Assertion "(gdr_pin_buffer(pt->g, pt->d_buf, pt->size, 0, 0, &pt->mh)) == (0)" failed at sanity.cpp:1751
&&&& FAILED basic_child_thread_pins_buffer_vmmalloc
&&&& RUNNING data_validation_cumemalloc
buffer size: 327680 off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
check 2: gdr_copy_to_bar() + read back via cuMemcpy D->H
check 3: gdr_copy_to_bar() + read back via gdr_copy_from_bar()
check 4: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 5 dwords offset
check 5: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 11 bytes offset
warning: buffer size 327669 is not dword aligned, ignoring trailing bytes
unmapping
unpinning
&&&& PASSED data_validation_cumemalloc
&&&& RUNNING data_validation_vmmalloc
buffer size: 327680 off: 0
check 1: MMIO CPU initialization + read back via cuMemcpy D->H
check 2: gdr_copy_to_bar() + read back via cuMemcpy D->H
check 3: gdr_copy_to_bar() + read back via gdr_copy_from_bar()
check 4: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 5 dwords offset
check 5: gdr_copy_to_bar() + read back via gdr_copy_from_bar() + 11 bytes offset
warning: buffer size 327669 is not dword aligned, ignoring trailing bytes
unmapping
unpinning
&&&& PASSED data_validation_vmmalloc
&&&& RUNNING invalidation_access_after_gdr_close_cumemalloc
Mapping bar1
Writing 254 into buf_ptr[0]
Calling gdr_close
Trying to read buf_ptr[0] after gdr_close
Get signal 7 as expected
&&&& PASSED invalidation_access_after_gdr_close_cumemalloc
&&&& RUNNING invalidation_access_after_free_cumemalloc
Mapping bar1
Writing 269 into buf_ptr[0]
Calling gpuMemFree
Trying to read buf_ptr[0] after gpuMemFree
Get signal 7 as expected
&&&& PASSED invalidation_access_after_free_cumemalloc
&&&& RUNNING invalidation_two_mappings_cumemalloc
Mapping bar1
Writing data to both mappings 954 and 955 respectively
Validating that we can read the data back
gpuMemFree and thus destroying the first mapping
Trying to read and validate the data from the second mapping after the first mapping has been destroyed
&&&& PASSED invalidation_two_mappings_cumemalloc
&&&& RUNNING invalidation_fork_access_after_free_cumemalloc
parent: Start child: Start child: waiting for cont signal from parent parent: writing buf_ptr[0] with 689 parent: read buf_ptr[0] before gpuMemFree get 689 parent: calling gpuMemFree parent: waiting for child write signal child: receive cont signal 1 from parent child: writing buf_ptr[0] with 699 child: signal parent that I have written child: waiting for signal from parent before calling gpuMemFree parent: trying to read buf_ptr[0] Get signal 7 as expected
&&&& PASSED invalidation_fork_access_after_free_cumemalloc
&&&& RUNNING invalidation_fork_after_gdr_map_cumemalloc
parent: Start parent: writing buf_ptr[0] with 557 parent: trying to read buf_ptr[0] parent: read buf_ptr[0] get 557 parent: signaling child parent: waiting for child to exit child: Start child: waiting for cont signal from parent child: receive cont signal 1 from parent child: trying to read buf_ptr[0] Get signal 11 as expected parent: trying to read buf_ptr[0] after child exits parent: read buf_ptr[0] after child exits get 557
&&&& PASSED invalidation_fork_after_gdr_map_cumemalloc
&&&& RUNNING invalidation_fork_child_gdr_map_parent_cumemalloc
parent: Start child: Start child: attempting to gdr_map parent's pinned GPU memory child: cannot do gdr_map as expected
&&&& PASSED invalidation_fork_child_gdr_map_parent_cumemalloc
&&&& RUNNING invalidation_fork_map_and_free_cumemalloc
parent: Start child: Start child: writing buf_ptr[0] with 305 child: calling gpuMemFree child: signal parent that I have called gpuMemFree parent: writing buf_ptr[0] with 305 parent: waiting for signal from child parent: received cont signal 1 from child parent: trying to read buf_ptr[0] parent: read buf_ptr[0] get 305
&&&& PASSED invalidation_fork_map_and_free_cumemalloc
&&&& RUNNING invalidation_unix_sock_shared_fd_gdr_pin_buffer_cumemalloc
parent: Start child: Start child: Receiving fd from parent via unix socket parent: Calling gdr_open parent: Extracted fd from gdr_t got fd 4 parent: Sending fd to child via unix socket parent: Waiting for child to finish child: Got fd 5 child: Converting fd to gdr_t child: Trying to do gdr_pin_buffer with the received fd child: Cannot do gdr_pin_buffer with the received fd as expected
&&&& PASSED invalidation_unix_sock_shared_fd_gdr_pin_buffer_cumemalloc
&&&& RUNNING invalidation_unix_sock_shared_fd_gdr_map_cumemalloc
parent: Start child: Start child: Receiving fd from parent via unix socket parent: Calling gdr_open parent: Calling gdr_pin_buffer parent: Extracted fd from gdr_t got fd 8 parent: Sending fd to child via unix socket parent: Extracted gdr_memh_t from gdr_mh_t got handle 0x0 parent: Sending gdr_memh_t to child parent: Waiting for child to finish child: Got fd 9 child: Converting fd to gdr_t child: Receiving gdr_memh_t from parent child: Got handle 0x0 child: Converting gdr_memh_t to gdr_mh_t child: Attempting gdr_map child: Cannot do gdr_map as expected
&&&& PASSED invalidation_unix_sock_shared_fd_gdr_map_cumemalloc
&&&& RUNNING invalidation_fork_child_gdr_pin_parent_with_tokens
parent: Start child: Start parent: CUDA generated tokens.p2pToken 0, tokens.vaSpaceToken 65024 child: Received from parent tokens.p2pToken 0, tokens.vaSpaceToken 65024
&&&& PASSED invalidation_fork_child_gdr_pin_parent_with_tokens
&&&& RUNNING invalidation_access_after_gdr_close_vmmalloc
Mapping bar1
Writing 646 into buf_ptr[0]
Calling gdr_close
Trying to read buf_ptr[0] after gdr_close
Get signal 7 as expected
&&&& PASSED invalidation_access_after_gdr_close_vmmalloc
&&&& RUNNING invalidation_access_after_free_vmmalloc
Mapping bar1
Writing 766 into buf_ptr[0]
Calling gpuMemFree
Trying to read buf_ptr[0] after gpuMemFree
Get signal 7 as expected
&&&& PASSED invalidation_access_after_free_vmmalloc
&&&& RUNNING invalidation_two_mappings_vmmalloc
Mapping bar1
Writing data to both mappings 32 and 33 respectively
Validating that we can read the data back
gpuMemFree and thus destroying the first mapping
Trying to read and validate the data from the second mapping after the first mapping has been destroyed
&&&& PASSED invalidation_two_mappings_vmmalloc
&&&& RUNNING invalidation_fork_access_after_free_vmmalloc
parent: Start child: Start child: waiting for cont signal from parent parent: writing buf_ptr[0] with 310 parent: read buf_ptr[0] before gpuMemFree get 310 parent: calling gpuMemFree parent: waiting for child write signal child: receive cont signal 1 from parent child: writing buf_ptr[0] with 320 child: signal parent that I have written child: waiting for signal from parent before calling gpuMemFree parent: trying to read buf_ptr[0] Get signal 7 as expected
&&&& PASSED invalidation_fork_access_after_free_vmmalloc
&&&& RUNNING invalidation_fork_after_gdr_map_vmmalloc
parent: Start parent: writing buf_ptr[0] with 669 parent: trying to read buf_ptr[0] parent: read buf_ptr[0] get 669 parent: signaling child parent: waiting for child to exit child: Start child: waiting for cont signal from parent child: receive cont signal 1 from parent child: trying to read buf_ptr[0] Get signal 11 as expected parent: trying to read buf_ptr[0] after child exits parent: read buf_ptr[0] after child exits get 669
&&&& PASSED invalidation_fork_after_gdr_map_vmmalloc
&&&& RUNNING invalidation_fork_child_gdr_map_parent_vmmalloc
parent: Start child: Start child: attempting to gdr_map parent's pinned GPU memory child: cannot do gdr_map as expected
&&&& PASSED invalidation_fork_child_gdr_map_parent_vmmalloc
&&&& RUNNING invalidation_fork_map_and_free_vmmalloc
parent: Start child: Start child: writing buf_ptr[0] with 757 child: calling gpuMemFree child: signal parent that I have called gpuMemFree parent: writing buf_ptr[0] with 387 parent: waiting for signal from child parent: received cont signal 1 from child parent: trying to read buf_ptr[0] parent: read buf_ptr[0] get 387
&&&& PASSED invalidation_fork_map_and_free_vmmalloc
&&&& RUNNING invalidation_unix_sock_shared_fd_gdr_pin_buffer_vmmalloc
parent: Start child: Start parent: Calling gdr_open parent: Extracted fd from gdr_t got fd 4 parent: Sending fd to child via unix socket parent: Waiting for child to finish child: Receiving fd from parent via unix socket child: Got fd 5 child: Converting fd to gdr_t child: Trying to do gdr_pin_buffer with the received fd child: Cannot do gdr_pin_buffer with the received fd as expected
&&&& PASSED invalidation_unix_sock_shared_fd_gdr_pin_buffer_vmmalloc
&&&& RUNNING invalidation_unix_sock_shared_fd_gdr_map_vmmalloc
parent: Start child: Start parent: Calling gdr_open parent: Calling gdr_pin_buffer parent: Extracted fd from gdr_t got fd 8 parent: Sending fd to child via unix socket parent: Extracted gdr_memh_t from gdr_mh_t got handle 0x0 parent: Sending gdr_memh_t to child parent: Waiting for child to finish child: Receiving fd from parent via unix socket child: Got fd 9 child: Converting fd to gdr_t child: Receiving gdr_memh_t from parent child: Got handle 0x0 child: Converting gdr_memh_t to gdr_mh_t child: Attempting gdr_map child: Cannot do gdr_map as expected
&&&& PASSED invalidation_unix_sock_shared_fd_gdr_map_vmmalloc
92%: Checks: 27, Failures: 2, Errors: 0
sanity.cpp:1856:F:Basic:basic_child_thread_pins_buffer_cumemalloc:0: Failed
sanity.cpp:1862:F:Basic:basic_child_thread_pins_buffer_vmmalloc:0: Failed
```

@pakmarkthub Here is the output info, thanks!

PS: Here is how I installed and set up gdrcopy for the tests:

  1. Install gdrcopy on the physical machine, and map gdrdrv into the container. The container shows the following: [screenshot]
  2. Install the gdrcopy library in the container: cd ~/gdrcopy/ && sudo make lib_install. Afterwards, the .so files are in the /usr/local/lib dir. [WeCom screenshot]
  3. Sanity test: cd ~/gdrcopy/tests && make, then perform the test.
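A quick way to double-check the steps above from inside the container is a small script like this sketch. The paths are assumptions taken from the commands shown (lib_install defaults to /usr/local/lib, and gdrdrv normally appears as /dev/gdrdrv); adjust them if your setup differs.

```shell
# Sketch: verify the gdrcopy install described above.
# Assumes the default install prefix and device node name.
found=0
for f in /usr/local/lib/libgdrapi.so*; do
    # The glob stays literal if no file matches, so test existence explicitly.
    [ -e "$f" ] && { echo "found library: $f"; found=1; }
done
[ "$found" -eq 1 ] || echo "libgdrapi not found in /usr/local/lib"

if [ -c /dev/gdrdrv ]; then
    echo "gdrdrv device node present"
else
    echo "/dev/gdrdrv missing: map the device into the container"
fi
```

If either check fails inside the container, the sanity binary cannot talk to the driver at all, which is a different failure mode than the two pin-from-thread assertions above.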

myanzhang avatar Mar 09 '22 05:03 myanzhang

Thank you for the info. Which gdrdrv version are you using? And for libgdrapi, can you confirm that it is version 2.3?
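(One way to check both, as a hedged sketch: modinfo only reports a version if the module sets one, and the library path below assumes the default /usr/local/lib install prefix from your steps.)

```shell
# Kernel module version, if gdrdrv reports one to modinfo
modinfo gdrdrv 2>/dev/null | grep -i '^version' \
    || echo "gdrdrv version not reported by modinfo"
# User-space library: the installed file name usually encodes the version
ls /usr/local/lib/libgdrapi.so* 2>/dev/null \
    || echo "libgdrapi not found in /usr/local/lib"
```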

Based on your post, I guess that you have sudo access to the physical machine. Can you do the following?

  1. On your physical machine, change this line https://github.com/NVIDIA/gdrcopy/blob/master/insmod.sh#L28 to sudo /sbin/insmod src/gdrdrv/gdrdrv.ko dbg_enabled=1 info_enabled=1.
  2. Do make driver && sudo ./insmod.sh.
  3. On your container, run ./sanity -v.
  4. Can you post the output from dmesg from your physical machine here? I want to see the output during the sanity run. You don't need to post the whole log if you don't want to.
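Step 1 above can be applied without hand-editing the file. Here is a sketch that demonstrates the sed edit on a stand-in copy of insmod.sh (the real file is in your gdrcopy checkout, and its insmod line may differ slightly from the stand-in used here):

```shell
# Create a stand-in copy of the insmod line from insmod.sh (assumed content).
cat > /tmp/insmod.sh <<'EOF'
sudo /sbin/insmod src/gdrdrv/gdrdrv.ko
EOF

# Append the debug/info module parameters to the insmod line.
# Run the same sed against the real insmod.sh in the gdrcopy source tree.
sed -i 's|^sudo /sbin/insmod src/gdrdrv/gdrdrv.ko.*|sudo /sbin/insmod src/gdrdrv/gdrdrv.ko dbg_enabled=1 info_enabled=1|' /tmp/insmod.sh

cat /tmp/insmod.sh
# prints: sudo /sbin/insmod src/gdrdrv/gdrdrv.ko dbg_enabled=1 info_enabled=1
```

With dbg_enabled=1 and info_enabled=1 set, gdrdrv logs its activity to the kernel ring buffer, which is what the dmesg request in step 4 is after.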

pakmarkthub avatar Mar 09 '22 05:03 pakmarkthub

@pakmarkthub Thanks, I will reply when I confirm.

myanzhang avatar Mar 09 '22 06:03 myanzhang