test_low_latency.py fails on 2-node H800
Node1
MASTER_ADDR=WORLD_SIZE=2 RANK=0 python DeepEP/tests/test_low_latency.py
Node2
MASTER_ADDR=WORLD_SIZE=2 RANK=1 python DeepEP/tests/test_low_latency.py
Error message
/nvshmem/src/host/mem/mem_heap.cpp:1509: non-zero status: 2 cuMemCreate failed
/nvshmem/src/host/mem/mem_heap.cpp:1509: non-zero status: 2 cuMemCreate failed
/nvshmem/src/host/mem/mem_heap.cpp:1509: non-zero status: 2 cuMemCreate failed
/nvshmem/src/host/mem/mem_heap.cpp:1509: non-zero status: 2 cuMemCreate failed
/nvshmem/src/host/mem/mem_heap.cpp:1590: non-zero status: 7 allocate_physical_memory_to_heap failed
/nvshmem/src/host/mem/mem_heap.cpp:1590: non-zero status: 7 allocate_physical_memory_to_heap failed
/nvshmem/src/host/mem/mem_heap.cpp:1590: non-zero status: 7 allocate_physical_memory_to_heap failed
/nvshmem/src/host/mem/mem_heap.cpp:1590: non-zero status: 7 allocate_physical_memory_to_heap failed
[/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
[/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
[/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/nvshmem/src/modules/bootstrap/uid/ncclSocket/ncclsocket_socket.cpp:socketProgress:59: socketProgress: Connection closed by remote peer e01-cn-e4v48e4dl07<46500>
/nvshmem/src/modules/bootstrap/uid/bootstrap_uid.cpp:97: non-zero status: -6 /nvshmem/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -6 /nvshmem/src/host/mem/mem_heap.cpp:940: non-zero status: -6 allgather of mem handles failed
/nvshmem/src/host/mem/mem_heap.cpp:1098: non-zero status: 7 register heap memory failed
/nvshmem/src/host/mem/mem_heap.cpp:1533: non-zero status: 7 register heap UC memory failed
/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
[/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/nvshmem/src/modules/bootstrap/uid/ncclSocket/ncclsocket_socket.cpp:socketProgressOpt:39: socketProgressOpt: Call to recv from 172.18.0.177<43129> failed : Broken pipe
/nvshmem/src/modules/bootstrap/uid/bootstrap_uid.cpp:91: non-zero status: -6 /nvshmem/src/modules/bootstrap/uid/bootstrap_uid.cpp:496: non-zero status: -6 /nvshmem/src/host/mem/mem_heap.cpp:940: non-zero status: -6 allgather of mem handles failed
/nvshmem/src/host/mem/mem_heap.cpp:1098: non-zero status: 7 register heap memory failed
/nvshmem/src/host/mem/mem_heap.cpp:1533: non-zero status: 7 register heap UC memory failed
/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
/nvshmem/src/modules/bootstrap/uid/ncclSocket/ncclsocket_socket.cpp:socketProgressOpt:39: socketProgressOpt: Call to recv from 172.18.0.177<38567> failed : Broken pipe
/nvshmem/src/modules/bootstrap/uid/bootstrap_uid.cpp:91: non-zero status: -6 /nvshmem/src/modules/bootstrap/uid/bootstrap_uid.cpp:496: non-zero status: -6 /nvshmem/src/host/mem/mem_heap.cpp:940: non-zero status: -6 allgather of mem handles failed
/nvshmem/src/host/mem/mem_heap.cpp:1098: non-zero status: 7 register heap memory failed
/nvshmem/src/host/mem/mem_heap.cpp:1533: non-zero status: 7 register heap UC memory failed
/nvshmem/src/host/mem/mem_heap.cpp:532: non-zero status: 1 cuMemAddressFree failed
/nvshmem/src/host/mem/mem_heap.cpp:1590: non-zero status: 7 allocate_physical_memory_to_heap failed
/nvshmem/src/host/mem/mem_heap.cpp:532: non-zero status: 1 cuMemAddressFree failed
/nvshmem/src/host/mem/mem_heap.cpp:1590: non-zero status: 7 allocate_physical_memory_to_heap failed
/nvshmem/src/host/mem/mem_heap.cpp:532: non-zero status: 1 cuMemAddressFree failed
/nvshmem/src/host/mem/mem_heap.cpp:1590: non-zero status: 7 allocate_physical_memory_to_heap failed
/nvshmem/src/modules/bootstrap/uid/ncclSocket/ncclsocket_socket.cpp:socketProgress:59: socketProgress: Connection closed by remote peer e01-cn-e4v48e4dl06<41542>
/nvshmem/src/modules/bootstrap/uid/bootstrap_uid.cpp:97: non-zero status: -6 /nvshmem/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -6 /nvshmem/src/host/mem/mem_heap.cpp:940: non-zero status: -6 allgather of mem handles failed
/nvshmem/src/host/mem/mem_heap.cpp:1098: non-zero status: 7 register heap memory failed
/nvshmem/src/host/mem/mem_heap.cpp:1533: non-zero status: 7 register heap UC memory failed
[/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
[/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
[/nvshmem/src/host/coll/barrier/barrier.cpp:21] cuda failed with an illegal memory access was encountered
/nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
/nvshmem/src/host/mem/mem_heap.cpp:532: non-zero status: 1 cuMemAddressFree failed
How can I fix this?
Could you try running the NVSHMEM perftest to see whether NVSHMEM itself works properly?
I ran into the same problem. It is usually caused by insufficient GPU memory. Try reserving more GPU memory for the DeepEP buffer.
How do I set this?
Log in to the physical machine and make sure that the memory on your GPU is not used by other processes.
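For reference, status 2 on the `cuMemCreate failed` lines matches CUDA's out-of-memory code (`CUDA_ERROR_OUT_OF_MEMORY`), which is consistent with the GPUs already being occupied. A minimal sketch of the check suggested above, assuming `nvidia-smi` is on the PATH of each node and reports plain integer MiB values; the helper names and the 1024 MiB threshold are illustrative and not part of DeepEP or NVSHMEM:

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, memory.used, memory.total' rows (MiB, no header, no units)."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append({"index": index, "used_mib": used, "total_mib": total})
    return rows


def busy_gpus(rows, threshold_mib=1024):
    """Return indices of GPUs whose used memory exceeds the (arbitrary) threshold."""
    return [row["index"] for row in rows if row["used_mib"] > threshold_mib]


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage on the local node."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    try:
        rows = query_gpu_memory()
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"Could not query nvidia-smi: {exc}")
    else:
        for row in rows:
            print(f"GPU {row['index']}: {row['used_mib']} / {row['total_mib']} MiB used")
        occupied = busy_gpus(rows)
        if occupied:
            print(f"GPUs already in use before the test starts: {occupied}")
```

Run it on both nodes before launching the test. If any GPU shows significant usage, `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv` lists the processes holding that memory.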