Invalid Device Context and Seg Fault with UCX+MPI+PyTorch
Describe the bug
I'm using CUDA-aware Open MPI with UCX (from one of NVIDIA's PyTorch images, where UCX is installed as part of HPC-X) to perform collectives between GPUs. I consistently run into the error below and have been unable to solve it. Solutions I have tried:
- changing the UCX_TLS environment variable
- explicitly setting the device for each rank with torch.cuda.set_device(rank) (sketched below)
- reinstalling UCX with gdrcopy (which does not seem to be recognized by ucx_info -d)
I'm not sure what is going wrong here and would greatly appreciate assistance!
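For reference, roughly what the device-pinning attempt looked like -- a minimal sketch, not my full script (OMPI_COMM_WORLD_LOCAL_RANK is the per-process local-rank variable set by Open MPI's mpirun):

import os
import torch

# Pin each rank to its own GPU before any CUDA work or collectives run.
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)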
Error message and stack trace:
[1700266166.002539] [e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0] cuda_copy_md.c:341 UCX ERROR cuMemGetAddressRange(0x7f5b05e00000) error: invalid device context
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0:768657] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f5b05e00000)
==== backtrace (tid: 768657) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f5c1eae82b4]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x304af) [0x7f5c1eae84af]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x30796) [0x7f5c1eae8796]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x1a6a72) [0x7f605c51ca72]
4 /opt/hpcx/ucx/lib/libuct.so.0(uct_mm_ep_am_short+0x93) [0x7f5c1ea959e3]
5 /opt/hpcx/ucx/lib/libucp.so.0(+0x8ee9d) [0x7f5c1ebade9d]
6 /opt/hpcx/ucx/lib/libucp.so.0(ucp_tag_send_nbx+0x735) [0x7f5c1ebb9365]
7 /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xab) [0x7f5c2403627b]
8 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b) [0x7f605bb4582b]
9 /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2) [0x7f605bb45ed2]
10 /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40) [0x7f5c1deb0840]
11 /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41) [0x7f605bb20841]
12 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4eed8a4) [0x7f600c8168a4]
13 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0xb2) [0x7f600c81d3d2]
14 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5fb34b0253]
15 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f605c40aac3]
16 /lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7f605c49ca40]
=================================
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] *** Process received signal ***
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Signal: Segmentation fault (11)
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Signal code: (-6)
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Failing at address: 0xbb417
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f605c3b8520]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a6a72)[0x7f605c51ca72]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 2] /opt/hpcx/ucx/lib/libuct.so.0(uct_mm_ep_am_short+0x93)[0x7f5c1ea959e3]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 3] /opt/hpcx/ucx/lib/libucp.so.0(+0x8ee9d)[0x7f5c1ebade9d]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 4] /opt/hpcx/ucx/lib/libucp.so.0(ucp_tag_send_nbx+0x735)[0x7f5c1ebb9365]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 5] /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xab)[0x7f5c2403627b]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 6] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b)[0x7f605bb4582b]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2)[0x7f605bb45ed2]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 8] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7f5c1deb0840]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 9] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7f605bb20841]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [10] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4eed8a4)[0x7f600c8168a4]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [11] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0xb2)[0x7f600c81d3d2]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [12] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f5fb34b0253]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f605c40aac3]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x126a40)[0x7f605c49ca40]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] *** End of error message ***
Steps to Reproduce
- Ran mpirun --allow-run-as-root -np 8 python myscript.py (a minimal sketch of the failing pattern follows this list)
- UCX version used: 1.15.0
- UCX configure flags:
--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --without-java --enable-devel-headers --with-cuda=/usr/local/cuda --with-gdrcopy=/workspace --prefix=/opt/hpcx/ucx
- Any UCX environment variables used:
UCX_TLS = cma,cuda,cuda_copy,cuda_ipc,mm,posix,self,shm,sm,sysv,tcp
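myscript.py itself is larger; the following is a minimal sketch of the pattern that fails for me, assuming the MPI backend that appears in the stack trace (ProcessGroupMPI / MPI_Bcast):

import torch
import torch.distributed as dist

# Ranks and world size come from mpirun via the MPI backend.
dist.init_process_group(backend="mpi")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# A CUDA tensor handed to a CUDA-aware MPI broadcast.
t = torch.ones(1 << 20, device="cuda")
dist.broadcast(t, src=0)   # the segfault shows up during this collective
torch.cuda.synchronize()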
Setup and versions
- OS version + CPU architecture: Ubuntu 22.04.3 LTS on x86_64 GNU/Linux
- For GPU related issues:
- GPU type: H100-80GB
- CUDA:
- Driver version: 535.86.10
- Check if peer-direct is loaded: lsmod | grep gdrdrv gives me:
gdrdrv 24576 0
nvidia 56512512 523 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
Additional information (depending on the issue)
- OpenMPI version: 4.1.5rc2
- Output of ucx_info -d to show transports and devices recognized by UCX:
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# rkey_ptr is supported
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: self
# Device: memory
# Type: loopback
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 19360.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: tcp
# Device: eth0
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 1129.60/ppn + 0.00 MB/sec
# latency: 5258 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: lo
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: sysv
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: posix
# Component: posix
# allocate: <= 990751528K
# remote key: 32 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: posix
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 16 bytes
# error handling: ep_check
#
#
# Memory domain: cuda_cpy
# Component: cuda_cpy
# allocate: unlimited
# register: unlimited, cost: 0 nsec
# memory types: host (reg), cuda (access,alloc,reg,cache,detect), cuda-managed (access,alloc,reg,cache,detect)
#
# Transport: cuda_copy
# Device: cuda
# Type: accelerator
# System device: <unknown>
#
# capabilities:
# bandwidth: 10000.00/ppn + 0.00 MB/sec
# latency: 8000 nsec
# overhead: 0 nsec
# put_short: <= 4294967295
# put_zcopy: unlimited, up to 1 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_short: <= 4294967295
# get_zcopy: unlimited, up to 1 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: none
#
#
# Memory domain: cuda_ipc
# Component: cuda_ipc
# register: unlimited, cost: 0 nsec
# remote key: 112 bytes
# memory invalidation is supported
# memory types: cuda (access,reg,cache)
#
# Transport: cuda_ipc
# Device: cuda
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 400000.00/ppn + 0.00 MB/sec
# latency: 1000 nsec
# overhead: 7000 nsec
# put_zcopy: unlimited, up to 1 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 1 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
#
# Memory domain: cma
# Component: cma
# register: unlimited, cost: 9 nsec
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: cma
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 16 bytes
# error handling: peer failure, ep_check
#
Can you please also post the error itself?
Knew I was forgetting something :) updated the description above!
Do you think this may be due to using RoCE?
@snarayan21 Can you post the output of ucx_info -v?
Is it the case that you're passing cudaMallocAsync memory or CUDA VMM memory to the bcast operation? The following symptom is generally seen for MallocAsync/VMM memory:
[1700266166.002539] [e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0] cuda_copy_md.c:341 UCX ERROR cuMemGetAddressRange(0x7f5b05e00000) error: invalid device context
Using MallocAsync memory is supported in v1.15.x, but VMM memory isn't.
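One way to check which CUDA allocator PyTorch is using on the failing ranks -- a sketch; torch.cuda.get_allocator_backend() is available in recent PyTorch releases:

import os
import torch

print("allocator backend:", torch.cuda.get_allocator_backend())   # 'native' or 'cudaMallocAsync'
print("PYTORCH_CUDA_ALLOC_CONF:", os.environ.get("PYTORCH_CUDA_ALLOC_CONF"))
# Note: the native allocator's expandable_segments:True option is VMM-based,
# so it could also produce pointers that the cuda_copy transport fails to resolve.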
Here's the output of ucx_info -v:
# Library version: 1.15.0
# Library path: /opt/hpcx/ucx/lib/libucs.so.0
# API headers version: 1.15.0
# Git branch '', revision bf8f1b6
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.1.1 --with-gdrcopy --prefix=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37
I'm not entirely sure -- I'm just using the UCC backend with PyTorch from the NVIDIA PyTorch images here: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
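For what it's worth, a quick way to confirm which backend the process group actually resolved to (the stack trace above goes through ProcessGroupMPI rather than UCC) -- a minimal sketch:

import torch.distributed as dist

if dist.is_initialized():
    print("process group backend:", dist.get_backend())   # e.g. 'mpi', 'ucc', 'nccl'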