
[Bug] Cuda failure 'invalid argument'

Open jingwangsg opened this issue 7 months ago • 0 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [x] 5. Please use English, otherwise it will be closed.

Describe the bug

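NCCL reports an unhandled CUDA error ('invalid argument') on rank 0 during the dist.broadcast call inside sglang's broadcast_pyobj. The failure is first reported from transport/p2p.cc while the communicator is being initialized. NCCL debug output and the traceback follow: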

6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac0491a0 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 44 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac0491a0 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac0063e0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0e600000 size 2097152 ipcDesc 0x7f2bac04a970
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04a940 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 45 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac04a940 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac006458
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0e800000 size 2097152 ipcDesc 0x7f2bac04c110
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04c0e0 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 46 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac04c0e0 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac0064d0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0ea00000 size 2097152 ipcDesc 0x7f2bac04d8b0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04d880 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 47 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac04d880 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac006548
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0ec00000 size 2097152 ipcDesc 0x7f2bac04f050
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04f020 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0

6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] transport/p2p.cc:275 NCCL WARN Cuda failure 'invalid argument'
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport/p2p.cc:330 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport/p2p.cc:460 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport.cc:165 -> 1
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ProxyCall UDS comm 0x55b2e3c71350 rank 1 tpRank 0(b7a7bd7508f9845d) reqSize 8 respSize 0 respFd 0x7f2be0fd7ca8 opId 0xa3cc92762de46676
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO init.cc:1263 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO init.cc:1548 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
6f747114300646eb-5af0524bda8e4aee:4167:4167 [0] NCCL INFO group.cc:418 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:4167 [0] NCCL INFO init.cc:1929 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] NCCL INFO proxyUDSRecvReq::ncclProxyMsgGetFd rank 1 opId 0xa3cc92762de46676 handle=0x564e1ead6750
6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] NCCL INFO UDS proxyGetFd received handle 0x564e1ead6750 peer 1 opId a3cc92762de46676

6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] proxy.cc:1341 NCCL WARN Cuda failure 1 'invalid argument'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/amlfs-01/home/jingwang/LEARNS/learn_verl_0512/verl/tests/workers/rollout/test_sglang_async_spmd.py", line 115, in <module>
[rank0]:     test_sglang_spmd()
[rank0]:   File "/mnt/amlfs-01/home/jingwang/LEARNS/learn_verl_0512/verl/tests/workers/rollout/test_sglang_async_spmd.py", line 97, in test_sglang_spmd
[rank0]:     [outputs] = broadcast_pyobj(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils.py", line 893, in broadcast_pyobj
[rank0]:     dist.broadcast(tensor_size, src=src, group=dist_group)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank0]:     work = group.broadcast([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 'invalid argument'
6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] NCCL INFO [Proxy Service UDS] exit: stop 0 abortFlag 1
[rank0]:[W513 05:03:06.244537893 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0513 05:03:07.672000 4162 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 4168 closing signal SIGTERM
E0513 05:03:08.292000 4162 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 4167) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
test_sglang_async_spmd.py FAILED
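
Most of the output above is routine proxy setup on rank 1; the failure itself is confined to a handful of rank 0 lines. As a rough aid for reading logs like this, the sketch below pulls out the NCCL WARN lines and the "file.cc:line -> code" return-code chain. The function name summarize_nccl_log and both regular expressions are my own, written against the line shapes seen in this particular log (NCCL 2.21.5), so they may need adjusting for other versions.

```python
import re

# Assumed line shapes, taken only from the log in this issue:
#   "... [0] transport/p2p.cc:275 NCCL WARN Cuda failure 'invalid argument'"
#   "... [0] NCCL INFO transport/p2p.cc:330 -> 1"
WARNING = re.compile(r"\[(\d+)\] (\S+\.cc:\d+) NCCL WARN (.*)")
PROPAGATION = re.compile(r"\[(\d+)\] NCCL INFO (\S+\.cc:\d+) -> (\d+)")


def summarize_nccl_log(lines):
    """Return (warnings, propagation) extracted from NCCL debug output.

    warnings:    list of (rank, source_location, message)
    propagation: list of (rank, source_location, return_code)
    """
    warnings, propagation = [], []
    for line in lines:
        match = WARNING.search(line)
        if match:
            warnings.append((int(match.group(1)), match.group(2), match.group(3).strip()))
            continue
        match = PROPAGATION.search(line)
        if match:
            propagation.append((int(match.group(1)), match.group(2), int(match.group(3))))
    return warnings, propagation


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    "6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM",
    "6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] transport/p2p.cc:275 NCCL WARN Cuda failure 'invalid argument'",
    "6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport/p2p.cc:330 -> 1",
    "6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO group.cc:64 -> 1 [Async thread]",
]

sample_warnings, sample_propagation = summarize_nccl_log(SAMPLE)
for rank, location, message in sample_warnings:
    print(f"rank {rank} WARN at {location}: {message}")
for rank, location, code in sample_propagation:
    print(f"rank {rank} returned {code} at {location}")
```

To run it on a full capture, redirect the torchrun output to a file and pass the open file object to summarize_nccl_log in place of SAMPLE.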

Reproduction

  1. Use nvcr.io/nvidia/pytorch:24.08-py3 as the base image.
  2. Follow the exact setup process in verl.
  3. Clone verl (git clone), then cd verl/tests/workers/rollout.
  4. Run: NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes=1 --nproc_per_node=2 test_sglang_async_spmd.py

Environment

Python: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.20
CUDA Driver Version: 550.127.08
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post1
sgl_kernel: 0.1.0
flashinfer: Module Not Found
triton: 3.2.0
transformers: 4.51.1
torchao: 0.11.0
numpy: 1.24.4
aiohttp: 3.10.1
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.31.1
interegular: 0.3.3
modelscope: 1.25.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 23.2
psutil: 6.0.0
pydantic: 2.8.2
multipart: Module Not Found
zmq: Module Not Found
uvicorn: 0.34.2
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.17
openai: 1.75.0
tiktoken: 0.9.0
anthropic: 0.51.0
litellm: 1.69.0
decord: 0.6.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    0-47    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    0-47    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    0-47    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    0-47    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     48-95   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     48-95   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     48-95   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     48-95   1               N/A
NIC0    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE
NIC1    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     NODE
NIC2    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     NODE
NIC3    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     NODE
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS
NIC5    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS
NIC8    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8


Hypervisor vendor: Microsoft
ulimit soft: 1048576

jingwangsg, May 12 '25 21:05