sglang
[Bug] Cuda failure 'invalid argument'
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
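NCCL raises an unhandled CUDA error ('invalid argument') during the `dist.broadcast` call inside `broadcast_pyobj`. The NCCL debug output and the traceback are below.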
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac0491a0 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 44 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac0491a0 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac0063e0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0e600000 size 2097152 ipcDesc 0x7f2bac04a970
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04a940 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 45 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac04a940 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac006458
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0e800000 size 2097152 ipcDesc 0x7f2bac04c110
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04c0e0 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 46 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac04c0e0 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac0064d0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0ea00000 size 2097152 ipcDesc 0x7f2bac04d8b0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04d880 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO New proxy send connection 47 from local rank 1, transport 0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=1 op.reqBuff=0x7f2bac04d880 op.respSize=16 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO Connected to proxy localRank 1 -> connection 0x7f2bac006548
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Init res=0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Allocated shareable buffer 0xa0ec00000 size 2097152 ipcDesc 0x7f2bac04f050
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO proxyProgressAsync opId=0x7f2bc5ca8ba0 op.type=3 op.reqBuff=0x7f2bac04f020 op.respSize=80 done
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ncclPollProxyResponse Received new opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO resp.opId=0x7f2bc5ca8ba0 matches expected opId=0x7f2bc5ca8ba0
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] transport/p2p.cc:275 NCCL WARN Cuda failure 'invalid argument'
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport/p2p.cc:330 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport/p2p.cc:460 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport.cc:165 -> 1
6f747114300646eb-5af0524bda8e4aee:4168:5271 [1] NCCL INFO ProxyCall UDS comm 0x55b2e3c71350 rank 1 tpRank 0(b7a7bd7508f9845d) reqSize 8 respSize 0 respFd 0x7f2be0fd7ca8 opId 0xa3cc92762de46676
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO init.cc:1263 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO init.cc:1548 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO group.cc:64 -> 1 [Async thread]
6f747114300646eb-5af0524bda8e4aee:4167:4167 [0] NCCL INFO group.cc:418 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:4167 [0] NCCL INFO init.cc:1929 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] NCCL INFO proxyUDSRecvReq::ncclProxyMsgGetFd rank 1 opId 0xa3cc92762de46676 handle=0x564e1ead6750
6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] NCCL INFO UDS proxyGetFd received handle 0x564e1ead6750 peer 1 opId a3cc92762de46676
6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] proxy.cc:1341 NCCL WARN Cuda failure 1 'invalid argument'
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/amlfs-01/home/jingwang/LEARNS/learn_verl_0512/verl/tests/workers/rollout/test_sglang_async_spmd.py", line 115, in <module>
[rank0]: test_sglang_spmd()
[rank0]: File "/mnt/amlfs-01/home/jingwang/LEARNS/learn_verl_0512/verl/tests/workers/rollout/test_sglang_async_spmd.py", line 97, in test_sglang_spmd
[rank0]: [outputs] = broadcast_pyobj(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils.py", line 893, in broadcast_pyobj
[rank0]: dist.broadcast(tensor_size, src=src, group=dist_group)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank0]: work = group.broadcast([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 'invalid argument'
6f747114300646eb-5af0524bda8e4aee:4167:5289 [0] NCCL INFO [Proxy Service UDS] exit: stop 0 abortFlag 1
[rank0]:[W513 05:03:06.244537893 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0513 05:03:07.672000 4162 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 4168 closing signal SIGTERM
E0513 05:03:08.292000 4162 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 4167) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
test_sglang_async_spmd.py FAILED
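Since most of the log above is routine proxy setup, a small filter can make the failing path easier to read. The following is only a sketch for triage, not part of sglang or NCCL: the function name `summarize` and both regular expressions are assumptions based on the line shapes visible in this log (`<file>.cc:<line> NCCL WARN ...` and `NCCL INFO <file>.cc:<line> -> <code>`).

```python
import re

# Warning lines, e.g.
#   "...:4167:5269 [0] transport/p2p.cc:275 NCCL WARN Cuda failure 'invalid argument'"
WARN_RE = re.compile(r"\[(\d+)\]\s+(\S+\.cc:\d+)\s+NCCL WARN\s+(.*)")
# Error-propagation lines, e.g.
#   "...:4167:5269 [0] NCCL INFO transport/p2p.cc:330 -> 1"
CHAIN_RE = re.compile(r"\[(\d+)\]\s+NCCL INFO\s+(\S+\.cc:\d+)\s+->\s+(\d+)")

# A few lines copied from the log in this issue.
SAMPLE = """\
6f747114300646eb-5af0524bda8e4aee:4168:5290 [1] NCCL INFO Received and initiated operation=Setup res=0
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] transport/p2p.cc:275 NCCL WARN Cuda failure 'invalid argument'
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport/p2p.cc:330 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport/p2p.cc:460 -> 1
6f747114300646eb-5af0524bda8e4aee:4167:5269 [0] NCCL INFO transport.cc:165 -> 1
"""


def summarize(log_text):
    """Return (warnings, chain) extracted from NCCL_DEBUG=INFO output.

    warnings: list of (device_index, source_location, message)
    chain:    list of (device_index, source_location, return_code)
    """
    warnings, chain = [], []
    for line in log_text.splitlines():
        m = WARN_RE.search(line)
        if m:
            warnings.append((int(m.group(1)), m.group(2), m.group(3).strip()))
            continue
        m = CHAIN_RE.search(line)
        if m:
            chain.append((int(m.group(1)), m.group(2), int(m.group(3))))
    return warnings, chain


if __name__ == "__main__":
    warns, chain = summarize(SAMPLE)
    for dev, where, msg in warns:
        print(f"WARN  [{dev}] {where}: {msg}")
    for dev, where, code in chain:
        print(f"CHAIN [{dev}] {where} -> {code}")
```

Applied to the full log, this should leave the two `NCCL WARN` lines (`transport/p2p.cc:275` and `proxy.cc:1341`) and the chain of `-> 1` returns on device `[0]`, which is where the P2P transport setup fails.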
Reproduction
- Use nvcr.io/nvidia/pytorch:24.08-py3 as base image
- Follow the exact setup process in verl
- Clone verl, then run the test from the rollout tests directory:
  cd verl/tests/workers/rollout
  NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes=1 --nproc_per_node=2 test_sglang_async_spmd.py
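To help tell an sglang/verl-specific problem apart from a general NCCL P2P setup failure on this machine, a standalone broadcast can be run under the same launcher. This is a minimal sketch, not the code path of the failing test: it broadcasts a small CUDA tensor over the default NCCL group, whereas the test passes its own `dist_group` to `broadcast_pyobj`. The script name `min_broadcast.py` and the helper `local_rank_from_env` are hypothetical.

```python
import os


def local_rank_from_env(env=None):
    """Read the per-node rank that torchrun exports as LOCAL_RANK (0 if unset)."""
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0"))


def main():
    # Imported lazily so the helper above stays usable without torch installed.
    import torch
    import torch.distributed as dist

    local_rank = local_rank_from_env()
    # Bind this process to its own GPU before any NCCL communicator is created.
    torch.cuda.set_device(local_rank)
    # torchrun provides RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
    dist.init_process_group(backend="nccl")

    # The first collective triggers NCCL communicator setup, which is the
    # stage where the log in this issue reports the P2P failure.
    payload = torch.tensor([local_rank + 1], dtype=torch.long, device="cuda")
    dist.broadcast(payload, src=0)
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: payload after broadcast = {payload.item()}")

    dist.destroy_process_group()


# Only run the distributed part when launched by torchrun (LOCAL_RANK is set).
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

It can be launched with the same settings as the test, for example `NCCL_DEBUG=INFO torchrun --nnodes=1 --nproc_per_node=2 min_broadcast.py`. If this also fails at `transport/p2p.cc`, the problem is likely in the NCCL/driver/container setup rather than in sglang; if it succeeds, the way the test builds its process group is the more likely place to look.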
Environment
Python: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.20
CUDA Driver Version: 550.127.08
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post1
sgl_kernel: 0.1.0
flashinfer: Module Not Found
triton: 3.2.0
transformers: 4.51.1
torchao: 0.11.0
numpy: 1.24.4
aiohttp: 3.10.1
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.31.1
interegular: 0.3.3
modelscope: 1.25.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 23.2
psutil: 6.0.0
pydantic: 2.8.2
multipart: Module Not Found
zmq: Module Not Found
uvicorn: 0.34.2
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.17
openai: 1.75.0
tiktoken: 0.9.0
anthropic: 0.51.0
litellm: 1.69.0
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  0-47          0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  0-47          0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  0-47          0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  0-47          0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   48-95         1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   48-95         1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   48-95         1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   48-95         1              N/A
NIC0  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE
NIC1  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE  SYS   SYS   SYS   SYS   NODE
NIC2  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  X     NODE  SYS   SYS   SYS   SYS   NODE
NIC3  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     SYS   SYS   SYS   SYS   NODE
NIC4  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  SYS
NIC5  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE  SYS
NIC6  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  X     NODE  SYS
NIC7  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     SYS
NIC8  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
Hypervisor vendor: Microsoft
ulimit soft: 1048576