[Bug] [AMD] ncclAllReduce Error under tp > 1 + multi nodes + enable cuda graph when running DeepSeek R1 on MI300X
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
When running DeepSeek R1 with tp = 16 on 2 nodes with CUDA graph enabled on MI300X, ncclAllReduce fails and the scheduler process crashes with a segmentation fault during CUDA graph capture (see the trace below).
Since --disable-cuda-graph hurts performance, can this be fixed or worked around without disabling CUDA graph?
Note: setting NCCL_DEBUG=TRACE makes both machines hang.
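Because NCCL_DEBUG=TRACE hangs both machines, a lower verbosity level written to per-process files might still capture something useful. The sketch below only builds such an environment. NCCL_DEBUG, NCCL_DEBUG_SUBSYS, and NCCL_DEBUG_FILE are standard NCCL variables that RCCL is expected to honor. The helper name, the chosen subsystems, and the log directory are assumptions, and whether this setting avoids the hang has not been tested.

```python
import os


def nccl_debug_env(log_dir="/tmp/nccl-logs", level="INFO", subsys="INIT,NET,COLL"):
    """Return a copy of os.environ with lower-verbosity NCCL logging enabled.

    log_dir, level and subsys are illustrative defaults, not values from the report.
    """
    env = dict(os.environ)
    env["NCCL_DEBUG"] = level            # INFO is less verbose than TRACE
    env["NCCL_DEBUG_SUBSYS"] = subsys    # restrict output to selected subsystems
    # %h expands to the hostname and %p to the PID, giving one log file per process
    env["NCCL_DEBUG_FILE"] = os.path.join(log_dir, "nccl.%h.%p.log")
    return env


env = nccl_debug_env()
print(env["NCCL_DEBUG"], env["NCCL_DEBUG_FILE"])
```

The returned dictionary could be passed as `env=` to `subprocess.run` when starting the launch command on each node, so the shell environment itself stays unchanged.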
Reproduction
- enable the IB in the host machine:
sudo apt-get install -y opensm ibutils infiniband-diags perftest rdmacm-utils ibverbs-utils
sudo apt-get install -y rdma-core
sudo systemctl enable rdma && sudo systemctl start rdma && sudo systemctl enable opensm && sudo systemctl start opensm
systemctl status opensm
sudo systemctl daemon-reload && sudo systemctl restart opensm
systemctl status opensm
- start the container:
sudo docker run -it --name=ttt --ipc=host --cap-add=SYS_PTRACE --network=host --device=/dev/kfd -v /mnt:/mnt --device=/dev/dri --security-opt seccomp=unconfined --group-add video --privileged -w /workspace lmsysorg/sglang:v0.4.2.post4-rocm630
- enable the IB in the container
apt list --installed 2>/dev/null | grep -i infiniband
apt-get update && apt-get install -y rdma-core ibverbs-utils infiniband-diags
- launch the server in two machines:
# node 0, ip = 10.0.0.11
NCCL_NET="IB" GLOO_SOCKET_IFNAME=eth0 NCCL_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path <deepseek-r1> --tp 16 --dist-init-addr 10.0.0.11:30000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 60001
# node 1
NCCL_NET="IB" GLOO_SOCKET_IFNAME=eth0 NCCL_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path <deepseek-r1> --tp 16 --dist-init-addr 10.0.0.11:30000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 60002
- error:
Thread 0x0000730492a00640 (most recent call first):
File "/usr/lib/python3.12/threading.py", line 359 in wait
File "/usr/lib/python3.12/threading.py", line 655 in wait
File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap
Thread 0x0000732bcd600640 (most recent call first):
File "/usr/lib/python3.12/threading.py", line 359 in wait
File "/usr/lib/python3.12/threading.py", line 655 in wait
File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap
Thread 0x0000736573800640 (most recent call first):
File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 55 in _recv_msg
File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 191 in _read_thread
File "/usr/lib/python3.12/threading.py", line 1012 in run
File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap
Current thread 0x0000736f7d7b6480 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 373 in ncclAllReduce
File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 138 in all_reduce
File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 412 in _all_reduce_in_place
File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 112 in inplace_all_reduce
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1122 in __call__
File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 398 in all_reduce
File "/sgl-workspace/sglang/python/sglang/srt/distributed/communication_op.py", line 13 in tensor_model_parallel_all_reduce
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 183 in forward
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 774 in forward
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 819 in forward
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 858 in forward
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 369 in run_once
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 376 in capture_one_batch_size
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 299 in capture
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 232 in __init__
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 730 in init_cuda_graphs
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 215 in __init__
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68 in __init__
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63 in __init__
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 240 in __init__
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1787 in run_scheduler_process
File "/usr/lib/python3.12/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.12/multiprocessing/spawn.py", line 135 in _main
File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122 in spawn_main
File "<string>", line 1 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, uvloop.loop, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, zmq.backend.cython._zmq, yaml._yaml, markupsafe._speedups, PIL._imaging, PIL._imagingft, hiredis.hiredis, msgspec._core, sentencepiece._sentencepiece, regex._regex, vllm.utils, vllm.sampling_params, vllm.sequence, roctxMarker, vllm.model_executor.layers.sampler, vllm.core.scheduler, vllm.engine.output_processor.stop_checker, msgpack._cmsgpack, google._upb._message, ray._raylet, vllm.transformers_utils.detokenizer, vllm.outputs, vllm.engine.llm_engine, cython.cimports.libc.math, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, 
pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, hip_utils, __triton_launcher (total: 106)
Fatal Python error: Segmentation fault
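In the dump above, the crashing thread is inside ncclAllReduce while cuda_graph_runner.py is capturing a batch size at startup. For comparing dumps from several ranks, a small parser like the following may help. It is a minimal sketch: the function names and the path marker are hypothetical, and it assumes the faulthandler layout shown in this report.

```python
import re

# Matches one faulthandler frame: File "<path>", line <n> in <function>
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+) in (?P<func>\S+)')


def current_thread_frames(dump):
    """Return (path, line, func) tuples for the 'Current thread' section of a dump."""
    frames = []
    in_current = False
    for raw in dump.splitlines():
        text = raw.strip()
        if text.startswith("Current thread"):
            in_current = True
            continue
        if in_current:
            found = FRAME_RE.findall(text)  # tolerates several frames fused on one line
            if not found:
                break  # the first non-frame line ends the section
            frames.extend((path, int(line), func) for path, line, func in found)
    return frames


def innermost_frame_in(frames, marker="/sglang/srt/model_executor/"):
    """Return the innermost frame whose path contains marker, or None."""
    for frame in frames:
        if marker in frame[0]:
            return frame
    return None


sample = '''Thread 0x0000730492a00640 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
Current thread 0x0000736f7d7b6480 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 373 in ncclAllReduce
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 138 in all_reduce
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 369 in run_once
Extension modules: numpy.core._multiarray_umath (total: 1)
'''

frames = current_thread_frames(sample)
print(frames[0][2])                # innermost function of the crashing thread
print(innermost_frame_in(frames))  # innermost frame inside model_executor
```

Frames are listed most recent call first, so the first element is the call that was executing when the segmentation fault occurred.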
Environment
docker: lmsysorg/sglang:v0.4.2.post4-rocm630
python3 -m sglang.check_env
ROCM available: True
GPU 0,1,2,3,4,5,6,7: AMD Instinct MI300X VF
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.4
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.3.42131-fa1d09cbd
ROCM Driver Version: 6.8.5
PyTorch: 2.6.0a0+git8d4926e
sglang: 0.4.2.post4
sgl_kernel: 0.0.3.post3
flashinfer: Module Not Found
triton: 3.0.0
transformers: 4.48.0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.7.dev2+g113274a0
openai: 1.61.1
anthropic: 0.45.2
decord: 0.6.0
AMD Topology:
============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 XGMI XGMI XGMI XGMI XGMI XGMI XGMI
GPU1 XGMI 0 XGMI XGMI XGMI XGMI XGMI XGMI
GPU2 XGMI XGMI 0 XGMI XGMI XGMI XGMI XGMI
GPU3 XGMI XGMI XGMI 0 XGMI XGMI XGMI XGMI
GPU4 XGMI XGMI XGMI XGMI 0 XGMI XGMI XGMI
GPU5 XGMI XGMI XGMI XGMI XGMI 0 XGMI XGMI
GPU6 XGMI XGMI XGMI XGMI XGMI XGMI 0 XGMI
GPU7 XGMI XGMI XGMI XGMI XGMI XGMI XGMI 0
================================== End of ROCm SMI Log ===================================
Hypervisor vendor: X
ulimit soft: 1048576
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.40.1702
node_guid: 0015:5dff:fe34:016b
sys_image_guid: 3868:dd03:01bd:7000
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MSF0000000047
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2054
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 28.40.1702
node_guid: 0015:5dff:fe34:016c
sys_image_guid: 3868:dd03:02bd:7001
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MSF0000000047
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2060
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 28.40.1702
node_guid: 0015:5dff:fe34:016d
sys_image_guid: 3868:dd03:03bd:7002
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MSF0000000047
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2055
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_3
transport: InfiniBand (0)
fw_ver: 28.40.1702
node_guid: 0015:5dff:fe34:016e
sys_image_guid: 3868:dd03:04bd:7003
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MSF0000000047
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2056
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_4
transport: InfiniBand (0)
fw_ver: 28.40.1702
node_guid: 0015:5dff:fe34:016f
sys_image_guid: 3868:dd03:05bd:7004
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MSF0000000047
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2058
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_5
transport: InfiniBand (0)
fw_ver: 28.40.1702
node_guid: 0015:5dff:fe34:0170
sys_image_guid: 3868:dd03:06bd:7005
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MSF0000000047
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2057
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_6
transport: InfiniBand (0)
fw_ver: 28.40.1702
node_guid: 0015:5dff:fe34:0171
sys_image_guid: 3868:dd03:07bd:7006
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MSF0000000047
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2059
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_7
transport: InfiniBand (0)
fw_ver: 28.40.1702
node_guid: 0015:5dff:fe34:0172
sys_image_guid: 3868:dd03:08bd:7007
vendor_id: 0x02c9
vendor_part_id: 4126
hw_ver: 0x0
board_id: MSF0000000047
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2061
port_lmc: 0x00
link_layer: InfiniBand
hca_id: mlx5_8
transport: InfiniBand (0)
fw_ver: 16.30.1284
node_guid: 0022:48ff:fe46:ceb6
sys_image_guid: 0000:0000:0000:0000
vendor_id: 0x02c9
vendor_part_id: 4122
hw_ver: 0x80
board_id: MSF0000000041
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
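The listing above shows eight HCAs with an InfiniBand link layer (mlx5_0 to mlx5_7) and one with an Ethernet link layer (mlx5_8, active_mtu 1024). A minimal sketch for separating the two groups from `ibv_devinfo` output follows. The function names are hypothetical, and the parser assumes one field per line and a single port per HCA. The resulting list could be passed through the standard NCCL_IB_HCA variable, but whether restricting the HCAs has any effect on this crash is untested.

```python
def parse_ibv_devinfo(text):
    """Parse `ibv_devinfo` output into {hca_id: {field: value}}."""
    devices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so GUID values keep their colons
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "hca_id":
            current = devices.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return devices


def infiniband_hcas(devices):
    """Return the HCA names whose link_layer is InfiniBand, in listing order."""
    return [name for name, fields in devices.items()
            if fields.get("link_layer") == "InfiniBand"]


sample = """hca_id: mlx5_0
transport: InfiniBand (0)
node_guid: 0015:5dff:fe34:016b
active_mtu: 4096 (5)
link_layer: InfiniBand
hca_id: mlx5_8
transport: InfiniBand (0)
node_guid: 0022:48ff:fe46:ceb6
active_mtu: 1024 (3)
link_layer: Ethernet
"""

devices = parse_ibv_devinfo(sample)
print(",".join(infiniband_hcas(devices)))  # candidate value for NCCL_IB_HCA
```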