sglang [Bug] [AMD] ncclAllReduce Error under tp > 1 + multi nodes + enable cuda graph when running DeepSeek R1 on MI300X

Open xinji1 opened this issue 2 weeks ago • 3 comments

Checklist

[x] 1. I have searched related issues but cannot get the expected help.
[x] 2. The bug has not been fixed in the latest version.
[x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
[x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
[x] 5. Please use English, otherwise it will be closed.

Describe the bug

When running DeepSeek R1 with tp = 16 across 2 nodes and CUDA graph enabled on MI300X, there are ncclAllReduce errors.

Since disabling CUDA graph (disable-cuda-graph) hurts performance, can this be fixed or worked around without disabling it?

Note: setting NCCL_DEBUG=TRACE makes both machines hang.
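
Because the TRACE level hangs both machines, a lower log level may still yield useful output. The following is a minimal sketch, not a verified fix: `nccl_debug_env` is a hypothetical helper, and it is an untested assumption that these settings avoid the hang with the RCCL build in this image. `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `NCCL_DEBUG_FILE` are standard NCCL environment variables.

```python
def nccl_debug_env(log_dir="/tmp", level="INFO", subsystems=("INIT", "NET")):
    """Hypothetical helper: NCCL logging settings at a lower verbosity than TRACE."""
    return {
        "NCCL_DEBUG": level,
        # Restrict logging to selected subsystems to keep the output small.
        "NCCL_DEBUG_SUBSYS": ",".join(subsystems),
        # NCCL expands %h to the host name and %p to the process ID,
        # so each rank writes to its own file.
        "NCCL_DEBUG_FILE": f"{log_dir}/nccl.%h.%p.log",
    }


# Render the settings as a prefix for the launch command.
prefix = " ".join(f"{key}={value}" for key, value in nccl_debug_env().items())
print(prefix)
```

Writing one log file per process keeps the ranks from interleaving their output on the terminal.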

Reproduction

Enable InfiniBand (IB) on the host machine:

sudo apt-get install -y opensm ibutils infiniband-diags perftest rdmacm-utils ibverbs-utils
sudo apt-get install -y rdma-core
sudo systemctl enable rdma && sudo systemctl start rdma && sudo systemctl enable opensm && sudo systemctl start opensm
systemctl status opensm
sudo systemctl daemon-reload && sudo systemctl restart opensm
systemctl status opensm

Start the container:

sudo docker run -it --name=ttt --ipc=host --cap-add=SYS_PTRACE --network=host --device=/dev/kfd -v /mnt:/mnt --device=/dev/dri --security-opt seccomp=unconfined --group-add video --privileged -w /workspace lmsysorg/sglang:v0.4.2.post4-rocm630

Enable IB in the container:

apt list --installed | grep -i infiniband
apt-get update && apt-get install -y rdma-core ibverbs-utils infiniband-diags

Launch the server on both machines:

# node 0, ip = 10.0.0.11
NCCL_NET="IB"  GLOO_SOCKET_IFNAME=eth0 NCCL_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path <deepseek-r1> --tp 16 --dist-init-addr 10.0.0.11:30000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 60001

# node 1
NCCL_NET="IB"  GLOO_SOCKET_IFNAME=eth0 NCCL_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path <deepseek-r1> --tp 16 --dist-init-addr 10.0.0.11:30000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 60002
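
The two commands differ only in `--node-rank` and `--port`. As a minimal sketch, the hypothetical helper `build_launch_cmd` (not part of sglang) generates both from the shared settings. The flags and environment variables are copied from the commands above; `--disable-cuda-graph` is the option referred to in the bug description and is included only to show where that workaround would go.

```python
import shlex

# Environment variables copied from the reproduction commands.
ENV = {"NCCL_NET": "IB", "GLOO_SOCKET_IFNAME": "eth0", "NCCL_SOCKET_IFNAME": "eth0"}


def build_launch_cmd(node_rank, port, model_path="<deepseek-r1>",
                     disable_cuda_graph=False):
    """Hypothetical helper: return the launch command for one node as a string."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", "16",
        "--dist-init-addr", "10.0.0.11:30000",
        "--nnodes", "2",
        "--node-rank", str(node_rank),
        "--trust-remote-code",
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    if disable_cuda_graph:
        # Workaround from the bug description; it costs performance.
        args.append("--disable-cuda-graph")
    env = " ".join(f"{key}={shlex.quote(value)}" for key, value in ENV.items())
    return env + " " + " ".join(shlex.quote(arg) for arg in args)


# Print the command for each node (rank, port) used in the reproduction.
for rank, port in [(0, 60001), (1, 60002)]:
    print(build_launch_cmd(rank, port))
```

Generating both commands from one place makes it harder for the two nodes to drift apart in `--tp`, `--nnodes`, or `--dist-init-addr`.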

Error:

Thread 0x0000730492a00640 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/threading.py", line 655 in wait
  File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x0000732bcd600640 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait
  File "/usr/lib/python3.12/threading.py", line 655 in wait
  File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x0000736573800640 (most recent call first):
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 55 in _recv_msg
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 191 in _read_thread
  File "/usr/lib/python3.12/threading.py", line 1012 in run
  File "/usr/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib/python3.12/threading.py", line 1032 in _bootstrap

Current thread 0x0000736f7d7b6480 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 373 in ncclAllReduce
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl.py", line 138 in all_reduce
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 412 in _all_reduce_in_place
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 112 in inplace_all_reduce
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1122 in __call__
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/parallel_state.py", line 398 in all_reduce
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/communication_op.py", line 13 in tensor_model_parallel_all_reduce
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 183 in forward
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 774 in forward
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 819 in forward
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 858 in forward
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 369 in run_once
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 376 in capture_one_batch_size
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 299 in capture
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 232 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 730 in init_cuda_graphs
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 215 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 68 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 240 in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1787 in run_scheduler_process
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 135 in _main
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122 in spawn_main
  File "<string>", line 1 in <module>
  
  
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, uvloop.loop, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, psutil._psutil_posix, setproctitle, zmq.backend.cython._zmq, yaml._yaml, markupsafe._speedups, PIL._imaging, PIL._imagingft, hiredis.hiredis, msgspec._core, sentencepiece._sentencepiece, regex._regex, vllm.utils, vllm.sampling_params, vllm.sequence, roctxMarker, vllm.model_executor.layers.sampler, vllm.core.scheduler, vllm.engine.output_processor.stop_checker, msgpack._cmsgpack, google._upb._message, ray._raylet, vllm.transformers_utils.detokenizer, vllm.outputs, vllm.engine.llm_engine, cython.cimports.libc.math, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, 
pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, hip_utils, __triton_launcher (total: 106)
Fatal Python error: Segmentation fault
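
The dump is long, and most frames belong to torch or the standard library. A minimal sketch for pulling out only the sglang frames of the crashing thread follows; `sglang_frames` is a hypothetical helper, and the sample is a shortened copy of the dump above.

```python
import re

# Matches one frame line of the dump: File "<path>", line <n> in <function>
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+) in (?P<func>\S+)')


def sglang_frames(dump):
    """Return (file name, line, function) for sglang frames of the current thread."""
    frames = []
    in_current = False
    for raw in dump.splitlines():
        if raw.startswith("Current thread"):
            in_current = True
            continue
        if in_current and not raw.strip():
            break  # a blank line ends the current thread's frame list
        if in_current:
            match = FRAME_RE.search(raw)
            if match and "/sglang/" in match["path"]:
                name = match["path"].rsplit("/", 1)[-1]
                frames.append((name, int(match["line"]), match["func"]))
    return frames


# Shortened sample taken from the dump in this report.
DUMP = '''Thread 0x0000730492a00640 (most recent call first):
  File "/usr/lib/python3.12/threading.py", line 359 in wait

Current thread 0x0000736f7d7b6480 (most recent call first):
  File "/sgl-workspace/sglang/python/sglang/srt/distributed/device_communicators/pynccl_wrapper.py", line 373 in ncclAllReduce
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1122 in __call__
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 376 in capture_one_batch_size
'''

for frame in sglang_frames(DUMP):
    print(frame)
```

Applied to the full dump, the filtered frames show that the crash happens in `ncclAllReduce` while `cuda_graph_runner.py` is capturing a graph during scheduler initialization.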

Environment

docker: lmsysorg/sglang:v0.4.2.post4-rocm630

python3 -m sglang.check_env

ROCM available: True
GPU 0,1,2,3,4,5,6,7: AMD Instinct MI300X VF
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.4
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.3.42131-fa1d09cbd
ROCM Driver Version: 6.8.5
PyTorch: 2.6.0a0+git8d4926e
sglang: 0.4.2.post4
sgl_kernel: 0.0.3.post3
flashinfer: Module Not Found
triton: 3.0.0
transformers: 4.48.0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.7.dev2+g113274a0
openai: 1.61.1
anthropic: 0.45.2
decord: 0.6.0

AMD Topology:


============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7
GPU0   0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI
GPU1   XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI
GPU2   XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI
GPU3   XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI
GPU4   XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI
GPU5   XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI
GPU6   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI
GPU7   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0
================================== End of ROCm SMI Log ===================================

Hypervisor vendor: X
ulimit soft: 1048576

ibv_devinfo

hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         28.40.1702
        node_guid:                      0015:5dff:fe34:016b
        sys_image_guid:                 3868:dd03:01bd:7000
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MSF0000000047
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               2054
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         28.40.1702
        node_guid:                      0015:5dff:fe34:016c
        sys_image_guid:                 3868:dd03:02bd:7001
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MSF0000000047
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               2060
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_2
        transport:                      InfiniBand (0)
        fw_ver:                         28.40.1702
        node_guid:                      0015:5dff:fe34:016d
        sys_image_guid:                 3868:dd03:03bd:7002
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MSF0000000047
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               2055
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_3
        transport:                      InfiniBand (0)
        fw_ver:                         28.40.1702
        node_guid:                      0015:5dff:fe34:016e
        sys_image_guid:                 3868:dd03:04bd:7003
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MSF0000000047
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               2056
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_4
        transport:                      InfiniBand (0)
        fw_ver:                         28.40.1702
        node_guid:                      0015:5dff:fe34:016f
        sys_image_guid:                 3868:dd03:05bd:7004
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MSF0000000047
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               2058
                        port_lmc:               0x00
                        link_layer:             InfiniBand                                                                                                                                                                                                                                                                                        

hca_id: mlx5_5
        transport:                      InfiniBand (0)
        fw_ver:                         28.40.1702
        node_guid:                      0015:5dff:fe34:0170
        sys_image_guid:                 3868:dd03:06bd:7005
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MSF0000000047
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               2057
                        port_lmc:               0x00
                        link_layer:             InfiniBand 


hca_id: mlx5_6
        transport:                      InfiniBand (0)
        fw_ver:                         28.40.1702
        node_guid:                      0015:5dff:fe34:0171
        sys_image_guid:                 3868:dd03:07bd:7006
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MSF0000000047
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               2059
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_7
        transport:                      InfiniBand (0)
        fw_ver:                         28.40.1702
        node_guid:                      0015:5dff:fe34:0172
        sys_image_guid:                 3868:dd03:08bd:7007
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MSF0000000047
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               2061
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_8
        transport:                      InfiniBand (0)
        fw_ver:                         16.30.1284
        node_guid:                      0022:48ff:fe46:ceb6
        sys_image_guid:                 0000:0000:0000:0000
        vendor_id:                      0x02c9
        vendor_part_id:                 4122
        hw_ver:                         0x80
        board_id:                       MSF0000000041
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
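
In the listing above, mlx5_0 through mlx5_7 report link_layer InfiniBand, while mlx5_8 reports Ethernet with an active MTU of 1024. A minimal sketch for separating the two groups from `ibv_devinfo` output follows; the helper names are hypothetical, and the parser assumes one port per device, as in this listing. The resulting list could be passed to the NCCL environment variable `NCCL_IB_HCA`; whether excluding the Ethernet device has any effect on this crash is an untested assumption.

```python
import re


def hcas_by_link_layer(devinfo):
    """Map each hca_id in `ibv_devinfo` output to its port's link_layer value."""
    result = {}
    current = None
    for line in devinfo.splitlines():
        match = re.match(r"\s*hca_id:\s*(\S+)", line)
        if match:
            current = match.group(1)
            continue
        match = re.match(r"\s*link_layer:\s*(\S+)", line)
        if match and current is not None:
            result[current] = match.group(1)
    return result


def infiniband_hcas(devinfo):
    """Return the devices whose link layer is InfiniBand, in listing order."""
    layers = hcas_by_link_layer(devinfo)
    return [hca for hca, layer in layers.items() if layer == "InfiniBand"]


# Shortened sample with one InfiniBand and one Ethernet device from the listing.
SAMPLE = """\
hca_id: mlx5_0
        transport:                      InfiniBand (0)
                port:   1
                        link_layer:             InfiniBand

hca_id: mlx5_8
        transport:                      InfiniBand (0)
                port:   1
                        link_layer:             Ethernet
"""

print("NCCL_IB_HCA=" + ",".join(infiniband_hcas(SAMPLE)))
```

The sketch keys on `link_layer` rather than `transport`, because every device in the listing, including the Ethernet one, reports `InfiniBand (0)` as its transport.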

Feb 17 '25 05:02 xinji1
