[Bug] torch.distributed.all_reduce raised Segmentation fault on 2 * 8 * H800

Open YEXINGZHE54 opened this issue 10 months ago • 2 comments

Checklist

[ ] 1. I have searched related issues but cannot get the expected help.
[ ] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
[ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
[ ] 5. Please use English, otherwise it will be closed.

Describe the bug

node 1 server Log:

Fatal Python error: Segmentation fault

Thread 0x00007f2e93fff640 (most recent call first):
  File "/XXXX/sglang/python/sglang/srt/managers/scheduler.py", line 462 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f2ea0870640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2501 in all_reduce
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83 in wrapper
  File "/XXXX/sglang/python/sglang/srt/distributed/parallel_state.py", line 414 in _all_reduce_in_place
  File "/XXXX/sglang/python/sglang/srt/distributed/parallel_state.py", line 112 in inplace_all_reduce
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116 in __call__
  File "/XXXX/sglang/python/sglang/srt/distributed/parallel_state.py", line 398 in all_reduce
  File "/XXXX/sglang/python/sglang/srt/distributed/communication_op.py", line 13 in tensor_model_parallel_all_reduce
  File "/XXXX/sglang/python/sglang/srt/models/deepseek_v2.py", line 183 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/XXXX/sglang/python/sglang/srt/models/deepseek_v2.py", line 787 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/XXXX/sglang/python/sglang/srt/models/deepseek_v2.py", line 835 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/XXXX/sglang/python/sglang/srt/models/deepseek_v2.py", line 874 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/XXXX/sglang/python/sglang/srt/model_executor/model_runner.py", line 781 in forward_idle
  File "/XXXX/sglang/python/sglang/srt/model_executor/model_runner.py", line 798 in forward
  File "/XXXX/sglang/python/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
  File "/XXXX/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/XXXX/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f313cffd640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f3ee6fc5640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f4440b27480 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 320 in wait
  File "/usr/lib/python3.10/queue.py", line 171 in get
  File "/XXXX/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 169 in resolve_batch_result
  File "/XXXX/sglang/python/sglang/srt/managers/scheduler.py", line 1123 in process_batch_result
  File "/XXXX/sglang/python/sglang/srt/managers/scheduler.py", line 519 in event_loop_overlap
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/XXXX/sglang/python/sglang/srt/managers/scheduler.py", line 1825 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, psutil._psutil_linux, psutil._psutil_posix, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, uvloop.loop, setproctitle, yaml._yaml, markupsafe._speedups, PIL._imaging, PIL._imagingft, msgspec._core, msgpack._cmsgpack, google._upb._message, ray._raylet, sentencepiece._sentencepiece, regex._regex, cuda_utils, __triton_launcher (total: 52)

worker-0:981198:984678 [6] NCCL INFO [Service thread] Connection closed by localRank 5
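For readers unfamiliar with the log format above: it is the format written by Python's standard `faulthandler` module, where the "Current thread" entry is the thread that was executing when the signal arrived (here, the one inside `all_reduce`). The sketch below is not taken from sglang; it only shows, under that assumption, how a dump of the same shape can be produced on demand. The helper name `dump_all_threads` is hypothetical.

```python
import faulthandler
import tempfile


def dump_all_threads() -> str:
    """Return a faulthandler-style traceback of every running thread.

    Hypothetical helper for illustration; not part of sglang.
    """
    # faulthandler writes straight to a file descriptor, so a real
    # temporary file is used instead of an in-memory buffer.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    # Installs handlers for SIGSEGV and similar fatal signals, so a crash
    # prints the same kind of per-thread dump to stderr.
    faulthandler.enable(all_threads=True)
    print(dump_all_threads())
```

The dump only covers Python frames; the native frame inside NCCL or c10d that actually faulted is not visible in it.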

Reproduction

Run the latest sglang code (commit 1eb8eade2bf6f69bf38c7d2706242775842131c5) with DeepSeek-R1 (671B) on 2 * 8 * H800, then run a benchmark at 0.5 QPS. After a short while, a Segmentation fault error appears in the server log of the node with rank 1.

node 0 server:

#!/bin/bash
NCCL_IB_DISABLE=0 python3 -m sglang.launch_server \
  --watchdog-timeout 36000 \
  --dist-init-addr dlc15h6z9j4v4llh-master-0:20000 \
  --model-path /mnt/models/DeepSeek-R1 \
  --nnodes 2 \
  --node-rank 0 \
  --log-level debug \
  --port 18005 \
  --context-length 12288 \
  --chunked-prefill-size 8192 \
  --tp 16 \
  --dp 1 \
  --schedule-policy random \
  --load-balance-method round_robin \
  --trust-remote-code \
  --enable-dp-attention

node 1 server:

#!/bin/bash
NCCL_IB_DISABLE=0 python3 -m sglang.launch_server \
  --watchdog-timeout 36000 \
  --dist-init-addr dlc15h6z9j4v4llh-master-0:20000 \
  --model-path /mnt/models/DeepSeek-R1 \
  --nnodes 2 \
  --node-rank 1 \
  --log-level debug \
  --port 18005 \
  --context-length 12288 \
  --chunked-prefill-size 8192 \
  --tp 16 \
  --dp 1 \
  --schedule-policy random \
  --load-balance-method round_robin \
  --trust-remote-code \
  --enable-dp-attention

Benchmark by calling http://127.0.0.1:18005/v1/completion once every 2 seconds (qps = 0.5).
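The benchmark script itself is not included in this report. A minimal sketch of such a load generator, using only the standard library, might look like the following. The URL is copied from the line above; the JSON payload fields (`prompt`, `max_tokens`), the prompt text, and the function names are assumptions for illustration, not the reporter's actual benchmark.

```python
import json
import time
import urllib.request

# Endpoint exactly as written in the report.
URL = "http://127.0.0.1:18005/v1/completion"


def build_request(prompt: str, url: str = URL) -> urllib.request.Request:
    """Build a JSON POST request; the payload fields are assumed."""
    body = json.dumps({"prompt": prompt, "max_tokens": 64}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run(num_requests, interval_s=2.0, send=urllib.request.urlopen, sleep=time.sleep):
    """Send requests one at a time, pausing interval_s after each one."""
    results = []
    for i in range(num_requests):
        req = build_request(f"Hello {i}")
        try:
            with send(req, timeout=600) as resp:
                results.append(resp.status)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            results.append(repr(exc))
        sleep(interval_s)
    return results


if __name__ == "__main__":
    print(run(num_requests=100))
```

Because the loop is sequential, the effective rate drops below 0.5 QPS whenever a response takes noticeable time; a closer match to a fixed request rate would issue requests from separate threads or an async client.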

Environment

aiohappyeyeballs==2.4.6
aiohttp==3.11.12
aiohttp-cors==0.7.0
aiosignal==1.3.2
airportsdata==20241001
annotated-types==0.7.0
anthropic==0.45.2
anyio==4.8.0
argcomplete==3.5.3
astor==0.8.1
asttokens==3.0.0
async-timeout==5.0.1
attrs==25.1.0
black==25.1.0
blake3==1.0.4
blinker==1.4
cachetools==5.5.1
certifi==2025.1.31
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
colorful==0.5.6
compressed-tensors==0.9.1
cryptography==3.4.8
cuda-bindings==12.8.0
cuda-python==12.8.0
datamodel-code-generator==0.27.3
dbus-python==1.2.18
decorator==5.1.1
decord==0.6.0
depyf==0.18.0
dill==0.3.9
diskcache==5.6.3
distlib==0.3.9
distro==1.7.0
distro-info==1.1+ubuntu0.2
einops==0.8.1
exceptiongroup==1.2.2
executing==2.2.0
fastapi==0.115.8
filelock==3.17.0
flashinfer-python==0.2.1.post2+cu124torch2.5
frozenlist==1.5.0
fsspec==2024.6.1
genson==1.3.0
gguf==0.10.0
google-api-core==2.24.1
google-auth==2.38.0
googleapis-common-protos==1.67.0
grpcio==1.70.0
h11==0.14.0
hf_transfer==0.1.9
html5lib==1.1
httpcore==1.0.7
httplib2==0.20.2
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.28.1
idna==3.10
importlib_metadata==8.6.1
inflect==5.6.2
iniconfig==2.0.0
interegular==0.3.3
ipython==8.32.0
isort==6.0.0
jedi==0.19.2
jeepney==0.7.1
Jinja2==3.1.5
jiter==0.8.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
keyring==23.5.0
lark==1.2.2
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
litellm==1.61.2
lm-format-enforcer==0.10.9
loguru==0.7.3
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib-inline==0.1.7
mdurl==0.1.2
mistral_common==1.5.3
modelscope==1.22.3
more-itertools==8.10.0
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.3
ninja==1.11.1.3
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-ml-py==12.570.86
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
oauthlib==3.2.0
openai==1.63.0
opencensus==0.11.4
opencensus-context==0.1.3
opencv-python-headless==4.11.0.86
orjson==3.10.15
outlines==0.1.11
outlines_core==0.1.26
packaging==24.2
pandas==2.2.3
parso==0.8.4
partial-json-parser==0.2.1.1.post5
pathspec==0.12.1
pexpect==4.9.0
pillow==11.1.0
platformdirs==4.3.6
pluggy==1.5.0
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
prompt_toolkit==3.0.50
propcache==0.2.1
proto-plus==1.26.0
protobuf==5.29.3
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
py-cpuinfo==9.0.0
py-spy==0.4.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pybind11==2.13.6
pycountry==24.6.1
pydantic==2.10.6
pydantic_core==2.27.2
Pygments==2.19.1
PyGObject==3.42.1
PyJWT==2.3.0
pyparsing==2.4.7
pytest==8.3.4
python-apt==2.4.0+ubuntu4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytz==2025.1
PyYAML==6.0.2
pyzmq==26.2.1
ray==2.42.1
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rpds-py==0.22.3
rsa==4.9
safetensors==0.5.2
SecretStorage==3.3.1
sentencepiece==0.2.0
setproctitle==1.3.4
sgl-kernel==0.0.3.post6
-e git+https://github.com/sgl-project/sglang.git@1eb8eade2bf6f69bf38c7d2706242775842131c5#egg=sglang&subdirectory=python
shellingham==1.5.4
six==1.17.0
smart-open==7.1.0
sniffio==1.3.1
ssh-import-id==5.11
stack-data==0.6.3
starlette==0.45.3
sympy==1.13.1
tiktoken==0.8.0
tokenizers==0.21.0
tomli==2.2.1
torch==2.5.1
torchao==0.8.0
torchaudio==2.5.1
torchvision==0.20.1
tqdm==4.67.1
traitlets==5.14.3
transformers==4.48.3
triton==3.1.0
typer==0.15.1
typing_extensions==4.12.2
tzdata==2025.1
unattended-upgrades==0.1
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
virtualenv==20.29.2
vllm==0.7.2
wadllib==1.3.6
watchfiles==1.0.4
wcwidth==0.2.13
webencodings==0.5.1
websockets==14.2
wrapt==1.17.2
xformers==0.0.28.post3
xgrammar==0.1.10
yarl==1.18.3
zipp==3.21.0

Feb 21 '25 06:02 YEXINGZHE54