sglang
[Bug] Frequent NCCL crash with SIGSEGV when deploying DeepSeek V3
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [X] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [X] 5. Please use English, otherwise it will be closed.
Describe the bug
Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3)
==== backtrace (tid: 212877) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000049b8a ncclMemoryPoolAlloc<ncclProxyOp>() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/utils.h:280
2 0x0000000000049b8a addProxyOpIfNeeded() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:180
3 0x0000000000049b8a addProxyOpIfNeeded() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:176
4 0x000000000004c496 addCBDCollToPlan() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:481
5 0x000000000004f5bd ncclLaunchPrepare() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:844
6 0x000000000004f5bd ncclLaunchPrepare() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1260
7 0x0000000000053d4b groupLaunch() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:129
8 0x0000000000053d4b groupLaunch() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:339
9 0x0000000000054f88 ncclGroupEndInternal() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:418
10 0x0000000000054f88 ncclGroupEndInternal() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:368
11 0x000000000004d74f ncclEnqueueCheck() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2032
12 0x0000000000044b36 ncclAllGather() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:26
13 0x00000000011fd1f3 c10d::ProcessGroupNCCL::_allgather_base() ???:0
14 0x0000000005f8e9b8 c10d::ops::(anonymous namespace)::_allgather_base_CUDA() Ops.cpp:0
15 0x0000000005f985cc c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_defa
ult_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at:
:Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10:
:detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call() :0
16 0x00000000055b224b c10::OperatorHandle::redispatchBoxed() :0
17 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl() autograd_not_implemented_fallback.cpp:0
18 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>() VariableFallbackKernel.cpp:0
19 0x0000000005f9fc2e c10::impl::BoxedKernelWrapper<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), void>::call() :0
20 0x0000000005fabfe8 c10d::ProcessGroup::_allgather_base() :0
21 0x0000000000df6c7e pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup
, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name con
st&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, at::Tensor&, at:
:Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybin
d11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybin
d11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name co
nst&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11
::detail::function_call&)#3}::_FUN() :0
22 0x00000000004cb474 pybind11::cpp_function::dispatcher() :0
23 0x000000000015a10e PyObject_CallFunctionObjArgs() ???:0
24 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
25 0x0000000000168acb PyMethod_New() ???:0
26 0x0000000000148cfa _PyEval_EvalFrameDefault() ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
28 0x0000000000169492 PyObject_Call() ???:0
29 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
30 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
31 0x000000000014453c _PyEval_EvalFrameDefault() ???:0
32 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
33 0x000000000014345c _PyEval_EvalFrameDefault() ???:0
34 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
35 0x000000000014326d _PyEval_EvalFrameDefault() ???:0
36 0x000000000016893e PyMethod_New() ???:0
37 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
38 0x000000000016893e PyMethod_New() ???:0
39 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
40 0x000000000014fc14 _PyObject_FastCallDictTstate() ???:0
41 0x000000000016586c _PyObject_Call_Prepend() ???:0
42 0x0000000000280700 PyInit__datetime() ???:0
43 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
44 0x0000000000149629 _PyEval_EvalFrameDefault() ???:0
45 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
46 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
47 0x00000000001687f1 PyMethod_New() ???:0
48 0x0000000000148cfa _PyEval_EvalFrameDefault() ???:0
49 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
50 0x000000000014345c _PyEval_EvalFrameDefault() ???:0
51 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
52 0x000000000014345c _PyEval_EvalFrameDefault() ???:0
53 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
54 0x000000000014345c _PyEval_EvalFrameDefault() ???:0
55 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
56 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
=================================
[2025-01-08 11:17:51 TP7] Scheduler hit an exception: Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1578, in run_scheduler_process
scheduler.event_loop_overlap()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 410, in event_loop_overlap
recv_reqs = self.recv_requests()
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 459, in recv_requests
recv_reqs = broadcast_pyobj(recv_reqs, self.tp_rank, self.tp_cpu_group)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils.py", line 731, in broadcast_pyobj
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [29.127.64.100]:26496
[2025-01-08 11:17:51 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1578, in run_scheduler_process
scheduler.event_loop_overlap()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 410, in event_loop_overlap
recv_reqs = self.recv_requests()
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 459, in recv_requests
recv_reqs = broadcast_pyobj(recv_reqs, self.tp_rank, self.tp_cpu_group)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils.py", line 731, in broadcast_pyobj
dist.broadcast(tensor_size, src=0, group=dist_group)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [29.127.64.100]:2711
Killed
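The Gloo "Connection closed by peer" tracebacks above are secondary failures: once one TP rank segfaults inside NCCL, the remaining schedulers die while waiting in broadcast_pyobj on the CPU gloo group. Below is a simplified sketch of that code path, not the actual sglang implementation, just the shape that is visible in the traceback:

```python
# Rough sketch of the broadcast_pyobj path from the traceback above (not the
# actual sglang code): rank 0 pickles the request list and broadcasts its size
# and bytes over the CPU gloo group; the other ranks block in dist.broadcast.
# When any rank dies first (here: the NCCL SIGSEGV), the survivors fail with
# "Connection closed by peer".
import pickle
import torch
import torch.distributed as dist

def broadcast_pyobj_sketch(data, rank, dist_group):
    if rank == 0:
        payload = torch.frombuffer(bytearray(pickle.dumps(data)), dtype=torch.uint8)
        tensor_size = torch.tensor([payload.numel()], dtype=torch.long)
        dist.broadcast(tensor_size, src=0, group=dist_group)  # the frame in the traceback
        dist.broadcast(payload, src=0, group=dist_group)
        return data
    tensor_size = torch.tensor([0], dtype=torch.long)
    dist.broadcast(tensor_size, src=0, group=dist_group)       # blocks here once rank 0 is gone
    payload = torch.empty(int(tensor_size.item()), dtype=torch.uint8)
    dist.broadcast(payload, src=0, group=dist_group)
    return pickle.loads(payload.numpy().tobytes())
```

So the root cause to chase is the NCCL segfault, not the Gloo error that follows it.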
Reproduction
node 1
python -m sglang.launch_server --model-path DeepSeek-V3 --tp 16 --nccl-init 29.127.64.100:5000 --nnodes 2 --node-rank 0 --trust-remote-code --port 80 --host 0.0.0.0 --schedule-conservativeness 0.3 --context-length 32768
node 2
python -m sglang.launch_server --model-path DeepSeek-V3 --tp 16 --nccl-init 29.127.64.100:5000 --nnodes 2 --node-rank 1 --trust-remote-code --port 80 --host 0.0.0.0 --schedule-conservativeness 0.3 --context-length 32768
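Before digging into sglang itself, a minimal two-node NCCL sanity check can help tell a fabric/driver problem apart from an sglang bug. This is only a sketch: it assumes torchrun is available, reuses the 29.127.64.100 address from the commands above, and exercises the same allreduce/allgather collectives that segfault in the backtraces.

```python
# nccl_sanity.py - hypothetical standalone check, 2 nodes x 8 GPUs = 16 ranks.
# Launch on each node with something like:
#   torchrun --nnodes=2 --nproc-per-node=8 --node-rank=<0|1> \
#            --master-addr=29.127.64.100 --master-port=5000 nccl_sanity.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    out = torch.empty(world * (1 << 20), device="cuda")
    for step in range(1000):
        x = torch.ones(1 << 20, device="cuda")       # ~4 MB of fp32 per rank
        dist.all_reduce(x)                            # the collective that segfaults above
        dist.all_gather_into_tensor(out, x)
        torch.cuda.synchronize()
        if rank == 0 and step % 100 == 0:
            print(f"step {step}: x[0]={x[0].item():.0f} (expect {world})")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this loop also crashes, the problem is below sglang (NCCL build, driver, or fabric); if it runs clean for a long time, the sglang-specific path becomes more suspicious.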
Environment
/usr/local/lib/python3.10/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'. Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:341: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
warnings.warn(message, UserWarning)
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.161.08
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post3
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.47.1
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.114.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.7
interegular: 0.3.3
modelscope: 1.21.1
orjson: 3.10.13
packaging: 24.0
psutil: 5.9.8
pydantic: 2.9.1
multipart: 0.0.20
zmq: 26.0.3
uvicorn: 0.30.6
uvloop: 0.20.0
vllm: 0.6.4.post1
openai: 1.58.1
anthropic: 0.42.0
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 NIC12 NIC13 NIC14 NIC15 NIC16 NIC17 NIC18 NIC19 NIC20 NIC21 NIC22 NIC23 NIC24 NIC25 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS PIX NODE NODE NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS NODE NODE PHB PIX SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS NODE NODE PIX PHB SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS NODE PIX NODE NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODENODE SYS SYS SYS SYS NODE NODE PIX NODE 96-191,288-383 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODENODE SYS SYS SYS SYS NODE PIX NODE NODE 96-191,288-383 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODENODE SYS SYS SYS SYS PHB NODE NODE PIX 96-191,288-383 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODENODE SYS SYS SYS SYS PIX NODE NODE PHB 96-191,288-383 1 N/A
NIC0 SYS SYS SYS SYS NODE NODE NODE NODE X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC1 SYS SYS SYS SYS NODE NODE NODE NODE PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC2 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC3 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC4 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC5 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC6 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC7 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC9 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC10 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC11 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC12 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC13 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC14 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC15 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIXPIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC16 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX SYS SYS SYS SYS NODE NODE NODE NODE
NIC17 SYS SYS SYS SYS NODE NODE NODE NODE PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X SYS SYS SYS SYS NODE NODE NODE NODE
NIC18 PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS X NODE NODE NODE SYS SYS SYS SYS
NIC19 NODE NODE NODE PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS NODE X NODE NODE SYS SYS SYS SYS
NIC20 NODE PHB PIX NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS NODE NODE X PHB SYS SYS SYS SYS
NIC21 NODE PIX PHB NODE SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYSSYS NODE NODE PHB X SYS SYS SYS SYS
NIC22 SYS SYS SYS SYS NODE NODE PHB PIX NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODENODE SYS SYS SYS SYS X NODE NODE PHB
NIC23 SYS SYS SYS SYS NODE PIX NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODENODE SYS SYS SYS SYS NODE X NODE NODE
NIC24 SYS SYS SYS SYS PIX NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODENODE SYS SYS SYS SYS NODE NODE X NODE
NIC25 SYS SYS SYS SYS NODE NODE PIX PHB NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODE NODENODE SYS SYS SYS SYS PHB NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
NIC12: mlx5_12
NIC13: mlx5_13
NIC14: mlx5_14
NIC15: mlx5_16
NIC16: mlx5_17
NIC17: mlx5_18
NIC18: mlx5_bond_1
NIC19: mlx5_bond_2
NIC20: mlx5_bond_3
NIC21: mlx5_bond_4
NIC22: mlx5_bond_5
NIC23: mlx5_bond_6
NIC24: mlx5_bond_7
NIC25: mlx5_bond_8
ulimit soft: 1024
I think this is due to a local NCCL error on your side, since no one has reported this before.
I also encountered the same problem.
The same problem occurs on H100*16, where NCCL coredumps during allreduce.
Thanks for pointing this out. @sitabulaixizawaluduo @CSEEduanyu
cc @zhyncs
Core was generated by `sglang::scheduler '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f673afbf86a in c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#2}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allredu--Type <RET> for more, q to quit, c to continue without paging--
ce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#2}, c10d::OpType, char const*, bool) [clone .constprop.0] ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
[Current thread is 1 (Thread 0x7f52f1fff640 (LWP 164587))]
(gdb)
(gdb) bt
#0 0x00007f673afbf86a in c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#2}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#2}, c10d::OpType, char const*, bool) [clone .constprop.0] ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#1 0x00007f673afc05e0 in c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&) ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#2 0x00007f673afc0d05 in c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&)
() from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so
#3 0x00007f677466328e in c10d::ops::(anonymous namespace)::allreduce_CUDA(c10::ArrayRefat::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_typec10d::ReduceOp > const&, std::optionalat::Tensor const&, long) ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#4 0x00007f6774666609 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocatorat::Tensor >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > > ()(c10::ArrayRefat::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_typec10d::ReduceOp > const&, std::optionalat::Tensor const&, long), std::tuple<std::vector<at::Tensor, std::allocatorat::Tensor >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > >, c10::guts::typelist::typelist<c10::ArrayRefat::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_typec10d::ReduceOp > const&, std::optionalat::Tensor const&, long> >, std::tuple<std::vector<at::Tensor, std::allocatorat::Tensor >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > > (c10::ArrayRefat::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_typec10d::ReduceOp > const&, std::optionalat::Tensor const&, long)>::call(c10::OperatorKernel, c10::DispatchKeySet, c10::ArrayRefat::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_typec10d::ReduceOp > const&, std::optionalat::Tensor const&, long) ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#5 0x00007f67746823ef in c10d::ProcessGroup::allreduce(std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&) ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#6 0x00007f6787990635 in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guardpybind11::gil_scoped_release >(c10:--Type <RET> for more, q to quit, c to continue without paging--
:intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > (c10d::ProcessGroup::)(std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guardpybind11::gil_scoped_release const&)::{lambda(c10d::ProcessGroup, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guardpybind11::gil_scoped_release >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guardpybind11::gil_scoped_release >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > (c10d::ProcessGroup::)(std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guardpybind11::gil_scoped_release const&)::{lambda(c10d::ProcessGroup, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > ()(c10d::ProcessGroup, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guardpybind11::gil_scoped_release const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#7 0x00007f678708d0e4 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#8 0x0000557dff248282 in ?? ()
#9 0x0000557dff23eb4b in _PyObject_MakeTpCall ()
#10 0x0000557dff255ebb in ?? ()
#11 0x0000557dff237b7a in _PyEval_EvalFrameDefault ()
#12 0x0000557dff248aec in _PyFunction_Vectorcall ()
#13 0x0000557dff256882 in PyObject_Call ()
#14 0x0000557dff234f59 in _PyEval_EvalFrameDefault ()
#15 0x0000557dff248aec in _PyFunction_Vectorcall ()
#16 0x0000557dff233cf2 in _PyEval_EvalFrameDefault ()
#17 0x0000557dff248aec in _PyFunction_Vectorcall ()
#18 0x0000557dff232ae8 in _PyEval_EvalFrameDefault ()
#19 0x0000557dff248aec in _PyFunction_Vectorcall ()
#20 0x0000557dff234f59 in _PyEval_EvalFrameDefault ()
#21 0x0000557dff248aec in _PyFunction_Vectorcall ()
#22 0x00007f678756ba20 in pybind11::object pybind11::detail::object_apipybind11::handle::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>(pybind11::detail::args_proxy&&, pybind11::detail::kwargs_proxy&&) const ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#23 0x00007f678789a21d in torch::impl::dispatch::PythonKernelHolder::operator()(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocatorc10::IValue >) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#24 0x00007f6787892720 in pybind11::object pybind11::detail::argument_loader<pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&>::call_impl<pybind11::object, torch::impl::dispatch::initDispatchBindings(_object)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}&, 0ul, 1ul, 2ul, 3ul, pybind11::detail::void_type>(torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, pybind11::detail::void_type&&) && [clone .isra.0] ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#25 0x00007f6787892cf0 in pybind11::cpp_function::initialize<torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}, pybind11::object, pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&, pybind11::name, pybind11::is_method, pybind11::sibling>(torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}&&, pybind11::object ()(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11--Type <RET> for more, q to quit, c to continue without paging--
::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#26 0x00007f678708d0e4 in pybind11::cpp_function::dispatcher(_object, _object*, _object*) ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#27 0x0000557dff248282 in ?? ()
#28 0x0000557dff23eb4b in _PyObject_MakeTpCall ()
#29 0x0000557dff256010 in ?? ()
#30 0x0000557dff234f59 in _PyEval_EvalFrameDefault ()
#31 0x0000557dff255d2e in ?? ()
#32 0x0000557dff234f59 in _PyEval_EvalFrameDefault ()
#33 0x0000557dff248aec in _PyFunction_Vectorcall ()
#34 0x00007f6787899935 in pybind11::object pybind11::detail::object_apipybind11::handle::operator()<(pybind11::return_value_policy)1, c10::DispatchKeySet&, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>(c10::DispatchKeySet&, pybind11::detail::args_proxy&&, pybind11::detail::kwargs_proxy&&) const () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#35 0x00007f678789a181 in torch::impl::dispatch::PythonKernelHolder::operator()(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocatorc10::IValue >) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#36 0x00007f6787892720 in pybind11::object pybind11::detail::argument_loader<pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&>::call_impl<pybind11::object, torch::impl::dispatch::initDispatchBindings(_object)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}&, 0ul, 1ul, 2ul, 3ul, pybind11::detail::void_type>(torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, pybind11::detail::void_type&&) && [clone .isra.0] ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#37 0x00007f6787892cf0 in pybind11::cpp_function::initialize<torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}, pybind11::object, pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&, pybind11::name, pybind11::is_method, pybind11::sibling>(torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}&&, pybind11::object ()(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#38 0x00007f678708d0e4 in pybind11::cpp_function::dispatcher(_object, _object*, _object*) ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#39 0x0000557dff248282 in ?? ()
#40 0x0000557dff23eb4b in _PyObject_MakeTpCall ()
#41 0x0000557dff256010 in ?? ()
#42 0x0000557dff234f59 in _PyEval_EvalFrameDefault ()
#43 0x0000557dff255d2e in ?? ()
#44 0x0000557dff234f59 in _PyEval_EvalFrameDefault ()
#45 0x0000557dff248aec in _PyFunction_Vectorcall ()
#46 0x00007f67874534f9 in THPFunction_apply(_object*, _object*) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#47 0x0000557dff2482a8 in ?? ()
#48 0x0000557dff25681b in PyObject_Call ()
#49 0x0000557dff238eed in _PyEval_EvalFrameDefault ()
#50 0x0000557dff255d2e in ?? ()
#51 0x0000557dff234f59 in _PyEval_EvalFrameDefault ()
#52 0x0000557dff248aec in _PyFunction_Vectorcall ()
#53 0x00007f6787899935 in pybind11::object pybind11::detail::object_apipybind11::handle::operator()<(pybind11::return_value_policy)1, c10::DispatchKeySet&, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>(c10::DispatchKeySet&, pybind11::detail::args_proxy&&, pybind11::detail::kwargs_proxy&&) const () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#54 0x00007f678789a181 in torch::impl::dispatch::PythonKernelHolder::operator()(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocatorc10::IValue >) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
--Type <RET> for more, q to quit, c to continue without paging--
#55 0x00007f67878a4d98 in c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocatorc10::IValue >) const ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#56 0x00007f67876328a1 in torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, pybind11::args, pybind11::kwargs const&, std::optionalc10::DispatchKey) ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#57 0x00007f6787632bf9 in torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optionalc10::DispatchKey) ()
from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so
#58 0x00007f6787513e13 in pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::basic_string<char, std::char_traits
Can you also output all of your NCCL env vars? And have you run an NCCL benchmark?
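For reference, a quick way to capture the NCCL-related environment as seen by the worker process (a sketch; run it in the same container/shell that launches sglang.launch_server):

```python
# Print every NCCL-/Gloo-/Torch-NCCL-related environment variable so it can be
# pasted into the issue; run inside the environment that starts the server.
import os

for key in sorted(os.environ):
    if key.startswith(("NCCL_", "GLOO_", "TORCH_NCCL_")):
        print(f"{key}={os.environ[key]}")
```

By "NCCL benchmark" I mean something like all_reduce_perf from the nccl-tests repo, run across both nodes with the same NIC/interface settings as the sglang launch.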
I have encountered similar segmentation fault errors several times after running sglang for DeepSeek-R1 inference for a while.
Error 1:
[node-1:4306 :0:10324] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1000000a1)
==== backtrace (tid: 10324) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x00000000000494f4 uploadProxyOps() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1131
2 0x0000000000051a7f hostStreamPlanTask() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1163
3 0x0000000000051bd9 hostStreamPlanCallback() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1175
4 0x000000000025720d cuEGLApiInit() ???:0
5 0x000000000026cf43 cuEGLApiInit() ???:0
6 0x0000000000094ac3 pthread_condattr_setpshared() ???:0
7 0x0000000000126850 __xmknodat() ???:0
=================================
Fatal Python error: Segmentation fault
Thread 0x00007f79fafc5640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f6a497fe640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 461 in watchdog_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f6a20fff640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 512 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 757 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 819 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 858 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 750 in forward_extend
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 785 in forward
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f79fb7c6640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f7c00c7d640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 47 in _recv_msg
File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 153 in _read_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f833ea89740 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/usr/lib/python3.10/queue.py", line 171 in get
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 165 in resolve_batch_result
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1133 in process_batch_result_prefill
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1105 in process_batch_result
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 518 in event_loop_overlap
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1782 in run_scheduler_process
File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "<string>", line 1 in <module>
Error 2:
[node-1:5377 :0:10263] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd11b7734a0)
==== backtrace (tid: 10263) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x0000000000049b9e ncclMemoryPoolAlloc<ncclProxyOp>() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/utils.h:289
2 0x0000000000049b9e addProxyOpIfNeeded() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:180
3 0x0000000000049b9e addProxyOpIfNeeded() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:176
4 0x000000000004c496 addCBDCollToPlan() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:481
5 0x000000000004f5bd ncclLaunchPrepare() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:844
6 0x000000000004f5bd ncclLaunchPrepare() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1260
7 0x0000000000053d4b groupLaunch() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:129
8 0x0000000000053d4b groupLaunch() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:339
9 0x0000000000054f88 ncclGroupEndInternal() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:418
10 0x0000000000054f88 ncclGroupEndInternal() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:368
11 0x000000000004d74f ncclEnqueueCheck() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2032
12 0x00000000000452af ncclAllReduce() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:50
13 0x00000000011e06ef c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}>() ProcessGroupNCCL.cpp:0
14 0x00000000011e18ac c10d::ProcessGroupNCCL::allreduce_impl() ???:0
15 0x00000000011e21a5 c10d::ProcessGroupNCCL::allreduce() ???:0
16 0x0000000005f8f68e c10d::ops::(anonymous namespace)::allreduce_CUDA() Ops.cpp:0
17 0x0000000005f9a1d4 c10::impl::call_functor_with_args_from_stack_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false, 0ul, 1ul, 2ul, 3ul, 4ul, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long>() :0
18 0x0000000005f9b389 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false>::call() :0
19 0x00000000055b224b c10::OperatorHandle::redispatchBoxed() :0
20 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl() autograd_not_implemented_fallback.cpp:0
21 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>() VariableFallbackKernel.cpp:0
22 0x0000000005fa0a35 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), void>::call() :0
23 0x0000000005fae9bd c10d::ProcessGroup::allreduce() :0
24 0x0000000000df9dc5 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN() :0
25 0x00000000004cb474 pybind11::cpp_function::dispatcher() :0
26 0x000000000015a10e PyObject_CallFunctionObjArgs() ???:0
27 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
28 0x0000000000168acb PyMethod_New() ???:0
29 0x0000000000148cfa _PyEval_EvalFrameDefault() ???:0
30 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
31 0x0000000000169492 PyObject_Call() ???:0
32 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
33 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
34 0x000000000014453c _PyEval_EvalFrameDefault() ???:0
35 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
36 0x000000000014345c _PyEval_EvalFrameDefault() ???:0
37 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
38 0x00000000009cabc0 pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>() :0
39 0x0000000000cf4999 torch::impl::dispatch::PythonKernelHolder::operator()() :0
40 0x00000000055b224b c10::OperatorHandle::redispatchBoxed() :0
41 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl() autograd_not_implemented_fallback.cpp:0
42 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>() VariableFallbackKernel.cpp:0
43 0x0000000000cff728 c10::Dispatcher::callBoxed() ???:0
44 0x0000000000a8e136 torch::jit::invokeOperatorFromPython() ???:0
45 0x0000000000a8e447 torch::jit::_get_operation_for_overload_or_packet() ???:0
46 0x0000000000976c22 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN() init.cpp:0
47 0x00000000004cb474 pybind11::cpp_function::dispatcher() :0
48 0x000000000015a10e PyObject_CallFunctionObjArgs() ???:0
49 0x000000000016942b PyObject_Call() ???:0
50 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0
51 0x000000000015a9fc _PyFunction_Vectorcall() ???:0
52 0x000000000014fcbd _PyObject_FastCallDictTstate() ???:0
53 0x000000000016586c _PyObject_Call_Prepend() ???:0
54 0x0000000000280700 PyInit__datetime() ???:0
55 0x0000000000150a7b _PyObject_MakeTpCall() ???:0
56 0x000000000014a150 _PyEval_EvalFrameDefault() ???:0
=================================
Fatal Python error: Segmentation fault
Thread 0x00007fb607ffd640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007fb60fffe640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 461 in watchdog_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Current thread 0x00007fb617fff640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2501 in all_reduce
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83 in wrapper
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 414 in _all_reduce_in_place
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 112 in inplace_all_reduce
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116 in __call__
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 398 in all_reduce
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/communication_op.py", line 13 in tensor_model_parallel_all_reduce
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 183 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 774 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 819 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 858 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 750 in forward_extend
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 785 in forward
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007fbcb3fff640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007fca8ffff640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 47 in _recv_msg
File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 153 in _read_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007fd398d50480 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 167 in resolve_batch_result
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1225 in process_batch_result_decode
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1101 in process_batch_result
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 518 in event_loop_overlap
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1782 in run_scheduler_process
File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "<string>", line 1 in <module>
I have encountered several similar segmentation faults after running sglang for DeepSeek-R1 inference for a while.
Error 1:
[node-1:4306 :0:10324] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1000000a1) ==== backtrace (tid: 10324) ==== 0 0x0000000000042520 __sigaction() ???:0 1 0x00000000000494f4 uploadProxyOps() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1131 2 0x0000000000051a7f hostStreamPlanTask() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1163 3 0x0000000000051bd9 hostStreamPlanCallback() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1175 4 0x000000000025720d cuEGLApiInit() ???:0 5 0x000000000026cf43 cuEGLApiInit() ???:0 6 0x0000000000094ac3 pthread_condattr_setpshared() ???:0 7 0x0000000000126850 __xmknodat() ???:0 ================================= Fatal Python error: Segmentation fault Thread 0x00007f79fafc5640 (most recent call first): File "/usr/lib/python3.10/threading.py", line 324 in wait File "/usr/lib/python3.10/threading.py", line 607 in wait File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap Thread 0x00007f6a497fe640 (most recent call first): File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 461 in watchdog_thread File "/usr/lib/python3.10/threading.py", line 953 in run File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap Thread 0x00007f6a20fff640 (most recent call first): File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 512 in forward File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 757 in forward File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 819 in forward File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 858 in forward File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 750 in forward_extend File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 785 in forward File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_ File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func File "/usr/lib/python3.10/threading.py", line 953 in run File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner File "/usr/lib/python3.10/threading.py", 
line 973 in _bootstrap Thread 0x00007f79fb7c6640 (most recent call first): File "/usr/lib/python3.10/threading.py", line 324 in wait File "/usr/lib/python3.10/threading.py", line 607 in wait File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap Thread 0x00007f7c00c7d640 (most recent call first): File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 47 in _recv_msg File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 153 in _read_thread File "/usr/lib/python3.10/threading.py", line 953 in run File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap Thread 0x00007f833ea89740 (most recent call first): File "/usr/lib/python3.10/threading.py", line 320 in wait File "/usr/lib/python3.10/queue.py", line 171 in get File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 165 in resolve_batch_result File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1133 in process_batch_result_prefill File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1105 in process_batch_result File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 518 in event_loop_overlap File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1782 in run_scheduler_process File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main File "<string>", line 1 in <module>Error 2:
[node-1:5377 :0:10263] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd11b7734a0) ==== backtrace (tid: 10263) ==== 0 0x0000000000042520 __sigaction() ???:0 1 0x0000000000049b9e ncclMemoryPoolAlloc<ncclProxyOp>() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/utils.h:289 2 0x0000000000049b9e addProxyOpIfNeeded() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:180 3 0x0000000000049b9e addProxyOpIfNeeded() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:176 4 0x000000000004c496 addCBDCollToPlan() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:481 5 0x000000000004f5bd ncclLaunchPrepare() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:844 6 0x000000000004f5bd ncclLaunchPrepare() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1260 7 0x0000000000053d4b groupLaunch() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:129 8 0x0000000000053d4b groupLaunch() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:339 9 0x0000000000054f88 ncclGroupEndInternal() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:418 10 0x0000000000054f88 ncclGroupEndInternal() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:368 11 0x000000000004d74f ncclEnqueueCheck() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2032 12 0x00000000000452af ncclAllReduce() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:50 13 0x00000000011e06ef c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}>() ProcessGroupNCCL.cpp:0 14 0x00000000011e18ac c10d::ProcessGroupNCCL::allreduce_impl() ???:0 15 0x00000000011e21a5 c10d::ProcessGroupNCCL::allreduce() ???:0 16 0x0000000005f8f68e c10d::ops::(anonymous namespace)::allreduce_CUDA() Ops.cpp:0 17 0x0000000005f9a1d4 c10::impl::call_functor_with_args_from_stack_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, 
c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false, 0ul, 1ul, 2ul, 3ul, 4ul, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long>() :0 18 0x0000000005f9b389 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false>::call() :0 19 0x00000000055b224b c10::OperatorHandle::redispatchBoxed() :0 20 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl() autograd_not_implemented_fallback.cpp:0 21 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>() VariableFallbackKernel.cpp:0 22 0x0000000005fa0a35 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), void>::call() :0 23 0x0000000005fae9bd c10d::ProcessGroup::allreduce() :0 24 0x0000000000df9dc5 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, 
c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN() :0 25 0x00000000004cb474 pybind11::cpp_function::dispatcher() :0 26 0x000000000015a10e PyObject_CallFunctionObjArgs() ???:0 27 0x0000000000150a7b _PyObject_MakeTpCall() ???:0 28 0x0000000000168acb PyMethod_New() ???:0 29 0x0000000000148cfa _PyEval_EvalFrameDefault() ???:0 30 0x000000000015a9fc _PyFunction_Vectorcall() ???:0 31 0x0000000000169492 PyObject_Call() ???:0 32 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0 33 0x000000000015a9fc _PyFunction_Vectorcall() ???:0 34 0x000000000014453c _PyEval_EvalFrameDefault() ???:0 35 0x000000000015a9fc _PyFunction_Vectorcall() ???:0 36 0x000000000014345c _PyEval_EvalFrameDefault() ???:0 37 0x000000000015a9fc _PyFunction_Vectorcall() ???:0 38 0x00000000009cabc0 pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>() :0 39 0x0000000000cf4999 torch::impl::dispatch::PythonKernelHolder::operator()() :0 40 0x00000000055b224b c10::OperatorHandle::redispatchBoxed() :0 41 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl() autograd_not_implemented_fallback.cpp:0 42 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>() VariableFallbackKernel.cpp:0 43 0x0000000000cff728 c10::Dispatcher::callBoxed() ???:0 44 0x0000000000a8e136 torch::jit::invokeOperatorFromPython() ???:0 45 0x0000000000a8e447 torch::jit::_get_operation_for_overload_or_packet() ???:0 46 0x0000000000976c22 
pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN() init.cpp:0 47 0x00000000004cb474 pybind11::cpp_function::dispatcher() :0 48 0x000000000015a10e PyObject_CallFunctionObjArgs() ???:0 49 0x000000000016942b PyObject_Call() ???:0 50 0x00000000001455d7 _PyEval_EvalFrameDefault() ???:0 51 0x000000000015a9fc _PyFunction_Vectorcall() ???:0 52 0x000000000014fcbd _PyObject_FastCallDictTstate() ???:0 53 0x000000000016586c _PyObject_Call_Prepend() ???:0 54 0x0000000000280700 PyInit__datetime() ???:0 55 0x0000000000150a7b _PyObject_MakeTpCall() ???:0 56 0x000000000014a150 _PyEval_EvalFrameDefault() ???:0 ================================= Fatal Python error: Segmentation fault Thread 0x00007fb607ffd640 (most recent call first): File "/usr/lib/python3.10/threading.py", line 324 in wait File "/usr/lib/python3.10/threading.py", line 607 in wait File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap Thread 0x00007fb60fffe640 (most recent call first): File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 461 in watchdog_thread File "/usr/lib/python3.10/threading.py", line 953 in run File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap Current thread 0x00007fb617fff640 (most recent call first): File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2501 in all_reduce File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83 in wrapper File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 414 in _all_reduce_in_place File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 112 in inplace_all_reduce File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116 in __call__ File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 398 in all_reduce File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/communication_op.py", line 13 in tensor_model_parallel_all_reduce File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 183 in forward File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 774 in forward File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 819 in 
forward File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 858 in forward File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 750 in forward_extend File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 785 in forward File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_ File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func File "/usr/lib/python3.10/threading.py", line 953 in run File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap Thread 0x00007fbcb3fff640 (most recent call first): File "/usr/lib/python3.10/threading.py", line 324 in wait File "/usr/lib/python3.10/threading.py", line 607 in wait File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap Thread 0x00007fca8ffff640 (most recent call first): File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 47 in _recv_msg File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 153 in _read_thread File "/usr/lib/python3.10/threading.py", line 953 in run File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap Thread 0x00007fd398d50480 (most recent call first): File "/usr/lib/python3.10/threading.py", line 320 in wait File "/usr/lib/python3.10/threading.py", line 607 in wait File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 167 in resolve_batch_result File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1225 in process_batch_result_decode File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1101 in process_batch_result File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 518 in event_loop_overlap File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1782 in run_scheduler_process File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main File "<string>", line 1 in <module>
Have you found the specific reason? I deployed multiple DeepSeek-V3 instances; for the same request, some nodes hit this problem while others computed normally.
Unfortunately, I have not found the specific reason. What I do now is restart the instance whenever I get this error. Sometimes that works.
Same solution, but some instances trigger restarts too often, which degrades the quality of service.
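Until the root cause is found, one way to automate that restart workaround is a small supervisor loop around the launch command. This is only a sketch; the model path and flags are placeholders, not the exact command used here:

```bash
#!/bin/bash
# Relaunch the server whenever it exits abnormally (e.g. after the SIGSEGV above).
while true; do
  python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16
  code=$?
  [ "$code" -eq 0 ] && break              # clean shutdown: stop supervising
  echo "server exited with code $code, restarting in 10s..." >&2
  sleep 10
done
```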
Same issue, has anyone figured it out?
Thanks so much for bringing this to our attention. This is really urgent and we are working on it.
Similar problem, on 16xH800
(all lines logged at 2025-02-08 23:45:46 on jo-dardhmricga77dqo-worker-1)
Current thread 0x00007faad27fc700 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3435 in all_gather_into_tensor
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83 in wrapper
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 649 in all_gather
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 768 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 819 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 858 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 770 in forward_idle
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 787 in forward
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap
The same problem happens when using vLLM, and disabling CUDA graph doesn't help.
Will try to reproduce it with NCCL debug flags on.
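For anyone else who wants to capture the same information, these are the standard NCCL debug environment variables; the subsystem list and log path below are only examples, and the launch command is a placeholder:

```bash
# Verbose NCCL logging before launching the server.
export NCCL_DEBUG=INFO                      # or TRACE for even more detail
export NCCL_DEBUG_SUBSYS=INIT,COLL,P2P,NET  # limit output to these subsystems
export NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log  # one log per host (%h) and pid (%p)

# Then start the server as usual, e.g.
# python -m sglang.launch_server --model-path <model> --tp 16 ...
```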
Similar problem, on 8xH20.
Does anyone have any ideas for solving it? I encountered the same problem on H100.
[node11:265 :0:4597] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x52b00e)
==== backtrace (tid: 4597) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000004f03d ncclLaunchPrepare() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:574
2 0x000000000004f03d ncclLaunchPrepare() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1275
3 0x0000000000053d4b groupLaunch() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:129
4 0x0000000000053d4b groupLaunch() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:339
5 0x0000000000054f88 ncclGroupEndInternal() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:418
6 0x0000000000054f88 ncclGroupEndInternal() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:368
7 0x000000000004d74f ncclEnqueueCheck() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2032
8 0x00000000000452af ncclAllReduce() /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:50
9 0x00000000011e06ef c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}>() ProcessGroupNCCL.cpp:0
10 0x00000000011e18ac c10d::ProcessGroupNCCL::allreduce_impl() ???:0
11 0x00000000011e21a5 c10d::ProcessGroupNCCL::allreduce() ???:0
12 0x0000000005f8f68e c10d::ops::(anonymous namespace)::allreduce_CUDA() Ops.cpp:0
13 0x0000000005f9a1d4 c10::impl::call_functor_with_args_from_stack_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false, 0ul, 1ul, 2ul, 3ul, 4ul, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long>() :0
14 0x0000000005f9b389 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false>::call() :0
15 0x00000000055b224b c10::OperatorHandle::redispatchBoxed() :0
16 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl() autograd_not_implemented_fallback.cpp:0
17 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>() VariableFallbackKernel.cpp:0
18 0x0000000005fa0a35 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), void>::call() :0
19 0x0000000005fae9bd c10d::ProcessGroup::allreduce() :0
20 0x0000000000df9dc5 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN() :0
21 0x00000000004cb474 pybind11::cpp_function::dispatcher() :0
22 0x000000000018ab32 PyObject_CallFunctionObjArgs() ???:0
23 0x000000000018139b _PyObject_MakeTpCall() ???:0
24 0x00000000001987ab PyMethod_New() ???:0
25 0x000000000017a702 _PyEval_EvalFrameDefault() ???:0
26 0x000000000018b38c _PyFunction_Vectorcall() ???:0
27 0x0000000000199172 PyObject_Call() ???:0
28 0x0000000000177c30 _PyEval_EvalFrameDefault() ???:0
29 0x000000000018b38c _PyFunction_Vectorcall() ???:0
30 0x00000000001769ab _PyEval_EvalFrameDefault() ???:0
31 0x000000000018b38c _PyFunction_Vectorcall() ???:0
32 0x000000000017597f _PyEval_EvalFrameDefault() ???:0
33 0x000000000018b38c _PyFunction_Vectorcall() ???:0
34 0x00000000009cabc0 pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>() :0
35 0x0000000000cf4999 torch::impl::dispatch::PythonKernelHolder::operator()() :0
36 0x00000000055b224b c10::OperatorHandle::redispatchBoxed() :0
37 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl() autograd_not_implemented_fallback.cpp:0
38 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>() VariableFallbackKernel.cpp:0
39 0x0000000000cff728 c10::Dispatcher::callBoxed() ???:0
40 0x0000000000a8e136 torch::jit::invokeOperatorFromPython() ???:0
41 0x0000000000a8e447 torch::jit::_get_operation_for_overload_or_packet() ???:0
42 0x0000000000976c22 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN() init.cpp:0
43 0x00000000004cb474 pybind11::cpp_function::dispatcher() :0
44 0x000000000018ab32 PyObject_CallFunctionObjArgs() ???:0
45 0x000000000019910b PyObject_Call() ???:0
46 0x000000000017b6ef _PyEval_EvalFrameDefault() ???:0
47 0x000000000018b38c _PyFunction_Vectorcall() ???:0
48 0x000000000018061d _PyObject_FastCallDictTstate() ???:0
49 0x000000000019562c _PyObject_Call_Prepend() ???:0
50 0x000000000029d464 PyInit__datetime() ???:0
51 0x000000000018139b _PyObject_MakeTpCall() ???:0
52 0x000000000017b99e _PyEval_EvalFrameDefault() ???:0
53 0x000000000018b38c _PyFunction_Vectorcall() ???:0
54 0x000000000017597f _PyEval_EvalFrameDefault() ???:0
55 0x000000000018b38c _PyFunction_Vectorcall() ???:0
56 0x0000000000175790 _PyEval_EvalFrameDefault() ???:0
=================================
Fatal Python error: Segmentation fault
My previous launch command included the --enable-dp-attention option. After I turned that option off, the model served hundreds of inference requests normally. I noticed that other users' commands did not enable this option, but I'm not sure whether it is actually the cause.
cc @FrankLeeeee Could you take a look
I am looking into this issue. The error occurs during the model forward pass, and based on @jt-z 's description it appears related to dp attention. May I know how long it takes to see this error once the server is booted? @jt-z
I found that this problem seems to occur when that option is enabled on the image updated yesterday. My impression is that there was no problem when the dp option was enabled on the image tagged with CUDA 12.4. After the image was updated to the one tagged with Triton and the dp option was added, this error occurred once the deployment started serving inference requests.
I used R1 with 2x8 H100, did not turn on --enable-dp-attention, and got the same error.
I encountered the same issue. The error occurs randomly, for example:
[master0:107 :0:107] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1) Fatal Python error: Segmentation fault
Sorry for raising this once again; I am losing track of the DeepSeek issue. How is it going? @FrankLeeeee
Got the same error with 2x8 H100 and without turning on --enable-dp-attention.
Try enabling the --disable-custom-all-reduce parameter to see if it can work around the issue.
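For reference, a minimal sketch of what that looks like on the command line; the model path and --tp value are placeholders, and the exact flag spelling may differ slightly between sglang versions:

```bash
# Fall back to NCCL all-reduce instead of the custom all-reduce kernel.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 \
  --trust-remote-code \
  --disable-custom-all-reduce
```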
In my case, I found that after reverting from v0.4.3 to v0.4.2.post3, the crash has not occurred again. Could this be caused by some change introduced between these versions?
I updated NCCL from 2.21.5 to 2.25.1, and the crash has not occurred again.
I got the same issue. Has this been solved?
same issue!
I believe this is a problem in the NCCL library itself. I also had to force an upgrade from nccl-2.21.5 to nccl-2.25.1, and the problem was resolved. pip warns that the torch and nccl versions no longer match, but the model runs normally.
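For anyone trying the same workaround, a minimal sketch of forcing the pip-installed NCCL forward and checking what torch actually reports; the package name nvidia-nccl-cu12 assumes a CUDA 12 build of torch, and the exact 2.25.x release available on PyPI may differ:

```bash
# Upgrade the NCCL wheel that torch links against (CUDA 12 builds of torch).
# pip may warn that torch's pinned NCCL dependency no longer matches; per the
# comments above the server still runs.
pip install --upgrade "nvidia-nccl-cu12>=2.25"

# Confirm which NCCL version torch reports after the upgrade.
python -c "import torch; print(torch.cuda.nccl.version())"
```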