
[Bug] Frequent NCCL crash with SIGSEGV when deploying DeepSeek-V3

Open looput opened this issue 10 months ago • 20 comments

Checklist

  • [X] 1. I have searched related issues but cannot get the expected help.
  • [X] 2. The bug has not been fixed in the latest version.
  • [X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [X] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • [X] 5. Please use English, otherwise it will be closed.

Describe the bug

Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3)
==== backtrace (tid: 212877) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000049b8a ncclMemoryPoolAlloc<ncclProxyOp>()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/utils.h:280
 2 0x0000000000049b8a addProxyOpIfNeeded()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:180
 3 0x0000000000049b8a addProxyOpIfNeeded()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:176
4 0x000000000004c496 addCBDCollToPlan()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:481                                                                                                   
5 0x000000000004f5bd ncclLaunchPrepare()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:844                                                                                                  
6 0x000000000004f5bd ncclLaunchPrepare()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1260                                                                                                 
7 0x0000000000053d4b groupLaunch()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:129
8 0x0000000000053d4b groupLaunch()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:339
9 0x0000000000054f88 ncclGroupEndInternal()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:418                                                          
10 0x0000000000054f88 ncclGroupEndInternal()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:368                                                          
11 0x000000000004d74f ncclEnqueueCheck()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2032                                                                                                  
12 0x0000000000044b36 ncclAllGather()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:26                                                                                                   
13 0x00000000011fd1f3 c10d::ProcessGroupNCCL::_allgather_base()  ???:0                            
14 0x0000000005f8e9b8 c10d::ops::(anonymous namespace)::_allgather_base_CUDA()  Ops.cpp:0         
15 0x0000000005f985cc c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_defa
ult_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at:
:Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10:
:detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call()  :0                                                                                                 
16 0x00000000055b224b c10::OperatorHandle::redispatchBoxed()  :0                                                                                                                                            
17 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
18 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
19 0x0000000005f9fc2e c10::impl::BoxedKernelWrapper<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), void>::call()  :0                                                                
20 0x0000000005fabfe8 c10d::ProcessGroup::_allgather_base()  :0
21 0x0000000000df6c7e pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup
, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name con
st&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, at::Tensor&, at:
:Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybin
d11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybin
d11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name co
nst&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11
::detail::function_call&)#3}::_FUN()  :0      

22 0x00000000004cb474 pybind11::cpp_function::dispatcher()  :0                                                                                                                                              
23 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0                                           
24 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0                                                                                                                                                         
25 0x0000000000168acb PyMethod_New()  ???:0                                                                                                                                                                 
26 0x0000000000148cfa _PyEval_EvalFrameDefault()  ???:0                                                                                                                                                     
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0                                             
28 0x0000000000169492 PyObject_Call()  ???:0                                                          
29 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0                                                                                                                                                     
30 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0                                                                                                                                                       
31 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0                                               
32 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0                                                                                                                                                       
33 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0                                               
34 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0                                                                                                                                                       
35 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0                                                                                                                                                     
36 0x000000000016893e PyMethod_New()  ???:0                                                           
37 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
38 0x000000000016893e PyMethod_New()  ???:0
39 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
40 0x000000000014fc14 _PyObject_FastCallDictTstate()  ???:0
41 0x000000000016586c _PyObject_Call_Prepend()  ???:0
42 0x0000000000280700 PyInit__datetime()  ???:0
43 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0 
44 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
45 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
46 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
47 0x00000000001687f1 PyMethod_New()  ???:0
48 0x0000000000148cfa _PyEval_EvalFrameDefault()  ???:0
49 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
50 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0
51 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
52 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0
53 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
54 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0
55 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
56 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
=================================
[2025-01-08 11:17:51 TP7] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1578, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 410, in event_loop_overlap
    recv_reqs = self.recv_requests()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 459, in recv_requests
    recv_reqs = broadcast_pyobj(recv_reqs, self.tp_rank, self.tp_cpu_group)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils.py", line 731, in broadcast_pyobj
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [29.127.64.100]:26496

[2025-01-08 11:17:51 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1578, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 410, in event_loop_overlap
    recv_reqs = self.recv_requests()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 459, in recv_requests
    recv_reqs = broadcast_pyobj(recv_reqs, self.tp_rank, self.tp_cpu_group)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/utils.py", line 731, in broadcast_pyobj
    dist.broadcast(tensor_size, src=0, group=dist_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [29.127.64.100]:2711

Killed
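For context on why every TP rank then dies with a Gloo error: the traceback shows `broadcast_pyobj` in `sglang/srt/utils.py` first broadcasting the pickled payload size (`dist.broadcast(tensor_size, src=0, ...)`) and then the payload over the CPU (gloo) group. Once the rank that segfaulted inside NCCL disappears, its peers are blocked in that broadcast and see "Connection closed by peer". A minimal pure-Python sketch of that size-then-payload pattern (no torch; `channel` is a hypothetical in-memory stand-in for the process group):

```python
import pickle

def broadcast_pyobj(obj, rank, channel):
    """Sketch of a size-then-payload object broadcast.

    `channel` is a hypothetical stand-in for the CPU (gloo) process
    group; the real code calls dist.broadcast twice instead.
    """
    if rank == 0:
        payload = pickle.dumps(obj)
        channel["size"] = len(payload)  # step 1: broadcast the size
        channel["data"] = payload       # step 2: broadcast the bytes
        return obj
    # Non-zero ranks block on the size broadcast first; if rank 0 already
    # segfaulted, gloo raises "Connection closed by peer" right here.
    size = channel["size"]
    return pickle.loads(channel["data"][:size])

# Usage: rank 0 publishes the request batch, another rank receives it.
channel = {}
broadcast_pyobj({"req": 1}, 0, channel)
print(broadcast_pyobj(None, 1, channel))
```

So the Gloo `RuntimeError` is a symptom of the NCCL SIGSEGV above, not an independent failure.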

Reproduction

node 1: python -m sglang.launch_server --model-path DeepSeek-V3 --tp 16 --nccl-init 29.127.64.100:5000 --nnodes 2 --node-rank 0 --trust-remote-code --port 80 --host 0.0.0.0 --schedule-conservativeness 0.3 --context-length 32768

node 2: python -m sglang.launch_server --model-path DeepSeek-V3 --tp 16 --nccl-init 29.127.64.100:5000 --nnodes 2 --node-rank 1 --trust-remote-code --port 80 --host 0.0.0.0 --schedule-conservativeness 0.3 --context-length 32768
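When reproducing, it may help to enable NCCL and torch.distributed logging first, so the rank that fails first is visible in the logs. These are standard NCCL/PyTorch environment variables (suggested values only), not sglang flags:

```shell
# Verbose NCCL logging; INIT and COLL subsystems cover setup and collectives
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
# Extra c10d-level diagnostics from PyTorch
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# then rerun the sglang.launch_server commands above on both nodes
```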

Environment

/usr/local/lib/python3.10/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"                                                                                                                                                              
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'. Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:341: UserWarning: Valid config keys have changed in V2:                                                                                                          
* 'fields' has been removed
  warnings.warn(message, UserWarning)
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20                                                                                    
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0                                                                                                                                                                                            
CUDA_HOME: /usr/local/cuda                                                                                         
NVCC: Cuda compilation tools, release 12.4, V12.4.131                                                                                                                                                                                  
CUDA Driver Version: 535.161.08                                                                                    
PyTorch: 2.5.1+cu124                                                                                                                                                                                                                   
sglang: 0.4.1.post3                                                                                                
flashinfer: 0.1.6+cu124torch2.4                                                                                                                                                                                                        
triton: 3.1.0                                                                                                      
transformers: 4.47.1                                                                                                                                                                                                                   
torchao: 0.7.0                                                                                                     
numpy: 1.26.4                                                                                                                                                                                                                          
aiohttp: 3.9.5                                                                                                     
fastapi: 0.114.1                                                                                                                                                                                                                       
hf_transfer: 0.1.8                                                                                                 
huggingface_hub: 0.24.7                                                                                                                                                                                                                
interegular: 0.3.3                                                                                                 
modelscope: 1.21.1                                                                                                                                                                                                                     
orjson: 3.10.13                                                                                                    
packaging: 24.0                                                                                                                                                                                                                        
psutil: 5.9.8                                                                                                      
pydantic: 2.9.1
multipart: 0.0.20
zmq: 26.0.3
uvicorn: 0.30.6                                                                                                    
uvloop: 0.20.0                                                                                                                                                                                                                         
vllm: 0.6.4.post1
openai: 1.58.1
anthropic: 0.42.0
decord: 0.6.0

Legend:                                                                                                                                                                                                                                
                                                                                                                                                                                                                                       
  X    = Self                                                                                                                                                                                                                          
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)                                                                                                                                 
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node                                                                                                                           
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)                                                                                                                                                  
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)                                                                                                                                         
  PIX  = Connection traversing at most a single PCIe bridge                                                                                                                                                                            
  NV#  = Connection traversing a bonded set of # NVLinks                                                                                                                                                                               
                                                                                                                                                                                                                                       
NIC Legend:       

NVIDIA Topology:                                                                                                                                                                                                                                                                                                       
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   NIC12   NIC13   NIC14   NIC15   NIC16   NIC17   NIC18   NIC19   NIC20   NIC21   NIC22   NIC23   NIC24   NIC25   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PHB     PIX     SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PIX     PHB     SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     0-95,192-287    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    96-191,288-383  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    96-191,288-383  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     PHB     NODE    NODE    PIX     96-191,288-383  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX     NODE    NODE    PHB     96-191,288-383  1               N/A
NIC0    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE     X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC1    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC2    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC5    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC8    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC9    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC10   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC11   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC12   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC13   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC14   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC15   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC16   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      PIX     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC17   SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX     PIX      X      SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE
NIC18   PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC19   NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     SYS     SYS
NIC20   NODE    PHB     PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE     X      PHB     SYS     SYS     SYS     SYS
NIC21   NODE    PIX     PHB     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PHB      X      SYS     SYS     SYS     SYS
NIC22   SYS     SYS     SYS     SYS     NODE    NODE    PHB     PIX     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    PHB
NIC23   SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODENODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE
NIC24   SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODENODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE
NIC25   SYS     SYS     SYS     SYS     NODE    NODE    PIX     PHB     NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODE    NODENODE    SYS     SYS     SYS     SYS     PHB     NODE    NODE     X 
                                                                                                                                                                                                                                       
Legend:                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                       
  X    = Self                                                                                                                                                                                                                                                             
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)                                                                                                                                 
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node                                                                                                                                                              
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)                                                                                                                                                  
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)                                                                                                                                                                            
  PIX  = Connection traversing at most a single PCIe bridge                                                                                                                                                                            
  NV#  = Connection traversing a bonded set of # NVLinks                                                                                                                                                                                                                  
                                                                                                                                                                                                                                       
NIC Legend:                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                       
  NIC0: mlx5_0                                                                                                                                                                                                                                                            
  NIC1: mlx5_1                                                                                                                                                                                                                         
  NIC2: mlx5_2                                                                                                                                                                                                                                                            
  NIC3: mlx5_3                                                                                                                                                                                                                         
  NIC4: mlx5_4                                                                                                                                                                                                                                                            
  NIC5: mlx5_5                                                                                                                                                                                                                         
  NIC6: mlx5_6                                                                                                                                                                                                                                                            
  NIC7: mlx5_7                                                                                                                                                                                                                         
  NIC8: mlx5_8                                                                                                                                                                                                                                                            
  NIC9: mlx5_9                                                                                                                                                                                                                         
  NIC10: mlx5_10                                                                                                                                                                                                                                                                                                      
  NIC11: mlx5_11                                                                                                   
  NIC12: mlx5_12                                                                                                                                                                                                                       
  NIC13: mlx5_13                                                                                                   
  NIC14: mlx5_14                                                                                                                                                                                                                       
  NIC15: mlx5_16                                                                                                   
  NIC16: mlx5_17                                                                                                                                                                                                                       
  NIC17: mlx5_18                                                                                                   
  NIC18: mlx5_bond_1                                                                                                                                                                                                                   
  NIC19: mlx5_bond_2                                                                                               
  NIC20: mlx5_bond_3                                                                                                                                                                                                                   
  NIC21: mlx5_bond_4                                                                                               
  NIC22: mlx5_bond_5                                                                                                                                                                                                                   
  NIC23: mlx5_bond_6                                                                                               
  NIC24: mlx5_bond_7
  NIC25: mlx5_bond_8
                                                                                                                   
ulimit soft: 1024
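A soft open-file limit of 1024 is quite low for a node with this many NICs running multi-process NCCL jobs. As a hedged diagnostic sketch (not a confirmed fix from this thread; the limit value is an example, though `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables), one can raise the limit and enable NCCL's own logging before relaunching:

```shell
# Diagnostic sketch, assumptions noted above: raise the per-shell fd limit
# and turn on NCCL's init/network logging to narrow down crashes like this.
ulimit -n 4096                      # example value; pick what your workload needs
export NCCL_DEBUG=INFO              # verbose NCCL init/teardown logs
export NCCL_DEBUG_SUBSYS=INIT,NET   # focus logging on init and network paths
echo "soft fd limit now: $(ulimit -Sn)"
```

Re-running the server with these settings usually makes NCCL print which transports and NICs it selected, which helps correlate a SIGSEGV with a specific interface or topology decision.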

looput avatar Jan 09 '25 02:01 looput

I think this is due to a local NCCL error on your side, since no one has reported this before.

zhaochenyang20 avatar Jan 21 '25 22:01 zhaochenyang20

I think this is due to a local NCCL error on your side, since no one has reported this before.

I also encountered the same problem.

sitabulaixizawaluduo avatar Jan 22 '25 10:01 sitabulaixizawaluduo

The same problem occurs on 16×H100, where NCCL core-dumps during allreduce.

CSEEduanyu avatar Jan 23 '25 02:01 CSEEduanyu
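Since several setups hit this during allreduce, a minimal smoke test outside sglang can help decide whether the crash is sglang-specific or a general NCCL/torch problem. The sketch below writes a small `torch.distributed` script (the file name and tensor size are arbitrary; it assumes torch and torchrun are installed on the affected nodes) and prints a launch command rather than running it:

```shell
# Hedged sketch: generate a standalone allreduce test that exercises the same
# NCCL allreduce path seen in the backtraces. Nothing GPU-related runs here;
# the script is meant to be launched separately with torchrun on the nodes.
cat > allreduce_smoke.py <<'EOF'
import os
import torch
import torch.distributed as dist

# torchrun sets RANK/LOCAL_RANK/WORLD_SIZE; backend "nccl" matches the
# ProcessGroupNCCL::allreduce path that segfaults in this issue.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
x = torch.ones(1 << 20, device="cuda")
for _ in range(100):
    dist.all_reduce(x)
    x /= dist.get_world_size()
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: allreduce ok")
dist.destroy_process_group()
EOF
echo "wrote allreduce_smoke.py; launch with e.g.:"
echo "torchrun --nproc_per_node=8 allreduce_smoke.py"
```

If this loop also crashes, the problem is in the NCCL/torch/driver stack rather than in sglang; if it runs clean, the sglang collective path is the next suspect.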

Thanks for pointing this out. @sitabulaixizawaluduo @CSEEduanyu

cc @zhyncs

zhaochenyang20 avatar Jan 23 '25 08:01 zhaochenyang20

Core was generated by `sglang::scheduler '. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00007f673afbf86a in c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#2}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, 
c10d::ProcessGroupNCCL::allredu--Type <RET> for more, q to quit, c to continue without paging-- ce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#2}, c10d::OpType, char const*, bool) [clone .constprop.0] () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so [Current thread is 1 (Thread 0x7f52f1fff640 (LWP 164587))] (gdb) (gdb) bt #0 0x00007f673afbf86a in c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, 
c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#2}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, 
c10::detail::intrusive_target_default_null_typec10d::ProcessGroupNCCL::WorkNCCL >&)#2}, c10d::OpType, char const*, bool) [clone .constprop.0] () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so #1 0x00007f673afc05e0 in c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so #2 0x00007f673afc0d05 in c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so #3 0x00007f677466328e in c10d::ops::(anonymous namespace)::allreduce_CUDA(c10::ArrayRefat::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_typec10d::ReduceOp > const&, std::optionalat::Tensor const&, long) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so #4 0x00007f6774666609 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocatorat::Tensor >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > > ()(c10::ArrayRefat::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_typec10d::ReduceOp > const&, std::optionalat::Tensor const&, long), std::tuple<std::vector<at::Tensor, std::allocatorat::Tensor >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > >, c10::guts::typelist::typelist<c10::ArrayRefat::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, c10::intrusive_ptr<c10d::ReduceOp, 
c10::detail::intrusive_target_default_null_typec10d::ReduceOp > const&, std::optionalat::Tensor const&, long> >, std::tuple<std::vector<at::Tensor, std::allocatorat::Tensor >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > > (c10::ArrayRefat::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_typec10d::ReduceOp > const&, std::optionalat::Tensor const&, long)>::call(c10::OperatorKernel, c10::DispatchKeySet, c10::ArrayRefat::Tensor, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_typec10d::ProcessGroup > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_typec10d::ReduceOp > const&, std::optionalat::Tensor const&, long) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so #5 0x00007f67746823ef in c10d::ProcessGroup::allreduce(std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so #6 0x00007f6787990635 in pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guardpybind11::gil_scoped_release >(c10:--Type <RET> for more, q to quit, c to continue without paging-- :intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > (c10d::ProcessGroup::)(std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, 
pybind11::call_guardpybind11::gil_scoped_release const&)::{lambda(c10d::ProcessGroup, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guardpybind11::gil_scoped_release >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guardpybind11::gil_scoped_release >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > (c10d::ProcessGroup::)(std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guardpybind11::gil_scoped_release const&)::{lambda(c10d::ProcessGroup, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_typec10d::Work > ()(c10d::ProcessGroup, std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guardpybind11::gil_scoped_release const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #7 0x00007f678708d0e4 in pybind11::cpp_function::dispatcher(_object*, _object*, 
_object*) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #8 0x0000557dff248282 in ?? () #9 0x0000557dff23eb4b in _PyObject_MakeTpCall () #10 0x0000557dff255ebb in ?? () #11 0x0000557dff237b7a in _PyEval_EvalFrameDefault () #12 0x0000557dff248aec in _PyFunction_Vectorcall () #13 0x0000557dff256882 in PyObject_Call () #14 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #15 0x0000557dff248aec in _PyFunction_Vectorcall () #16 0x0000557dff233cf2 in _PyEval_EvalFrameDefault () #17 0x0000557dff248aec in _PyFunction_Vectorcall () #18 0x0000557dff232ae8 in _PyEval_EvalFrameDefault () #19 0x0000557dff248aec in _PyFunction_Vectorcall () #20 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #21 0x0000557dff248aec in _PyFunction_Vectorcall () #22 0x00007f678756ba20 in pybind11::object pybind11::detail::object_apipybind11::handle::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>(pybind11::detail::args_proxy&&, pybind11::detail::kwargs_proxy&&) const () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #23 0x00007f678789a21d in torch::impl::dispatch::PythonKernelHolder::operator()(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocatorc10::IValue >) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #24 0x00007f6787892720 in pybind11::object pybind11::detail::argument_loader<pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&>::call_impl<pybind11::object, torch::impl::dispatch::initDispatchBindings(_object)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}&, 0ul, 1ul, 2ul, 3ul, pybind11::detail::void_type>(torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, 
pybind11::detail::void_type&&) && [clone .isra.0] () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #25 0x00007f6787892cf0 in pybind11::cpp_function::initialize<torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}, pybind11::object, pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&, pybind11::name, pybind11::is_method, pybind11::sibling>(torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}&&, pybind11::object ()(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11--Type <RET> for more, q to quit, c to continue without paging-- ::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #26 0x00007f678708d0e4 in pybind11::cpp_function::dispatcher(_object, _object*, _object*) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #27 0x0000557dff248282 in ?? () #28 0x0000557dff23eb4b in _PyObject_MakeTpCall () #29 0x0000557dff256010 in ?? () #30 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #31 0x0000557dff255d2e in ?? 
() #32 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #33 0x0000557dff248aec in _PyFunction_Vectorcall () #34 0x00007f6787899935 in pybind11::object pybind11::detail::object_apipybind11::handle::operator()<(pybind11::return_value_policy)1, c10::DispatchKeySet&, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>(c10::DispatchKeySet&, pybind11::detail::args_proxy&&, pybind11::detail::kwargs_proxy&&) const () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #35 0x00007f678789a181 in torch::impl::dispatch::PythonKernelHolder::operator()(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocatorc10::IValue >) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #36 0x00007f6787892720 in pybind11::object pybind11::detail::argument_loader<pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&>::call_impl<pybind11::object, torch::impl::dispatch::initDispatchBindings(_object)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}&, 0ul, 1ul, 2ul, 3ul, pybind11::detail::void_type>(torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}&, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul>, pybind11::detail::void_type&&) && [clone .isra.0] () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #37 0x00007f6787892cf0 in pybind11::cpp_function::initialize<torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&)#1}, pybind11::object, pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&, pybind11::name, pybind11::is_method, pybind11::sibling>(torch::impl::dispatch::initDispatchBindings(_object*)::{lambda(pybind11::object const&, c10::DispatchKeySet, pybind11::args, 
pybind11::kwargs const&)#1}&&, pybind11::object ()(pybind11::object const&, c10::DispatchKeySet, pybind11::args, pybind11::kwargs const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #38 0x00007f678708d0e4 in pybind11::cpp_function::dispatcher(_object, _object*, _object*) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #39 0x0000557dff248282 in ?? () #40 0x0000557dff23eb4b in _PyObject_MakeTpCall () #41 0x0000557dff256010 in ?? () #42 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #43 0x0000557dff255d2e in ?? () #44 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #45 0x0000557dff248aec in _PyFunction_Vectorcall () #46 0x00007f67874534f9 in THPFunction_apply(_object*, _object*) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #47 0x0000557dff2482a8 in ?? () #48 0x0000557dff25681b in PyObject_Call () #49 0x0000557dff238eed in _PyEval_EvalFrameDefault () #50 0x0000557dff255d2e in ?? 
() #51 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #52 0x0000557dff248aec in _PyFunction_Vectorcall () #53 0x00007f6787899935 in pybind11::object pybind11::detail::object_apipybind11::handle::operator()<(pybind11::return_value_policy)1, c10::DispatchKeySet&, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>(c10::DispatchKeySet&, pybind11::detail::args_proxy&&, pybind11::detail::kwargs_proxy&&) const () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #54 0x00007f678789a181 in torch::impl::dispatch::PythonKernelHolder::operator()(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocatorc10::IValue >) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #55 0x00007f67878a4d98 in c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocatorc10::IValue >) const () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #56 0x00007f67876328a1 in torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, pybind11::args, pybind11::kwargs const&, std::optionalc10::DispatchKey) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #57 0x00007f6787632bf9 in torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptrtorch::jit::Operator, std::allocator<std::shared_ptrtorch::jit::Operator > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, std::optionalc10::DispatchKey) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #58 0x00007f6787513e13 in pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::basic_string<char, std::char_traits, std::allocator > const&)#216}::operator()(std::basic_string<char, std::char_traits, std::allocator > const&) 
const::{lambda(pybind11::args, pybind11::kwargs)#1}, pybind11::object, pybind11::args, pybind11::kwargs, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::basic_string<char, std::char_traits, std::allocator > const&)#216}::operator()(std::basic_string<char, std::char_traits, std::allocator > const&) const::{lambda(pybind11::args, pybind11::kwargs)#1}&&, pybind11::object ()(pybind11::args, pybind11::kwargs), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #59 0x00007f678708d0e4 in pybind11::cpp_function::dispatcher(_object, _object*, _object*) () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so #60 0x0000557dff248282 in ?? () #61 0x0000557dff25681b in PyObject_Call () #62 0x0000557dff238eed in _PyEval_EvalFrameDefault () #63 0x0000557dff248aec in _PyFunction_Vectorcall () #64 0x0000557dff23ddbd in _PyObject_FastCallDictTstate () #65 0x0000557dff252d4c in _PyObject_Call_Prepend () #66 0x0000557dff35b054 in ?? () #67 0x0000557dff23eb4b in _PyObject_MakeTpCall () #68 0x0000557dff2389ea in _PyEval_EvalFrameDefault () #69 0x0000557dff248aec in _PyFunction_Vectorcall () #70 0x0000557dff232ae8 in _PyEval_EvalFrameDefault () #71 0x0000557dff248aec in _PyFunction_Vectorcall () #72 0x0000557dff2329a2 in _PyEval_EvalFrameDefault () #73 0x0000557dff255d2e in ?? () #74 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #75 0x0000557dff255d2e in ?? () #76 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #77 0x0000557dff23dd14 in _PyObject_FastCallDictTstate () #78 0x0000557dff252d4c in _PyObject_Call_Prepend () #79 0x0000557dff35b054 in ?? () #80 0x0000557dff23eb4b in _PyObject_MakeTpCall () #81 0x0000557dff2384c8 in _PyEval_EvalFrameDefault () #82 0x0000557dff255be1 in ?? () #83 0x0000557dff237b7a in _PyEval_EvalFrameDefault () #84 0x0000557dff255be1 in ?? 
() #85 0x0000557dff256882 in PyObject_Call () #86 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #87 0x0000557dff255be1 in ?? () #88 0x0000557dff256882 in PyObject_Call () #89 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #90 0x0000557dff248aec in _PyFunction_Vectorcall () #91 0x0000557dff23ddbd in _PyObject_FastCallDictTstate () #92 0x0000557dff252d4c in _PyObject_Call_Prepend () #93 0x0000557dff35b054 in ?? () #94 0x0000557dff23eb4b in _PyObject_MakeTpCall () #95 0x0000557dff2389ea in _PyEval_EvalFrameDefault () #96 0x0000557dff255d2e in ?? () #97 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #98 0x0000557dff255d2e in ?? () #99 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #100 0x0000557dff23dd14 in _PyObject_FastCallDictTstate () #101 0x0000557dff252d4c in _PyObject_Call_Prepend () #102 0x0000557dff35b054 in ?? () #103 0x0000557dff23eb4b in _PyObject_MakeTpCall () #104 0x0000557dff237f4d in _PyEval_EvalFrameDefault () #105 0x0000557dff255d2e in ?? () #106 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #107 0x0000557dff255d2e in ?? () #108 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #109 0x0000557dff23dd14 in _PyObject_FastCallDictTstate () #110 0x0000557dff252d4c in _PyObject_Call_Prepend () #111 0x0000557dff35b054 in ?? () #112 0x0000557dff23eb4b in _PyObject_MakeTpCall () #113 0x0000557dff2384c8 in _PyEval_EvalFrameDefault () #114 0x0000557dff248aec in _PyFunction_Vectorcall () #115 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #116 0x0000557dff255be1 in ?? 
() #117 0x0000557dff237b7a in _PyEval_EvalFrameDefault () #118 0x0000557dff248aec in _PyFunction_Vectorcall () #119 0x0000557dff232ae8 in _PyEval_EvalFrameDefault () #120 0x0000557dff248aec in _PyFunction_Vectorcall () #121 0x0000557dff232ae8 in _PyEval_EvalFrameDefault () #122 0x0000557dff248aec in _PyFunction_Vectorcall () #123 0x0000557dff232ae8 in _PyEval_EvalFrameDefault () #124 0x0000557dff248aec in _PyFunction_Vectorcall () #125 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #126 0x0000557dff248aec in _PyFunction_Vectorcall () #127 0x0000557dff232ae8 in _PyEval_EvalFrameDefault () #128 0x0000557dff255e41 in ?? () #129 0x0000557dff234f59 in _PyEval_EvalFrameDefault () #130 0x0000557dff248aec in _PyFunction_Vectorcall () #131 0x0000557dff232ae8 in _PyEval_EvalFrameDefault () #132 0x0000557dff248aec in _PyFunction_Vectorcall () #133 0x0000557dff232ae8 in _PyEval_EvalFrameDefault () #134 0x0000557dff255e41 in ?? () #135 0x0000557dff36ae1a in ?? () #136 0x0000557dff360d98 in ?? () #137 0x00007f682e6cfac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #138 0x00007f682e761850 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

CSEEduanyu avatar Jan 26 '25 02:01 CSEEduanyu

Can you also output all of your NCCL environment variables? And have you run an NCCL benchmark?

slin1237 avatar Jan 29 '25 01:01 slin1237
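
For reference, a minimal way to gather what is being asked for above: dump the NCCL-related environment and run the standard NCCL benchmark from nccl-tests (the benchmark path and GPU count below are placeholders for this setup, not values from the thread):

```shell
# Print every NCCL-related environment variable visible to the server process
env | grep -E '^NCCL_' || echo "no NCCL_* variables set"

# Standard NCCL all-reduce bandwidth/correctness benchmark from nccl-tests
# (placeholder path and GPU count; adjust -g to the node's GPU count)
# ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8
```

Running the benchmark on the same nodes that crash helps separate an NCCL/fabric problem from an sglang-level one.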

Encountered several similar segmentation fault errors after running sglang for DeepSeek-R1 inference for a while.

Error 1:

[node-1:4306 :0:10324] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1000000a1)
==== backtrace (tid:  10324) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000494f4 uploadProxyOps()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1131
 2 0x0000000000051a7f hostStreamPlanTask()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1163
 3 0x0000000000051bd9 hostStreamPlanCallback()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1175
 4 0x000000000025720d cuEGLApiInit()  ???:0
 5 0x000000000026cf43 cuEGLApiInit()  ???:0
 6 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
 7 0x0000000000126850 __xmknodat()  ???:0
=================================
Fatal Python error: Segmentation fault

Thread 0x00007f79fafc5640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f6a497fe640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 461 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f6a20fff640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 512 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 757 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 819 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 858 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 750 in forward_extend
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 785 in forward
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f79fb7c6640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f7c00c7d640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 47 in _recv_msg
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 153 in _read_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f833ea89740 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 320 in wait
  File "/usr/lib/python3.10/queue.py", line 171 in get
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 165 in resolve_batch_result
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1133 in process_batch_result_prefill
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1105 in process_batch_result
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 518 in event_loop_overlap
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1782 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Error 2:

[node-1:5377 :0:10263] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd11b7734a0)
==== backtrace (tid:  10263) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000049b9e ncclMemoryPoolAlloc<ncclProxyOp>()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/utils.h:289
 2 0x0000000000049b9e addProxyOpIfNeeded()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:180
 3 0x0000000000049b9e addProxyOpIfNeeded()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:176
 4 0x000000000004c496 addCBDCollToPlan()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:481
 5 0x000000000004f5bd ncclLaunchPrepare()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:844
 6 0x000000000004f5bd ncclLaunchPrepare()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1260
 7 0x0000000000053d4b groupLaunch()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:129
 8 0x0000000000053d4b groupLaunch()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:339
 9 0x0000000000054f88 ncclGroupEndInternal()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:418
10 0x0000000000054f88 ncclGroupEndInternal()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:368
11 0x000000000004d74f ncclEnqueueCheck()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2032
12 0x00000000000452af ncclAllReduce()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:50
13 0x00000000011e06ef c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}>()  ProcessGroupNCCL.cpp:0
14 0x00000000011e18ac c10d::ProcessGroupNCCL::allreduce_impl()  ???:0
15 0x00000000011e21a5 c10d::ProcessGroupNCCL::allreduce()  ???:0
16 0x0000000005f8f68e c10d::ops::(anonymous namespace)::allreduce_CUDA()  Ops.cpp:0
17 0x0000000005f9a1d4 c10::impl::call_functor_with_args_from_stack_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false, 0ul, 1ul, 2ul, 3ul, 4ul, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long>()  :0
18 0x0000000005f9b389 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false>::call()  :0
19 0x00000000055b224b c10::OperatorHandle::redispatchBoxed()  :0
20 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
21 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
22 0x0000000005fa0a35 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), void>::call()  :0
23 0x0000000005fae9bd c10d::ProcessGroup::allreduce()  :0
24 0x0000000000df9dc5 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, 
pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  :0
25 0x00000000004cb474 pybind11::cpp_function::dispatcher()  :0
26 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
27 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
28 0x0000000000168acb PyMethod_New()  ???:0
29 0x0000000000148cfa _PyEval_EvalFrameDefault()  ???:0
30 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
31 0x0000000000169492 PyObject_Call()  ???:0
32 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
33 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
34 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
35 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
36 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0
37 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
38 0x00000000009cabc0 pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>()  :0
39 0x0000000000cf4999 torch::impl::dispatch::PythonKernelHolder::operator()()  :0
40 0x00000000055b224b c10::OperatorHandle::redispatchBoxed()  :0
41 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
42 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
43 0x0000000000cff728 c10::Dispatcher::callBoxed()  ???:0
44 0x0000000000a8e136 torch::jit::invokeOperatorFromPython()  ???:0
45 0x0000000000a8e447 torch::jit::_get_operation_for_overload_or_packet()  ???:0
46 0x0000000000976c22 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
47 0x00000000004cb474 pybind11::cpp_function::dispatcher()  :0
48 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
49 0x000000000016942b PyObject_Call()  ???:0
50 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
51 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
52 0x000000000014fcbd _PyObject_FastCallDictTstate()  ???:0
53 0x000000000016586c _PyObject_Call_Prepend()  ???:0
54 0x0000000000280700 PyInit__datetime()  ???:0
55 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
56 0x000000000014a150 _PyEval_EvalFrameDefault()  ???:0
=================================
Fatal Python error: Segmentation fault

Thread 0x00007fb607ffd640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fb60fffe640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 461 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007fb617fff640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2501 in all_reduce
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83 in wrapper
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 414 in _all_reduce_in_place
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 112 in inplace_all_reduce
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116 in __call__
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 398 in all_reduce
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/communication_op.py", line 13 in tensor_model_parallel_all_reduce
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 183 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 774 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 819 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 858 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 750 in forward_extend
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 785 in forward
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fbcb3fff640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fca8ffff640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 47 in _recv_msg
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 153 in _read_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fd398d50480 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 320 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 167 in resolve_batch_result
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1225 in process_batch_result_decode
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1101 in process_batch_result
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 518 in event_loop_overlap
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1782 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

xwjabc avatar Feb 01 '25 00:02 xwjabc
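
When reproducing crashes like the two above, it can help to rerun with NCCL debug logging and blocking collectives enabled, so the failing all_reduce is reported with context instead of a raw SIGSEGV deep inside the enqueue path. A minimal sketch (variable names as documented by NCCL and recent PyTorch releases; the launch command is a placeholder):

```shell
# Verbose NCCL logs: topology, transport selection, and collective setup
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL

# Surface NCCL collective failures synchronously in PyTorch
# (older releases use NCCL_BLOCKING_WAIT instead)
export TORCH_NCCL_BLOCKING_WAIT=1

# Placeholder launch command for this deployment
# python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 ...
```

This does not fix the crash, but the resulting logs usually identify which communicator and which collective was in flight when the fault occurred.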

  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 858 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 750 in forward_extend
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 785 in forward
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f79fb7c6640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f7c00c7d640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 47 in _recv_msg
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 153 in _read_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f833ea89740 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 320 in wait
  File "/usr/lib/python3.10/queue.py", line 171 in get
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 165 in resolve_batch_result
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1133 in process_batch_result_prefill
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1105 in process_batch_result
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 518 in event_loop_overlap
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1782 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Error 2:

[node-1:5377 :0:10263] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fd11b7734a0)
==== backtrace (tid:  10263) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000049b9e ncclMemoryPoolAlloc<ncclProxyOp>()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/include/utils.h:289
 2 0x0000000000049b9e addProxyOpIfNeeded()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:180
 3 0x0000000000049b9e addProxyOpIfNeeded()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:176
 4 0x000000000004c496 addCBDCollToPlan()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:481
 5 0x000000000004f5bd ncclLaunchPrepare()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:844
 6 0x000000000004f5bd ncclLaunchPrepare()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1260
 7 0x0000000000053d4b groupLaunch()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:129
 8 0x0000000000053d4b groupLaunch()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:339
 9 0x0000000000054f88 ncclGroupEndInternal()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:418
10 0x0000000000054f88 ncclGroupEndInternal()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:368
11 0x000000000004d74f ncclEnqueueCheck()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2032
12 0x00000000000452af ncclAllReduce()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:50
13 0x00000000011e06ef c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}>()  ProcessGroupNCCL.cpp:0
14 0x00000000011e18ac c10d::ProcessGroupNCCL::allreduce_impl()  ???:0
15 0x00000000011e21a5 c10d::ProcessGroupNCCL::allreduce()  ???:0
16 0x0000000005f8f68e c10d::ops::(anonymous namespace)::allreduce_CUDA()  Ops.cpp:0
17 0x0000000005f9a1d4 c10::impl::call_functor_with_args_from_stack_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false, 0ul, 1ul, 2ul, 3ul, 4ul, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long>()  :0
18 0x0000000005f9b389 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false>::call()  :0
19 0x00000000055b224b c10::OperatorHandle::redispatchBoxed()  :0
20 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
21 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
22 0x0000000005fa0a35 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), void>::call()  :0
23 0x0000000005fae9bd c10d::ProcessGroup::allreduce()  :0
24 0x0000000000df9dc5 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, 
pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  :0
25 0x00000000004cb474 pybind11::cpp_function::dispatcher()  :0
26 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
27 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
28 0x0000000000168acb PyMethod_New()  ???:0
29 0x0000000000148cfa _PyEval_EvalFrameDefault()  ???:0
30 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
31 0x0000000000169492 PyObject_Call()  ???:0
32 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
33 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
34 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
35 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
36 0x000000000014345c _PyEval_EvalFrameDefault()  ???:0
37 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
38 0x00000000009cabc0 pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>()  :0
39 0x0000000000cf4999 torch::impl::dispatch::PythonKernelHolder::operator()()  :0
40 0x00000000055b224b c10::OperatorHandle::redispatchBoxed()  :0
41 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
42 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
43 0x0000000000cff728 c10::Dispatcher::callBoxed()  ???:0
44 0x0000000000a8e136 torch::jit::invokeOperatorFromPython()  ???:0
45 0x0000000000a8e447 torch::jit::_get_operation_for_overload_or_packet()  ???:0
46 0x0000000000976c22 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
47 0x00000000004cb474 pybind11::cpp_function::dispatcher()  :0
48 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
49 0x000000000016942b PyObject_Call()  ???:0
50 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
51 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
52 0x000000000014fcbd _PyObject_FastCallDictTstate()  ???:0
53 0x000000000016586c _PyObject_Call_Prepend()  ???:0
54 0x0000000000280700 PyInit__datetime()  ???:0
55 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
56 0x000000000014a150 _PyEval_EvalFrameDefault()  ???:0
=================================
Fatal Python error: Segmentation fault

Thread 0x00007fb607ffd640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fb60fffe640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 461 in watchdog_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007fb617fff640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2501 in all_reduce
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83 in wrapper
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 414 in _all_reduce_in_place
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 112 in inplace_all_reduce
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116 in __call__
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/parallel_state.py", line 398 in all_reduce
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/distributed/communication_op.py", line 13 in tensor_model_parallel_all_reduce
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 183 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 774 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 819 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/deepseek_v2.py", line 858 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 750 in forward_extend
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 785 in forward
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fbcb3fff640 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fca8ffff640 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 47 in _recv_msg
  File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 153 in _read_thread
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fd398d50480 (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 320 in wait
  File "/usr/lib/python3.10/threading.py", line 607 in wait
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 167 in resolve_batch_result
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1225 in process_batch_result_decode
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1101 in process_batch_result
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 518 in event_loop_overlap
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1782 in run_scheduler_process
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
  File "<string>", line 1 in <module>

Have you found the specific cause? I deployed multiple DeepSeek-V3 instances; for the same request, some nodes hit this problem while others computed normally.

sitabulaixizawaluduo avatar Feb 06 '25 07:02 sitabulaixizawaluduo

Unfortunately, I have not found the specific cause. My current workaround is to restart the instance whenever this error occurs. Sometimes that works.

xwjabc avatar Feb 06 '25 07:02 xwjabc
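As a sketch of this restart workaround, a small supervision loop can relaunch the server on crash while bounding how many restarts are allowed. This is only an illustration; the commented launch command at the bottom is a hypothetical example, not the exact command used in this deployment.

```shell
#!/usr/bin/env bash
# Relaunch a command whenever it exits nonzero (e.g. after a SIGSEGV),
# giving up after max_restarts attempts.
supervise() {
  max_restarts=$1; shift
  attempt=0
  until "$@"; do
    status=$?
    attempt=$((attempt + 1))
    echo "server exited with status ${status} (restart ${attempt}/${max_restarts})" >&2
    if [ "$attempt" -ge "$max_restarts" ]; then
      echo "giving up after ${max_restarts} restarts" >&2
      return 1
    fi
    sleep 1  # brief backoff before relaunching
  done
}

# Hypothetical usage:
# supervise 10 python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16
```

A process supervisor (systemd `Restart=on-failure`, Kubernetes restart policies) achieves the same thing more robustly; the loop above is just the minimal form of the idea.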

> Unfortunately, I have not found the specific cause. My current workaround is to restart the instance whenever this error occurs. Sometimes that works.

Same workaround here, but some instances restart too often, which degrades the service.

sitabulaixizawaluduo avatar Feb 06 '25 07:02 sitabulaixizawaluduo

Same issue. Has anyone figured it out?

robot10235 avatar Feb 07 '25 08:02 robot10235

Thanks so much for notifying us. This is really urgent, and we are working on it.

zhaochenyang20 avatar Feb 07 '25 17:02 zhaochenyang20

Similar problem, on 16xH800

2025-02-08 23:45:46 jo-dardhmricga77dqo-worker-1

Current thread 0x00007faad27fc700 (most recent call first):
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3435 in all_gather_into_tensor
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 83 in wrapper
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 649 in all_gather
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 768 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 819 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747 in _call_impl
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736 in _wrapped_call_impl
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 858 in forward
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 770 in forward_idle
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 787 in forward
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164 in forward_batch_generation
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140 in forward_thread_func_
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116 in decorate_context
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109 in forward_thread_func
  File "/usr/lib/python3.10/threading.py", line 953 in run
  File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

The same problem happens when using vLLM, and disabling CUDA graph doesn't help.

Will try to reproduce it with NCCL debug flags on.

UranusSeven avatar Feb 08 '25 15:02 UranusSeven
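To make the next crash more informative, the standard NCCL debug environment variables can be set before launching the server. A minimal sketch (the log path is an example; NCCL expands `%h` to the hostname and `%p` to the pid, so each rank writes its own file):

```shell
# Capture NCCL internals around the enqueue/proxy code paths seen in the backtraces.
export NCCL_DEBUG=INFO                           # or TRACE for maximum verbosity
export NCCL_DEBUG_SUBSYS=INIT,COLL,NET,PROXY     # restrict output to relevant subsystems
export NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log       # per-host, per-process log files

# Dump Python thread stacks on fatal signals (same mechanism that produced
# the "Fatal Python error: Segmentation fault" tracebacks above).
export PYTHONFAULTHANDLER=1
```

With these set, the NCCL log preceding the SIGSEGV can be correlated with the `uploadProxyOps` / `ncclMemoryPoolAlloc` frames in the backtrace.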

Similar problem, on 8xH20.

[screenshot attached]

154912369 avatar Feb 12 '25 03:02 154912369

Does anyone have ideas for solving this? I encountered the same problem on H100.

 [node11:265  :0:4597] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x52b00e)
==== backtrace (tid:   4597) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000004f03d ncclLaunchPrepare()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:574
 2 0x000000000004f03d ncclLaunchPrepare()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:1275
 3 0x0000000000053d4b groupLaunch()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:129
 4 0x0000000000053d4b groupLaunch()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:339
 5 0x0000000000054f88 ncclGroupEndInternal()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:418
 6 0x0000000000054f88 ncclGroupEndInternal()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/group.cc:368
 7 0x000000000004d74f ncclEnqueueCheck()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/enqueue.cc:2032
 8 0x00000000000452af ncclAllReduce()  /dvs/p4/build/sw/gpgpu/nccl/gitfusion/stable/src/collectives.cc:50
 9 0x00000000011e06ef c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#1}, c10d::ProcessGroupNCCL::collective<c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}>(at::Tensor&, at::Tensor&, c10d::ProcessGroupNCCL::allreduce_impl(at::Tensor&, c10d::AllreduceOptions const&)::{lambda(at::Tensor&, at::Tensor&, ncclComm*, c10::cuda::CUDAStream&)#1}, c10d::OpType, char const*, bool, bool)::{lambda(c10::cuda::CUDAStream&, c10::intrusive_ptr<c10d::ProcessGroupNCCL::WorkNCCL, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupNCCL::WorkNCCL> >&)#2}>()  ProcessGroupNCCL.cpp:0
10 0x00000000011e18ac c10d::ProcessGroupNCCL::allreduce_impl()  ???:0
11 0x00000000011e21a5 c10d::ProcessGroupNCCL::allreduce()  ???:0
12 0x0000000005f8f68e c10d::ops::(anonymous namespace)::allreduce_CUDA()  Ops.cpp:0
13 0x0000000005f9a1d4 c10::impl::call_functor_with_args_from_stack_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false, 0ul, 1ul, 2ul, 3ul, 4ul, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long>()  :0
14 0x0000000005f9b389 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long> >, false>::call()  :0
15 0x00000000055b224b c10::OperatorHandle::redispatchBoxed()  :0
16 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
17 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
18 0x0000000005fa0a35 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, std::optional<at::Tensor> const&, long), void>::call()  :0
19 0x0000000005fae9bd c10d::ProcessGroup::allreduce()  :0
20 0x0000000000df9dc5 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, 
pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  :0
21 0x00000000004cb474 pybind11::cpp_function::dispatcher()  :0
22 0x000000000018ab32 PyObject_CallFunctionObjArgs()  ???:0
23 0x000000000018139b _PyObject_MakeTpCall()  ???:0
24 0x00000000001987ab PyMethod_New()  ???:0
25 0x000000000017a702 _PyEval_EvalFrameDefault()  ???:0
26 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
27 0x0000000000199172 PyObject_Call()  ???:0
28 0x0000000000177c30 _PyEval_EvalFrameDefault()  ???:0
29 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
30 0x00000000001769ab _PyEval_EvalFrameDefault()  ???:0
31 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
32 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
33 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
34 0x00000000009cabc0 pybind11::detail::object_api<pybind11::handle>::operator()<(pybind11::return_value_policy)1, pybind11::detail::args_proxy, pybind11::detail::kwargs_proxy>()  :0
35 0x0000000000cf4999 torch::impl::dispatch::PythonKernelHolder::operator()()  :0
36 0x00000000055b224b c10::OperatorHandle::redispatchBoxed()  :0
37 0x00000000055afad9 torch::autograd::basicAutogradNotImplementedFallbackImpl()  autograd_not_implemented_fallback.cpp:0
38 0x0000000001a8c3f8 c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>()  VariableFallbackKernel.cpp:0
39 0x0000000000cff728 c10::Dispatcher::callBoxed()  ???:0
40 0x0000000000a8e136 torch::jit::invokeOperatorFromPython()  ???:0
41 0x0000000000a8e447 torch::jit::_get_operation_for_overload_or_packet()  ???:0
42 0x0000000000976c22 pybind11::cpp_function::initialize<torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}, pybind11::object, pybind11::args const&, pybind11::kwargs const&, pybind11::name, pybind11::doc>(torch::jit::initJITBindings(_object*)::{lambda(std::string const&)#217}::operator()(std::string const&) const::{lambda(pybind11::args const&, pybind11::kwargs const&)#1}&&, pybind11::object (*)(pybind11::args const&, pybind11::kwargs const&), pybind11::name const&, pybind11::doc const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
43 0x00000000004cb474 pybind11::cpp_function::dispatcher()  :0
44 0x000000000018ab32 PyObject_CallFunctionObjArgs()  ???:0
45 0x000000000019910b PyObject_Call()  ???:0
46 0x000000000017b6ef _PyEval_EvalFrameDefault()  ???:0
47 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
48 0x000000000018061d _PyObject_FastCallDictTstate()  ???:0
49 0x000000000019562c _PyObject_Call_Prepend()  ???:0
50 0x000000000029d464 PyInit__datetime()  ???:0
51 0x000000000018139b _PyObject_MakeTpCall()  ???:0
52 0x000000000017b99e _PyEval_EvalFrameDefault()  ???:0
53 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
54 0x000000000017597f _PyEval_EvalFrameDefault()  ???:0
55 0x000000000018b38c _PyFunction_Vectorcall()  ???:0
56 0x0000000000175790 _PyEval_EvalFrameDefault()  ???:0
=================================
Fatal Python error: Segmentation fault

jt-z (Feb 13 '25 07:02)

My previous startup command included the --enable-dp-attention option. After I disabled it, the model served hundreds of requests normally. I noticed that other users' commands did not enable this option, so I am not sure whether it is the cause.
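For reference, a sketch of the two launch variants I compared (the model path, tp/dp sizes, and other flags here are illustrative placeholders, not my exact command):

```shell
# Crashing variant: DP attention enabled
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 --dp 8 --enable-dp-attention \
  --trust-remote-code

# Stable variant in my tests: the same command without --enable-dp-attention
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 \
  --trust-remote-code
```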

jt-z (Feb 13 '25 09:02)

cc @FrankLeeeee, could you take a look?

zhaochenyang20 (Feb 14 '25 00:02)

I am looking into this issue. The error occurs during the model forward pass, and based on @jt-z's description it appears related to DP attention. May I know how long it takes for this error to appear after the server boots? @jt-z

FrankLeeeee (Feb 14 '25 02:02)

I found that this problem seems to occur with this option enabled only after the image was updated yesterday. My impression is that there was no problem when the dp option was enabled on the image tagged CUDA 12.4. After the image was updated to the one showing the Triton icon, adding the dp option caused this error once inference requests had been served successfully.

jt-z (Feb 14 '25 16:02)

I ran R1 on 2x8 H100 without --enable-dp-attention and got the same error.

xiayandi (Feb 15 '25 01:02)

I encountered the same issue. The error occurs randomly, for example:

[master0:107 :0:107] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1)
Fatal Python error: Segmentation fault

phoenixsqf (Feb 24 '25 08:02)

Sorry to bring this up once again; I am losing track of this DeepSeek issue. How is it going? @FrankLeeeee

zhaochenyang20 (Feb 24 '25 09:02)

Got the same error with 2x8 H100, without enabling --enable-dp-attention.

HermitSun (Feb 28 '25 10:02)

Try enabling the --disable-custom-all-reduce flag to see whether it works around the issue.
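A minimal sketch of a launch command with that flag (the model path and parallelism settings are placeholders, not a verified configuration):

```shell
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 \
  --disable-custom-all-reduce
```

This falls back to NCCL for all-reduce instead of the custom kernel, which may avoid the crash at some cost in latency.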

kuaikuai (Mar 03 '25 01:03)

In my case, I found that after reverting from v0.4.3 to v0.4.2.post3, the crash has not occurred again. Could this be caused by changes introduced between these versions?
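For anyone who wants to try the same downgrade, a sketch of the pin (assuming a pip-based install; adjust the extras for your environment):

```shell
pip install "sglang[all]==0.4.2.post3"
```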

HermitSun (Mar 03 '25 02:03)

Unfortunately, I have not found the specific cause. What I do now is restart the instance once I hit this error; sometimes that works.

Same workaround here, but some instances restart too often, which degrades the service.

After I upgraded NCCL from 2.21.5 to 2.25.1, the crash did not occur again.
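The upgrade can be done by reinstalling the NCCL wheel that torch pulls in (shown here for the CUDA 12 wheel; the exact package name and version availability may differ for your setup):

```shell
pip install --force-reinstall nvidia-nccl-cu12==2.25.1

# Verify which NCCL version torch now sees
python -c "import torch; print(torch.cuda.nccl.version())"
```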

sitabulaixizawaluduo (Mar 05 '25 02:03)

I got the same issue. Has this been solved?

xiaoxlm (Mar 20 '25 05:03)

Same issue!

mgw2168-1 (Apr 17 '25 07:04)

I believe this is a problem in the NCCL library itself. I also had to force an upgrade from nccl-2.21.5 to nccl-2.25.1, and the problem was resolved. pip warns that the torch and NCCL versions do not match, but the model runs normally.

jt-z (Apr 18 '25 01:04)