[Bug] fused_moe OOM when running deepseek-r1 with --speculative-algo NEXTN
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose; otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
Describe the bug
When using --speculative-algo NEXTN, the server runs out of memory (OOM) after about 30 minutes. Error log:
opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
warnings.warn('resource_tracker: process died unexpectedly, '
[2025-02-17 09:44:25 TP15] Scheduler hit an exception: Traceback (most recent call last):
File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1827, in run_scheduler_process
scheduler.event_loop_normal()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 478, in event_loop_normal
result = self.run_batch(batch)
File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1089, in run_batch
) = self.draft_worker.forward_batch_speculative_generation(batch)
File "/kesgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 143, in forward_batch_speculative_generation
logits_output, next_token_ids = self.target_worker.forward_batch_generation(
File "/kesgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/kesgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 795, in forward
return self.forward_extend(forward_batch)
File "/kesgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 760, in forward_extend
return self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 871, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 832, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 787, in forward
hidden_states = self.mlp(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
final_hidden_states = self.quant_method.apply(
File "/kesgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 820, in apply
return fused_experts(
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 851, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 731, in inplace_fused_experts
fused_experts_impl(
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 946, in fused_experts_impl
intermediate_cache3 = torch.empty(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.69 GiB. GPU 7 has a total capacity of 95.00 GiB of which 638.31 MiB is free. Process 2047768 has 94.37 GiB memory in use. Of the allocated memory 86.76 GiB is allocated by PyTorch, with 3.50 GiB allocated in private pools (e.g., CUDA Graphs), and 2.48 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Reproduction
I run deepseek-r1 on 2 nodes with 8x H20 each (tp 16). The rank-0 server start command is:
python3 -m sglang.launch_server --model $MODEL_PATH --tp 16 --nccl-init-addr $SERVER_HOST:$SERVER_PORT --port 8000 --host 0.0.0.0 --nnodes 2 --node-rank 0 --trust-remote-code --speculative-algo NEXTN --speculative-draft $MTP_MODEL_PATH --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --watchdog-timeout 1800 --enable-torch-compile --torch-compile-max-bs 2 --log-requests --served-model-name deepseek-r1 --mem-fraction-static 0.9
Environment
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.161.07
PyTorch: 2.5.1+cu124
sglang: 0.4.3
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post1+cu124torch2.5
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 23.1
psutil: 6.1.1
pydantic: 2.10.1
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.61.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX PHB NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE PHB PIX NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE PIX SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE NODE NODE 96-191,288-383 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE PIX NODE NODE 96-191,288-383 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE PIX PHB 96-191,288-383 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE PHB PIX 96-191,288-383 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS
NIC1 NODE PIX PHB NODE SYS SYS SYS SYS NODE X PHB NODE SYS SYS SYS SYS
NIC2 NODE PHB PIX NODE SYS SYS SYS SYS NODE PHB X NODE SYS SYS SYS SYS
NIC3 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE
NIC5 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE
NIC6 SYS SYS SYS SYS NODE NODE PIX PHB SYS SYS SYS SYS NODE NODE X PHB
NIC7 SYS SYS SYS SYS NODE NODE PHB PIX SYS SYS SYS SYS NODE NODE PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
NIC4: mlx5_bond_4
NIC5: mlx5_bond_5
NIC6: mlx5_bond_6
NIC7: mlx5_bond_7
ulimit soft: 1048576
The first buffer, intermediate_cache1, can be released by the time we compute intermediate_cache3. How about reusing intermediate_cache1's storage for it? We could achieve this by updating how the two caches are initialized, as sketched below.
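A minimal sketch of that idea, assuming the allocation pattern in fused_experts_impl; the helper name allocate_moe_caches and the symbols M, top_k, N, K are illustrative, not the actual fused_moe.py code:

```python
import torch

# Sketch only: back intermediate_cache1 and intermediate_cache3 with one shared
# allocation, since cache1 is dead by the time cache3 is written.
# intermediate_cache2 is still read while cache3 is produced, so it keeps its own buffer.
def allocate_moe_caches(M, top_k, N, K, device, dtype):
    # M: tokens, top_k: experts per token, N: w1 output dim, K: hidden size (w2 output dim)
    cache13 = torch.empty(M * top_k * max(N, K), device=device, dtype=dtype)
    intermediate_cache1 = cache13[: M * top_k * N].view(M, top_k, N)  # view into cache13
    intermediate_cache3 = cache13[: M * top_k * K].view(M, top_k, K)  # reuses the same memory
    intermediate_cache2 = torch.empty((M * top_k, N // 2), device=device, dtype=dtype)
    return intermediate_cache1, intermediate_cache2, intermediate_cache3
```

This way the two caches together need max(N, K) elements per (token, expert) pair instead of N + K, and there is no longer a separate torch.empty for intermediate_cache3 like the ones failing in the tracebacks above, which should lower peak workspace memory.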
I also hit this error on 8x H20 after merging #3692.
[2025-02-21 10:35:12 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/device_communicators/custom_all_reduce.py", line 313, in capture
yield
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py", line 352, in graph_capture
yield graph_capture_context
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py", line 944, in graph_capture
yield context
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 304, in capture
) = self.capture_one_batch_size(bs, forward)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 397, in capture_one_batch_size
out = run_once()
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 380, in run_once
logits_output = forward(input_ids, forward_batch.positions, forward_batch)
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 875, in forward
return self.logits_processor(
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/layers/logits_processor.py", line 170, in forward
logits = self._get_logits(pruned_states, lm_head, logits_metadata)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/layers/logits_processor.py", line 248, in _get_logits
logits = logits[:, : self.config.vocab_size].float()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 62.00 MiB. GPU 1 has a total capacity of 95.22 GiB of which 62.81 MiB is free. Process 1422778 has 95.15 GiB memory in use. Of the allocated memory 90.16 GiB is allocated by PyTorch, with 423.26 MiB allocated in private pools (e.g., CUDA Graphs), and 226.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 237, in __init__
self.capture()
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 287, in capture
with graph_capture() as graph_capture_context:
File "/opt/app/python3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py", line 941, in graph_capture
with get_tp_group().graph_capture() as context, get_pp_group().graph_capture(
File "/opt/app/python3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py", line 325, in graph_capture
with torch.cuda.stream(stream), maybe_ca_context:
File "/opt/app/python3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/device_communicators/custom_all_reduce.py", line 317, in capture
self.register_graph_buffers()
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/device_communicators/custom_all_reduce.py", line 329, in register_graph_buffers
dist.broadcast_object_list(
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3129, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.24.16.44]:31892
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 240, in __init__
self.tp_worker = TpWorkerClass(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 68, in __init__
self.model_runner = ModelRunner(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 224, in __init__
self.init_cuda_graphs()
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 741, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 239, in __init__
raise Exception(
Exception: Capture cuda graph failed: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.24.16.44]:31892
@zhyncs we also have this error.
Same error. Does anyone know whether https://github.com/sgl-project/sglang/pull/3692 solves the problem?
[2025-03-14 07:36:11 DP1 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 172, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 909, in forward
return self.forward_extend(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 870, in forward_extend
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1084, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1038, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 988, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 197, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 608, in forward
final_hidden_states = self.quant_method.apply(
File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 860, in apply
return fused_experts(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 889, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 758, in inplace_fused_experts
fused_experts_impl(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 984, in fused_experts_impl
cache = torch.empty(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.00 GiB. GPU 1 has a total capacity of 79.10 GiB of which 5.49 GiB is free. Process 1555591 has 73.60 GiB memory in use. Of the allocated memory 68.95 GiB is allocated by PyTorch, with 2.09 GiB allocated in private pools (e.g., CUDA Graphs), and 48.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
I also hit this error on 0.4.5.post1. Please reopen this issue, thanks~