[Bug] fused_moe OOM when running deepseek-r1 with --speculative-algo NEXTN
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose; otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
Describe the bug
When using --speculative-algo NEXTN, the server runs out of memory (OOM) after about 30 minutes. Error log:
opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
warnings.warn('resource_tracker: process died unexpectedly, '
[2025-02-17 09:44:25 TP15] Scheduler hit an exception: Traceback (most recent call last):
File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1827, in run_scheduler_process
scheduler.event_loop_normal()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 478, in event_loop_normal
result = self.run_batch(batch)
File "/kesgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1089, in run_batch
) = self.draft_worker.forward_batch_speculative_generation(batch)
File "/kesgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 143, in forward_batch_speculative_generation
logits_output, next_token_ids = self.target_worker.forward_batch_generation(
File "/kesgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/kesgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 795, in forward
return self.forward_extend(forward_batch)
File "/kesgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 760, in forward_extend
return self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 871, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 832, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 787, in forward
hidden_states = self.mlp(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
final_hidden_states = self.quant_method.apply(
File "/kesgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 820, in apply
return fused_experts(
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 851, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 731, in inplace_fused_experts
fused_experts_impl(
File "/kesgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 946, in fused_experts_impl
intermediate_cache3 = torch.empty(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.69 GiB. GPU 7 has a total capacity of 95.00 GiB of which 638.31 MiB is free. Process 2047768 has 94.37 GiB memory in use. Of the allocated memory 86.76 GiB is allocated by PyTorch, with 3.50 GiB allocated in private pools (e.g., CUDA Graphs), and 2.48 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Reproduction
I run deepseek-r1 on 2 nodes with 8x H20 each (tp 16). The rank-0 server start command is:
python3 -m sglang.launch_server --model $MODEL_PATH --tp 16 --nccl-init-addr $SERVER_HOST:$SERVER_PORT --port 8000 --host 0.0.0.0 --nnodes 2 --node-rank 0 --trust-remote-code --speculative-algo NEXTN --speculative-draft $MTP_MODEL_PATH --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --disable-radix --watchdog-timeout 1800 --enable-torch-compile --torch-compile-max-bs 2 --log-requests --served-model-name deepseek-r1 --mem-fraction-static 0.9
Environment
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.161.07
PyTorch: 2.5.1+cu124
sglang: 0.4.3
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post1+cu124torch2.5
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 23.1
psutil: 6.1.1
pydantic: 2.10.1
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.61.0
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX PHB NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE PHB PIX NODE SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE PIX SYS SYS SYS SYS 0-95,192-287 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE NODE NODE 96-191,288-383 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE PIX NODE NODE 96-191,288-383 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE PIX PHB 96-191,288-383 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE PHB PIX 96-191,288-383 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS
NIC1 NODE PIX PHB NODE SYS SYS SYS SYS NODE X PHB NODE SYS SYS SYS SYS
NIC2 NODE PHB PIX NODE SYS SYS SYS SYS NODE PHB X NODE SYS SYS SYS SYS
NIC3 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE
NIC5 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE
NIC6 SYS SYS SYS SYS NODE NODE PIX PHB SYS SYS SYS SYS NODE NODE X PHB
NIC7 SYS SYS SYS SYS NODE NODE PHB PIX SYS SYS SYS SYS NODE NODE PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
NIC4: mlx5_bond_4
NIC5: mlx5_bond_5
NIC6: mlx5_bond_6
NIC7: mlx5_bond_7
ulimit soft: 1048576
The first buffer, intermediate_cache1, can be released by the time we compute intermediate_cache3. How about reusing intermediate_cache1's storage for it? We could achieve this by updating how the two caches are initialized, as sketched below.
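A minimal sketch of that idea, assuming the allocation pattern in fused_experts_impl; the helper name allocate_moe_caches and the symbols M, top_k, N, K are illustrative, not the actual fused_moe.py code:

```python
import torch

# Sketch only: back intermediate_cache1 and intermediate_cache3 with one shared
# allocation, since cache1 is dead by the time cache3 is written.
# intermediate_cache2 is still read while cache3 is produced, so it keeps its own buffer.
def allocate_moe_caches(M, top_k, N, K, device, dtype):
    # M: tokens, top_k: experts per token, N: w1 output dim, K: hidden size (w2 output dim)
    cache13 = torch.empty(M * top_k * max(N, K), device=device, dtype=dtype)
    intermediate_cache1 = cache13[: M * top_k * N].view(M, top_k, N)  # view into cache13
    intermediate_cache3 = cache13[: M * top_k * K].view(M, top_k, K)  # reuses the same memory
    intermediate_cache2 = torch.empty((M * top_k, N // 2), device=device, dtype=dtype)
    return intermediate_cache1, intermediate_cache2, intermediate_cache3
```

This way the two caches together need max(N, K) elements per (token, expert) pair instead of N + K, and there is no longer a separate torch.empty for intermediate_cache3 like the ones failing in the tracebacks above, which should lower peak workspace memory.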
I also hit this error on 8x H20 after merging #3692.
[2025-02-21 10:35:12 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/device_communicators/custom_all_reduce.py", line 313, in capture
yield
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py", line 352, in graph_capture
yield graph_capture_context
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py", line 944, in graph_capture
yield context
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 304, in capture
) = self.capture_one_batch_size(bs, forward)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 397, in capture_one_batch_size
out = run_once()
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 380, in run_once
logits_output = forward(input_ids, forward_batch.positions, forward_batch)
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/models/deepseek_v2.py", line 875, in forward
return self.logits_processor(
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/layers/logits_processor.py", line 170, in forward
logits = self._get_logits(pruned_states, lm_head, logits_metadata)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/layers/logits_processor.py", line 248, in _get_logits
logits = logits[:, : self.config.vocab_size].float()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 62.00 MiB. GPU 1 has a total capacity of 95.22 GiB of which 62.81 MiB is free. Process 1422778 has 95.15 GiB memory in use. Of the allocated memory 90.16 GiB is allocated by PyTorch, with 423.26 MiB allocated in private pools (e.g., CUDA Graphs), and 226.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 237, in __init__
self.capture()
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 287, in capture
with graph_capture() as graph_capture_context:
File "/opt/app/python3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py", line 941, in graph_capture
with get_tp_group().graph_capture() as context, get_pp_group().graph_capture(
File "/opt/app/python3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/parallel_state.py", line 325, in graph_capture
with torch.cuda.stream(stream), maybe_ca_context:
File "/opt/app/python3.10/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/device_communicators/custom_all_reduce.py", line 317, in capture
self.register_graph_buffers()
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/distributed/device_communicators/custom_all_reduce.py", line 329, in register_graph_buffers
dist.broadcast_object_list(
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3129, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/opt/app/python3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2425, in broadcast
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.24.16.44]:31892
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 240, in __init__
self.tp_worker = TpWorkerClass(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 68, in __init__
self.model_runner = ModelRunner(
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 224, in __init__
self.init_cuda_graphs()
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 741, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
File "/opt/app/python3.10/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 239, in __init__
raise Exception(
Exception: Capture cuda graph failed: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [10.24.16.44]:31892
@zhyncs we also have this error.
Same error. Does anyone know whether https://github.com/sgl-project/sglang/pull/3692 solves the problem?
[2025-03-14 07:36:11 DP1 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 172, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 909, in forward
return self.forward_extend(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 870, in forward_extend
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1084, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1038, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 988, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 197, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 608, in forward
final_hidden_states = self.quant_method.apply(
File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 860, in apply
return fused_experts(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 889, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 758, in inplace_fused_experts
fused_experts_impl(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 984, in fused_experts_impl
cache = torch.empty(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.00 GiB. GPU 1 has a total capacity of 79.10 GiB of which 5.49 GiB is free. Process 1555591 has 73.60 GiB memory in use. Of the allocated memory 68.95 GiB is allocated by PyTorch, with 2.09 GiB allocated in private pools (e.g., CUDA Graphs), and 48.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.
I also hit this error on 0.4.5.post1. Please reopen this issue, thanks~