[Bug] DeepSeek-V3: an illegal memory access was encountered
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
Describe the bug
lmsysorg/sglang:v0.4.2.post1-cu125
The image hits an illegal memory access fault at large batch sizes (e.g., 32) when running the bench_one_batch benchmark.
CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.bench_one_batch --batch-size 32 --input 2048 --output 2 --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --disable-cuda-graph
gives this:
[2025-02-05 21:40:57 TP6] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for MoE layer.
Process Process-4:
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 432, in latency_test
latency_test_run_once(
File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 354, in latency_test_run_once
next_token_ids, _, batch = extend(reqs, model_runner)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 243, in extend
logits_output = model_runner.forward(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 785, in forward
return self.forward_extend(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 750, in forward_extend
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 858, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 819, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 774, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
final_hidden_states = self.quant_method.apply(
File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 820, in apply
return fused_experts(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 843, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in call
return self._op(*args, **(kwargs or {}))
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 723, in inplace_fused_experts
fused_experts_impl(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1006, in fused_experts_impl
invoke_fused_moe_kernel(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 511, in invoke_fused_moe_kernel
fused_moe_kernel[grid](
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 345, in
Reproduction
CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.bench_one_batch --batch-size 32 --input 2048 --output 2 --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --disable-cuda-graph
Environment
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
- 'fields' has been removed
  warnings.warn(message, UserWarning)
Python: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.5, V12.5.82
CUDA Driver Version: 555.42.06
PyTorch: 2.5.1+cu124
sglang: 0.4.2.post1
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.60.2
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  SYS   SYS   0-31,128-159    0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  SYS   SYS   0-31,128-159    0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  SYS   SYS   0-31,128-159    0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  NODE  32-63,160-191   1              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   64-95,192-223   2              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   64-95,192-223   2              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   64-95,192-223   2              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   96-127,224-255  3              N/A
NIC0  SYS   SYS   SYS   NODE  SYS   SYS   SYS   SYS   X     PIX
NIC1  SYS   SYS   SYS   NODE  SYS   SYS   SYS   SYS   PIX   X
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
ulimit soft: 1048576
Don't use bench_one_batch.
Hi @zhyncs, thank you for the quick reply!
Could you elaborate a bit on the differences between bench_one_batch and launch_server + bench_serving? Does bench_one_batch process batch*input_tokens differently from online serving?
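For anyone comparing the two paths, here is a minimal sketch of the serving-based workflow (launch_server in one terminal, bench_serving in another). The flags below are assumptions based on this sglang version's CLI, not taken from this issue; the random-dataset length flags in particular may be named differently across releases, so check --help before running:
# terminal 1: start the online server
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
# terminal 2: benchmark the running server (lengths chosen to mirror the repro command above)
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 2048 --random-output-len 2 --num-prompts 32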
mark
mark! The same problem occurs with tp 16 on two H800*8 nodes, built from source at version 3c7bfd7eabed5e29cf907dba3e2ed875d7a92fd4.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.