[Bug] DeepSeek-V3: an illegal memory access was encountered
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
Describe the bug
lmsysorg/sglang:v0.4.2.post1-cu125
The image hits an illegal memory access fault at large batch sizes (e.g., 32) when running the bench_one_batch benchmark.
CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.bench_one_batch --batch-size 32 --input 2048 --output 2 --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --disable-cuda-graph
gives this:
[2025-02-05 21:40:57 TP6] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for MoE layer.
Process Process-4:
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 432, in latency_test
latency_test_run_once(
File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 354, in latency_test_run_once
next_token_ids, _, batch = extend(reqs, model_runner)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/bench_one_batch.py", line 243, in extend
logits_output = model_runner.forward(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 785, in forward
return self.forward_extend(forward_batch)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 750, in forward_extend
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 858, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 819, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 774, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 177, in forward
self.experts(hidden_states=hidden_states, router_logits=router_logits)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 589, in forward
final_hidden_states = self.quant_method.apply(
File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/fp8.py", line 820, in apply
return fused_experts(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 843, in fused_experts
torch.ops.sglang.inplace_fused_experts(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in call
return self._op(*args, **(kwargs or {}))
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 723, in inplace_fused_experts
fused_experts_impl(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 1006, in fused_experts_impl
invoke_fused_moe_kernel(
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py", line 511, in invoke_fused_moe_kernel
fused_moe_kernel[grid](
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 345, in
Reproduction
CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.bench_one_batch --batch-size 32 --input 2048 --output 2 --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --disable-cuda-graph
Environment
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
- 'fields' has been removed
  warnings.warn(message, UserWarning)
Python: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.5, V12.5.82
CUDA Driver Version: 555.42.06
PyTorch: 2.5.1+cu124
sglang: 0.4.2.post1
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.60.2
anthropic: 0.45.2
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  SYS   SYS   0-31,128-159    0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  SYS   SYS   0-31,128-159    0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  SYS   SYS   0-31,128-159    0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  NODE  NODE  32-63,160-191   1              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   64-95,192-223   2              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   64-95,192-223   2              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   64-95,192-223   2              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   96-127,224-255  3              N/A
NIC0  SYS   SYS   SYS   NODE  SYS   SYS   SYS   SYS   X     PIX
NIC1  SYS   SYS   SYS   NODE  SYS   SYS   SYS   SYS   PIX   X
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
ulimit soft: 1048576
Don't use bench_one_batch.
Hi @zhyncs, thank you for the quick reply!
Could you elaborate a bit on the differences between bench_one_batch and launch_server + bench_serving? Does bench_one_batch process batch*input_tokens differently from online serving?
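For anyone comparing the two paths, here is a minimal sketch of the serving-based workflow (launch_server in one terminal, bench_serving in another). The flags below are assumptions based on this sglang version's CLI, not taken from this issue; the random-dataset length flags in particular may be named differently across releases, so check --help before running:
# terminal 1: start the online server
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
# terminal 2: benchmark the running server (lengths chosen to mirror the repro command above)
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 2048 --random-output-len 2 --num-prompts 32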
mark
mark! The same problem occurs with tp 16 on two H800*8 nodes, built from source at version 3c7bfd7eabed5e29cf907dba3e2ed875d7a92fd4.
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.