[Bug] RuntimeError: RMSNorm failed with error code invalid configuration argument
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
Hi, I am using the main branch of SGLang and downloaded Mixtral-8x22B from Hugging Face.
CUDA: 12.4. 2 nodes, each with 4 H100 96GB GPUs.
I am deploying the server using:
python -m sglang.launch_server --model-path Mixtral-8x22B-v0.1 --tp 8 --dist-init-addr xxx:5000 --nnodes 2 --node-rank 0 --trust-remote-code --disable-cuda-graph
python -m sglang.launch_server --model-path Mixtral-8x22B-v0.1 --tp 8 --dist-init-addr xxx:5000 --nnodes 2 --node-rank 1 --trust-remote-code --disable-cuda-graph
And I am running the MMLU benchmark:
cd sglang/benchmark/mmlu
python3 bench_sglang.py --nsub 10
It throws the following error:
[2025-02-04 21:18:29 DP3 TP3] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
self.forward_thread_func_()
File "python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "sglang/python/sglang/srt/model_executor/model_runner.py", line 787, in forward
return self.forward_idle(forward_batch)
File "sglang/python/sglang/srt/model_executor/model_runner.py", line 770, in forward_idle
return self.model.forward(
File "sglang/python/sglang/srt/models/mixtral.py", line 314, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "sglang/python/sglang/srt/models/mixtral.py", line 286, in forward
hidden_states, residual = layer(
File "python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "sglang/python/sglang/srt/models/mixtral.py", line 232, in forward
hidden_states = self.input_layernorm(hidden_states)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "python3.10/site-packages/vllm/model_executor/custom_op.py", line 26, in forward
return self._forward_method(*args, **kwargs)
File "sglang/python/sglang/srt/layers/layernorm.py", line 59, in forward_cuda
out = rmsnorm(x, self.weight.data, self.variance_epsilon)
File "python3.10/site-packages/sgl_kernel/ops/__init__.py", line 156, in rmsnorm
torch.ops.sgl_kernels.rmsnorm(out, input, weight, eps, _get_cuda_stream(device))
File "python3.10/site-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
File "python3.10/site-packages/torch/utils/_device.py", line 106, in __torch_function__
return func(*args, **kwargs)
File "python3.10/site-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: RMSNorm failed with error code invalid configuration argument
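For reference, the traceback goes through forward_idle, so my guess is that an empty (zero-token) batch is reaching the fused RMSNorm kernel; a CUDA kernel launched with a zero-sized grid fails with exactly "invalid configuration argument". Below is a minimal sketch of that hypothesis (not a confirmed root cause), assuming RMSNorm is importable from sglang.srt.layers.layernorm as the paths in the traceback suggest:

```python
# Hypothesis sketch: push a zero-token tensor through SGLang's RMSNorm layer,
# which dispatches to the sgl_kernel rmsnorm CUDA kernel. A zero-row input would
# launch the kernel with an empty grid, which CUDA reports as
# "invalid configuration argument". Import path and behavior are assumptions.
import torch
from sglang.srt.layers.layernorm import RMSNorm  # path taken from the traceback above

hidden_size = 6144  # Mixtral-8x22B hidden size
norm = RMSNorm(hidden_size).cuda().half()

x = torch.empty((0, hidden_size), dtype=torch.float16, device="cuda")  # zero tokens
out = norm(x)  # expected to reproduce: "RMSNorm failed with error code invalid configuration argument"
```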
Reproduction
Model: Mixtral 8x22B
Script: MMLU benchmark
Please see the commands above.
Environment
Python: 3.10.16 | packaged by conda-forge | (main, Dec 5 2024, 14:16:10) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA H100
GPU 0,1,2,3 Compute Capability: 9.0
CUDA_HOME: cuda/gcc/11.3.1/12.4.1-r5e7ajh
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.90.12
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.61.0
anthropic: 0.45.2
decord: 0.6.0
@jhinpan This bug is clear. I think you can try to set up the SGLang dev environment and reproduce it to check what's wrong.
I am hitting the same error!
Oh, thanks. We can follow up on this and find someone to fix it. @jhinpan, would it be OK if I ask others about this and you help them?
Let me give it a try. Thanks
@Ziyi-Wang Great. If you need any help, please feel free to reach out.
Yeah, thanks @Ziyi-Wang. If you run into any errors, just cc me. I can also help take a look when I have time.
@Ziyi-Wang @jhinpan Great! Thanks a lot!
docker run --gpus '"device=1,2,3,4"' \
--shm-size 32g \
-p 8000:8000 \
-v /home/server/DeepSeek-R1-Distill-Qwen-32B-AWQ:/DeepSeek-R1-Distill-Qwen-32B-AWQ \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path /DeepSeek-R1-Distill-Qwen-32B-AWQ --host 0.0.0.0 --port 8000 --tp 4 --trust-remote-code --watchdog-timeout 36000 --disable-cuda-graph --mem-fraction-static 0.9 --context-length 4096 --enable-dp-attention
When I add the --enable-dp-attention option, "RuntimeError: RMSNorm failed with error code invalid configuration argument" occurs; if I remove this option the error does not occur, but throughput is low.
I am using the Docker image lmsysorg/sglang:latest (hash a24698f5bb2). @jhinpan
same problem
This seems to be a Docker error?
Same issue here; it still exists when running without Docker.
cc @merrymercy
Same issue, although I use Docker. I need to delete the --enable-dp-attention setting to avoid this problem, but my generation speed gets worse.
python -m sglang.launch_server --model-path /odb/zh/gte_Qwen2-7B-instruct \
--host 0.0.0.0 --is-embedding
When using sglang to run the embedding model with the OpenAI SDK, an empty input (input="") reliably causes a RuntimeError: RMSNorm failed (invalid configuration argument).
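A minimal sketch of how the empty-input case can be sent with the OpenAI SDK (the base URL, port, and model name below are placeholders for this particular deployment):

```python
# Sends an empty-string embedding request to a local SGLang server started with
# --is-embedding; on this setup it reliably triggers the RMSNorm error server-side.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="gte_Qwen2-7B-instruct",  # placeholder model name for this deployment
    input="",  # empty input triggers the failure
)
```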
I am hitting the same error!
Same error if the --disable-cuda-graph parameter is turned on.
(image: lmsysorg/sglang:v0.4.5-cu121)
python3 -m sglang.launch_server \
--model /models/deepseek-r1-distill-qwen-7b \
--tp 2 \
--dp 2 \
--enable-dp-attention \
--disable-cuda-graph
Same error, does anyone have a solution? My env is:
sgl-kernel 0.0.8
sglang 0.4.5
flashinfer-python 0.2.3
torch 2.5.1
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-ml-py 12.570.86
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
Adding --chat-template qwen2-vl when serving the Qwen2.5-VL series solved my problem, thanks.
Is it really fixed? I got the same error with sglang=0.4.6
I've got similar problem here: https://github.com/sgl-project/sglang/issues/7249
When deploying with DP+EP, if the EP size is less than 32 and moe_dense_tp_size=1 is enabled, you may encounter the following error:
FusedAddRMSNorm: failed with error code invalid configuration argument.
In this case, you need to remove this parameter.
Same issue when hosting DeepSeek-R1-Distill-Qwen-14B on 8 H100s with the command python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp 8 --dist-init-addr ${IP}:5000 --trust-remote-code --host 0.0.0.0 --port 30000 --enable-dp-attention --dp-size 8 --enable-torch-compile --torch-compile-max-bs 8.
It seems this issue was closed by accident in https://github.com/sgl-project/sglang/pull/5621; that PR doesn't actually fix it, since the error occurs regardless of the embeddings endpoint. I will reopen this since it seems like a more fundamental environment issue.