[Bug] RuntimeError: RMSNorm failed with error code invalid configuration argument
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.
Describe the bug
Hi, I am using the main branch of SGLang and downloaded Mixtral-8x22B from Hugging Face.
CUDA: 12.4. 2 nodes, each with 4 H100 96GB GPUs.
I am deploying the server using:
python -m sglang.launch_server --model-path Mixtral-8x22B-v0.1 --tp 8 --dist-init-addr xxx:5000 --nnodes 2 --node-rank 0 --trust-remote-code --disable-cuda-graph
python -m sglang.launch_server --model-path Mixtral-8x22B-v0.1 --tp 8 --dist-init-addr xxx:5000 --nnodes 2 --node-rank 1 --trust-remote-code --disable-cuda-graph
And I am running the MMLU benchmark:
cd sglang/benchmark/mmlu
python3 bench_sglang.py --nsub 10
It throws the following error:
[2025-02-04 21:18:29 DP3 TP3] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 109, in forward_thread_func
self.forward_thread_func_()
File "python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 140, in forward_thread_func_
logits_output, next_token_ids = self.worker.forward_batch_generation(
File "sglang/python/sglang/srt/managers/tp_worker.py", line 164, in forward_batch_generation
logits_output = self.model_runner.forward(forward_batch)
File "sglang/python/sglang/srt/model_executor/model_runner.py", line 787, in forward
return self.forward_idle(forward_batch)
File "sglang/python/sglang/srt/model_executor/model_runner.py", line 770, in forward_idle
return self.model.forward(
File "sglang/python/sglang/srt/models/mixtral.py", line 314, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "sglang/python/sglang/srt/models/mixtral.py", line 286, in forward
hidden_states, residual = layer(
File "python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "sglang/python/sglang/srt/models/mixtral.py", line 232, in forward
hidden_states = self.input_layernorm(hidden_states)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "python3.10/site-packages/vllm/model_executor/custom_op.py", line 26, in forward
return self._forward_method(*args, **kwargs)
File "sglang/python/sglang/srt/layers/layernorm.py", line 59, in forward_cuda
out = rmsnorm(x, self.weight.data, self.variance_epsilon)
File "python3.10/site-packages/sgl_kernel/ops/__init__.py", line 156, in rmsnorm
torch.ops.sgl_kernels.rmsnorm(out, input, weight, eps, _get_cuda_stream(device))
File "python3.10/site-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
File "python3.10/site-packages/torch/utils/_device.py", line 106, in __torch_function__
return func(*args, **kwargs)
File "python3.10/site-packages/torch/_ops.py", line 1116, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: RMSNorm failed with error code invalid configuration argument
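For reference, the traceback goes through forward_idle, so my guess is that an empty (zero-token) batch is reaching the fused RMSNorm kernel; a CUDA kernel launched with a zero-sized grid fails with exactly "invalid configuration argument". Below is a minimal sketch of that hypothesis (not a confirmed root cause), assuming RMSNorm is importable from sglang.srt.layers.layernorm as the paths in the traceback suggest:

```python
# Hypothesis sketch: push a zero-token tensor through SGLang's RMSNorm layer,
# which dispatches to the sgl_kernel rmsnorm CUDA kernel. A zero-row input would
# launch the kernel with an empty grid, which CUDA reports as
# "invalid configuration argument". Import path and behavior are assumptions.
import torch
from sglang.srt.layers.layernorm import RMSNorm  # path taken from the traceback above

hidden_size = 6144  # Mixtral-8x22B hidden size
norm = RMSNorm(hidden_size).cuda().half()

x = torch.empty((0, hidden_size), dtype=torch.float16, device="cuda")  # zero tokens
out = norm(x)  # expected to reproduce: "RMSNorm failed with error code invalid configuration argument"
```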
Reproduction
Model: Mixtral 8x22B
Script: MMLU benchmark
Please see the commands above.
Environment
Python: 3.10.16 | packaged by conda-forge | (main, Dec 5 2024, 14:16:10) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA H100
GPU 0,1,2,3 Compute Capability: 9.0
CUDA_HOME: cuda/gcc/11.3.1/12.4.1-r5e7ajh
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.90.12
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.2
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.28.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.61.0
anthropic: 0.45.2
decord: 0.6.0
@jhinpan This bug is clear. I think you can try to set up the SGLang dev environment and reproduce it to check what's wrong.
I am hitting the same error!
Oh, thanks. We can follow up on this and find someone to fix it. @jhinpan, would it be OK if I ask others about this and you help them?
Let me give it a try. Thanks
@Ziyi-Wang Great. If you need any help, please feel free to reach out.
Yeah, thanks @Ziyi-Wang. If you run into any errors, just cc me. I can also help take a look when I have time.
@Ziyi-Wang @jhinpan Great! Thanks a lot!
docker run --gpus '"device=1,2,3,4"' \
--shm-size 32g \
-p 8000:8000 \
-v /home/server/DeepSeek-R1-Distill-Qwen-32B-AWQ:/DeepSeek-R1-Distill-Qwen-32B-AWQ \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path /DeepSeek-R1-Distill-Qwen-32B-AWQ --host 0.0.0.0 --port 8000 --tp 4 --trust-remote-code --watchdog-timeout 36000 --disable-cuda-graph --mem-fraction-static 0.9 --context-length 4096 --enable-dp-attention
When I add the --enable-dp-attention option, "RuntimeError: RMSNorm failed with error code invalid configuration argument" occurs; if I remove this option the error does not occur, but throughput is low.
I am using the Docker image lmsysorg/sglang:latest (hash a24698f5bb2). @jhinpan
same problem
This seems to be a Docker error?
Same issue here; it still exists when running without Docker.
cc @merrymercy
Same issue, although I use Docker. I need to delete the --enable-dp-attention setting to avoid this problem, but my generation speed gets worse.
python -m sglang.launch_server --model-path /odb/zh/gte_Qwen2-7B-instruct \
--host 0.0.0.0 --is-embedding
When using sglang to run the embedding model with the OpenAI SDK, an empty input (input="") reliably causes a RuntimeError: RMSNorm failed (invalid configuration argument).
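A minimal sketch of how the empty-input case can be sent with the OpenAI SDK (the base URL, port, and model name below are placeholders for this particular deployment):

```python
# Sends an empty-string embedding request to a local SGLang server started with
# --is-embedding; on this setup it reliably triggers the RMSNorm error server-side.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="gte_Qwen2-7B-instruct",  # placeholder model name for this deployment
    input="",  # empty input triggers the failure
)
```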
I am hitting the same error!
Same error if the --disable-cuda-graph parameter is turned on.
(image: lmsysorg/sglang:v0.4.5-cu121)
python3 -m sglang.launch_server \
--model /models/deepseek-r1-distill-qwen-7b \
--tp 2 \
--dp 2 \
--enable-dp-attention \
--disable-cuda-graph
Same error, does anyone have a solution? My env is:
sgl-kernel 0.0.8
sglang 0.4.5
flashinfer-python 0.2.3
torch 2.5.1
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-ml-py 12.570.86
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
Adding --chat-template qwen2-vl when serving the Qwen2.5-VL series solved my problem, thanks.
Is it really fixed? I got the same error with sglang=0.4.6
I've got similar problem here: https://github.com/sgl-project/sglang/issues/7249
When deploying with DP+EP, if the EP size is less than 32 and moe_dense_tp_size=1 is enabled, you may encounter the following error:
FusedAddRMSNorm: failed with error code invalid configuration argument.
In this case, you need to remove this parameter.
Same issue when hosting DeepSeek-R1-Distill-Qwen-14B on 8 H100s with the command python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp 8 --dist-init-addr ${IP}:5000 --trust-remote-code --host 0.0.0.0 --port 30000 --enable-dp-attention --dp-size 8 --enable-torch-compile --torch-compile-max-bs 8.
It seems this issue was closed by accident in https://github.com/sgl-project/sglang/pull/5621; that PR doesn't actually fix it, since the error occurs regardless of the embeddings endpoint. I will reopen this since it seems like a more fundamental environment issue.