sglang
[FIX] Add hidden_size attribute to FusedMoE to bypass the deployment error of Qwen3-30B-A3B-AWQ
Motivation
Resolves issue #6000.
Modifications
Add a hidden_size attribute to the FusedMoE class.
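For context, a minimal sketch of the idea behind the change (simplified, not the exact sglang source; the signature here is illustrative): FusedMoE already receives hidden_size in its constructor, and the fix is to keep it on the layer so downstream quantization code (the AWQ/Marlin MoE path in the reported error) can read layer.hidden_size instead of raising an AttributeError.

```python
import torch

# Minimal sketch, not the exact sglang source: store hidden_size on the layer
# so quantization backends can access `layer.hidden_size`.
class FusedMoE(torch.nn.Module):
    def __init__(self, num_experts: int, top_k: int,
                 hidden_size: int, intermediate_size: int):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.hidden_size = hidden_size              # attribute added by this PR
        self.intermediate_size = intermediate_size
```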
Checklist
- [x] Format your code according to the Code Formatting with Pre-Commit.
- [x] Add unit tests as outlined in the Running Unit Tests.
- [x] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [x] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [x] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [x] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
I am also interested in this PR. I wanted to deploy the 235B model in AWQ.
I think the 235B model in AWQ still hits the problem described in the issue, because it requires tensor parallelism; that part may be due to vLLM or something in between.
I was able to get the MoE working with TP by installing a later build of vLLM that is not released yet. Specifically, the last commit before PyTorch was upgraded to 2.7.0 (making it incompatible with sglang) - 1c2bc7ead019cdf5b04b2f1d07b00982352f85ef
You can also do so by running:
export VLLM_COMMIT=1c2bc7ead019cdf5b04b2f1d07b00982352f85ef
pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
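After installing, a quick way to confirm which build is actually active in the environment (assuming the wheel exposes the usual vllm.__version__ metadata):

```python
# Check that the nightly wheel installed above is the one being imported;
# it should report the 1.0.0.dev build rather than a stable release.
import vllm
print(vllm.__version__)
```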
Crazy speeds: 3000 t/s prefill and 120 t/s decode on 2x RTX 4060 Ti. The exact command I used to start the server (you probably don't need all of it) is:
#!/bin/bash
. /mnt/no-backup/sglang-venv/bin/activate
export RAY_memory_monitor_refresh_ms=0
python3 -m sglang.launch_server \
--model-path /mnt/no-backup/models/Qwen3-30B-A3B-AWQ \
--served-model-name=qwen3-30b-a3b \
--quantization="awq_marlin" \
--context-length=40960 \
--chunked-prefill-size 30000 \
--disable-custom-all-reduce \
--tensor-parallel-size=2 \
--tool-call-parser qwen25 \
--mem-fraction-static=0.75 \
--sampling-backend pytorch \
--host=0.0.0.0 --port=5000 \
--enable-torch-compile
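Once the server is up, a quick sanity check against sglang's OpenAI-compatible endpoint; the port and model name below are taken from the launch flags above, so adjust them to your setup:

```python
import requests

# Minimal request against the OpenAI-compatible chat endpoint exposed by
# sglang.launch_server; port and served model name match the script above.
resp = requests.post(
    "http://localhost:5000/v1/chat/completions",
    json={
        "model": "qwen3-30b-a3b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```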
Hi, thanks! I think this PR fixes the problem when tp=1, if your GPU can fit the Qwen3-30B-A3B.
Yes, I combined the above instructions with the PR to run Qwen3-30B-A3B with TP=2. This probably also makes the larger Qwen3 models runnable (I don't have the hardware to test).
Very good, it works!
Great! I encountered the same FusedMoE hidden_size issue. I modified the line as the PR indicates and it worked for Qwen3-235B too, with --tp 4.
When running Qwen3-235B-AWQ, I ran into more issues. Following the instructions in this PR and installing the specific older vLLM build, I got past the FusedMoE hidden_size problem, but then hit the error: "tensor model parallel is not initialized".
Since I'm using eight 4090D GPUs, I set TP=8. The documentation for Qwen3-235B-Int4 mentions it supports at most TP=4, so I'm wondering if this might be a similar limitation.
I can run Qwen3-235B-AWQ with the latest vLLM version, but I only get 5 tokens/second, which is quite strange.