sglang
[FIX] Add hidden_size attribute to FusedMoE to bypass the deployment error of Qwen3-30B-A3B-AWQ
Motivation
Resolves issue #6000.
Modifications
Add a hidden_size attribute to the FusedMoE class.
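For context, a minimal sketch of the idea behind the change (simplified, not the exact sglang source; the signature here is illustrative): FusedMoE already receives hidden_size in its constructor, and the fix is to keep it on the layer so downstream quantization code (the AWQ/Marlin MoE path in the reported error) can read layer.hidden_size instead of raising an AttributeError.

```python
import torch

# Minimal sketch, not the exact sglang source: store hidden_size on the layer
# so quantization backends can access `layer.hidden_size`.
class FusedMoE(torch.nn.Module):
    def __init__(self, num_experts: int, top_k: int,
                 hidden_size: int, intermediate_size: int):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.hidden_size = hidden_size              # attribute added by this PR
        self.intermediate_size = intermediate_size
```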
Checklist
- [x] Format your code according to the Code Formatting with Pre-Commit.
- [x] Add unit tests as outlined in the Running Unit Tests.
- [x] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [x] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [x] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [x] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
I am also interested in this PR. I wanted to deploy the 235B model in AWQ.
I think the 235B model in AWQ still hits the problem described in the issue, because it requires tensor parallelism; that part may be due to vLLM or something in between.
I was able to get the MoE working with TP by installing a later build of vLLM that is not released yet. Specifically, the last commit before PyTorch was upgraded to 2.7.0 (making it incompatible with sglang) - 1c2bc7ead019cdf5b04b2f1d07b00982352f85ef
You can also do so by running:
export VLLM_COMMIT=1c2bc7ead019cdf5b04b2f1d07b00982352f85ef
pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
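After installing, a quick way to confirm which build is actually active in the environment (assuming the wheel exposes the usual vllm.__version__ metadata):

```python
# Check that the nightly wheel installed above is the one being imported;
# it should report the 1.0.0.dev build rather than a stable release.
import vllm
print(vllm.__version__)
```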
Crazy speeds: 3000 t/s prefill and 120 t/s decode on 2x RTX 4060 Ti. The exact command I used to start the server (you probably don't need all of it) is:
#!/bin/bash
. /mnt/no-backup/sglang-venv/bin/activate
export RAY_memory_monitor_refresh_ms=0
python3 -m sglang.launch_server \
--model-path /mnt/no-backup/models/Qwen3-30B-A3B-AWQ \
--served-model-name=qwen3-30b-a3b \
--quantization="awq_marlin" \
--context-length=40960 \
--chunked-prefill-size 30000 \
--disable-custom-all-reduce \
--tensor-parallel-size=2 \
--tool-call-parser qwen25 \
--mem-fraction-static=0.75 \
--sampling-backend pytorch \
--host=0.0.0.0 --port=5000 \
--enable-torch-compile
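Once the server is up, a quick sanity check against sglang's OpenAI-compatible endpoint; the port and model name below are taken from the launch flags above, so adjust them to your setup:

```python
import requests

# Minimal request against the OpenAI-compatible chat endpoint exposed by
# sglang.launch_server; port and served model name match the script above.
resp = requests.post(
    "http://localhost:5000/v1/chat/completions",
    json={
        "model": "qwen3-30b-a3b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```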
Hi, thanks! I think this PR fixes the problem when tp=1, if your GPU can fit the Qwen3-30B-A3B.
Yes, I combined the above instructions with the PR to run Qwen3-30B-A3B with TP=2. This probably also makes the larger Qwen3 models runnable (I don't have the hardware to test).
Very good, it works!
Great! I encountered the same FusedMoE hidden_size issue. I modified the line as the PR indicates and it worked for Qwen3-235B too, with --tp 4.
When running Qwen3-235B-AWQ, I ran into more issues. Following the instructions in this PR and installing the specific older vLLM build, I got past the FusedMoE hidden_size problem, but then hit the error: "tensor model parallel is not initialized".
Since I'm using eight 4090D GPUs, I set TP=8. The documentation for Qwen3-235B-Int4 mentions it supports at most TP=4, so I'm wondering if this might be a similar limitation.
I can run Qwen3-235B-AWQ with the latest vLLM version, but I only get 5 tokens/second, which is quite strange.