
[FIX] Add hidden_size attribute to FusedMoE to bypass the deployment error of Qwen3-30B-A3B-AWQ

Open SecretSettler opened this issue 7 months ago • 8 comments

Motivation

Resolves #6000.

Modifications

Add hidden_size attribute to FusedMoE class.
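
For reference, a minimal sketch of the kind of change this describes (illustrative only; the real FusedMoE constructor in sglang takes more arguments, and the exact failure path is an assumption):

import torch

class FusedMoE(torch.nn.Module):
    # Assumption: the AWQ/Marlin loading path reads layer.hidden_size and
    # fails with an AttributeError when FusedMoE does not store it.
    def __init__(self, num_experts: int, top_k: int, hidden_size: int,
                 intermediate_size: int, **kwargs):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.hidden_size = hidden_size  # the attribute this PR adds
        self.intermediate_size = intermediate_size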

Checklist

  • [x] Format your code according to the Code Formatting with Pre-Commit.
  • [x] Add unit tests as outlined in the Running Unit Tests.
  • [x] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • [x] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
  • [x] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
  • [x] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

SecretSettler avatar May 05 '25 12:05 SecretSettler

I am also interested in this PR. I wanted to deploy the 235B in AWQ.

chriswritescode-dev avatar May 06 '25 10:05 chriswritescode-dev

I am also interested in this PR. I wanted to deploy the 235B in AWQ.

I think the 235B AWQ model still hits the problem described in the issue, because it requires tensor parallelism; that part may be due to vLLM or something in between.

SecretSettler avatar May 06 '25 10:05 SecretSettler

I was able to get the MoE working with TP by installing a later, not-yet-released build of vLLM. Specifically, the last commit before PyTorch was upgraded to 2.7.0 (which made it incompatible with sglang): 1c2bc7ead019cdf5b04b2f1d07b00982352f85ef

You can also do so by running:

# Pin vLLM to the last commit before its PyTorch 2.7.0 upgrade
export VLLM_COMMIT=1c2bc7ead019cdf5b04b2f1d07b00982352f85ef
pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
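
To sanity-check which build actually got installed (not part of the original instructions, just a quick check):

# A dev wheel from wheels.vllm.ai should report something like 1.0.0.dev
python3 -c "import vllm; print(vllm.__version__)"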

Crazy speeds: 3000 t/s prefill and 120 t/s decode on 2x RTX 4060 Ti. The exact command I used to start it (you probably don't need all of it) is:

#!/bin/bash

# Activate the virtualenv that sglang is installed in
. /mnt/no-backup/sglang-venv/bin/activate

# Stop Ray's memory monitor from killing workers under memory pressure
export RAY_memory_monitor_refresh_ms=0

python3 -m sglang.launch_server \
  --model-path /mnt/no-backup/models/Qwen3-30B-A3B-AWQ \
  --served-model-name=qwen3-30b-a3b \
  --quantization="awq_marlin" \
  --context-length=40960 \
  --chunked-prefill-size 30000 \
  --disable-custom-all-reduce \
  --tensor-parallel-size=2 \
  --tool-call-parser qwen25 \
  --mem-fraction-static=0.75 \
  --sampling-backend pytorch \
  --host=0.0.0.0 --port=5000 \
  --enable-torch-compile
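
Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (matching the --served-model-name and --port above; adjust if yours differ):

curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-30b-a3b", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'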

JohnTheNerd avatar May 06 '25 23:05 JohnTheNerd

I was able to get the MoE working with TP by installing a later, not-yet-released build of vLLM. […]

Hi, thanks! I think this PR fixes the problem when tp=1, provided your GPU can fit Qwen3-30B-A3B.

SecretSettler avatar May 07 '25 12:05 SecretSettler

I was able to get the MoE working with TP by installing a later, not-yet-released build of vLLM. […]

Hi, thanks! I think this PR fixes the problem when tp=1 […]

Yes, I combined the instructions above with the PR to run Qwen3-30B-A3B with TP=2. This probably also makes the larger Qwen3 models runnable (I don't have the hardware to test).

JohnTheNerd avatar May 07 '25 12:05 JohnTheNerd

Very good, it works!

zcfrank1st avatar May 11 '25 07:05 zcfrank1st

Great! I encountered the same fusedmoe hidden_size issue. I modified the line as the PR indicates and it worked for Qwen3-235B too, with --tp 4. (screenshot attached)

idontwantagirlfriend avatar May 11 '25 23:05 idontwantagirlfriend

Great! I encountered the same fusedmoe hidden_size issue. […]

When running Qwen3-235B-AWQ I ran into more issues. Following the PR instructions, I resolved the fusedmoe hidden_size problem by installing the specific older vLLM build, but ultimately hit the error: "tensor model parallel is not initialized".

Since I'm using eight 4090D GPUs, I set TP=8. The documentation for qwen3-235b-int4 mentions it supports at most TP=4; I'm wondering if this might be a similar limitation.

I can run Qwen3-235B-AWQ on the latest vLLM release, but I only get about 5 tokens/second, which is quite strange.

svenlancelo avatar May 12 '25 14:05 svenlancelo