
[Bug]: sm75 cannot serve Qwen3 BNB 4-bit model

Open HuChundong opened this issue 7 months ago • 3 comments

Your current environment

Docker image: vllm/vllm-openai:v0.8.5

```
vllm-openai-1 | (VllmWorkerProcess pid=149) WARNING 04-28 18:00:58 [utils.py:168] The model class Qwen3MoeForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules
vllm-openai-1 | WARNING 04-28 18:00:58 [utils.py:168] The model class Qwen3MoeForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method load_model.
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238] Traceback (most recent call last):
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]     output = run_method(worker, method, args, kwargs)
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 203, in load_model
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]     self.model_runner.load_model()
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1111, in load_model
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]     self.model = get_model(vllm_config=self.vllm_config)
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]     return loader.load_model(vllm_config=vllm_config)
```
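Since the title points at sm75, it may be worth confirming the compute capability of the GPUs in use. A minimal sketch with PyTorch (sm75 corresponds to Turing cards such as the T4 or RTX 20-series):

```python
import torch

# Print the CUDA compute capability of each visible GPU.
# An sm75 (Turing) card reports as (7, 5).
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm{major}{minor}")
```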

🐛 Describe the bug

```yaml
vllm-openai:
  runtime: nvidia
  restart: always
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['2', '3']
            capabilities: [gpu]
  volumes:
    - ~/.cache/huggingface:/root/.cache/huggingface
    - /home/hucd/models:/models
  environment:
    - HUGGING_FACE_HUB_TOKEN=
    - CUDA_VISIBLE_DEVICES=0,1
  ports:
    - 8001:8000
  ipc: host
  image: vllm/vllm-openai:v0.8.5
  command: --model /models/Qwen3-30B-A3B-bnb-4bit --served-model-name qwen3-a3b --tensor_parallel_size 2 --max_model_len 8192 --dtype half --max_num_seqs 1 --gpu_memory_utilization 0.9 --enable-reasoning --reasoning-parser deepseek_r1
```
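The same load failure should also be reproducible without Docker through vLLM's offline `LLM` API; a minimal sketch, assuming vllm 0.8.5 is installed locally and the checkpoint sits at the same path as above:

```python
from vllm import LLM

# Minimal repro sketch: constructing the engine triggers load_model,
# which is where the traceback above is raised for this checkpoint.
llm = LLM(
    model="/models/Qwen3-30B-A3B-bnb-4bit",
    tensor_parallel_size=2,
    max_model_len=8192,
    dtype="half",
    max_num_seqs=1,
    gpu_memory_utilization=0.9,
)
```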

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

HuChundong avatar Apr 29 '25 01:04 HuChundong

First, this model class does not define packed_modules_mapping; second, vLLM's MoE layers do not support BNB quantization, so this model cannot be supported at the moment.

jeejeelee avatar Apr 29 '25 11:04 jeejeelee
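For context, the attribute mentioned above is a class-level dict that tells vLLM's quantization-aware loaders how each fused module maps back to the per-projection weights in the checkpoint. A sketch of the pattern, borrowed from vLLM's dense model classes such as Qwen2; whether these exact entries are right for the MoE class is an assumption:

```python
from torch import nn

class Qwen3MoeForCausalLM(nn.Module):
    # Hypothetical mapping following the convention of vLLM's dense models:
    # fused module name -> the checkpoint projections it packs together.
    # BNB/GPTQ/AWQ loaders consult this to remap quantized or ignored
    # modules; at the time of this issue the Qwen3 MoE class lacked it.
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }
```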

Same issue with the GPTQ versions of Qwen3-30B-A3B.

benjamin-marie avatar Apr 29 '25 21:04 benjamin-marie

The AWQ version does not work either.

HuChundong avatar Apr 30 '25 16:04 HuChundong

+1 AWQ

seasoncool avatar May 01 '25 11:05 seasoncool

How do we run Qwen3 MoE quantized, then?

DaBossCoda avatar May 01 '25 14:05 DaBossCoda

+1 AWQ

rascazzione avatar May 04 '25 11:05 rascazzione

+1 bnb

zcfrank1st avatar May 07 '25 01:05 zcfrank1st

+1 AWQ

HelloCard avatar May 09 '25 13:05 HelloCard

?

DaBossCoda avatar May 12 '25 05:05 DaBossCoda

+1 gptq

anunknowperson avatar Aug 08 '25 10:08 anunknowperson