
[Bug]: sm75 cannot serve Qwen3 BNB 4-bit model

Open HuChundong opened this issue 7 months ago • 3 comments

Your current environment

Docker image: vllm/vllm-openai:v0.8.5

```
vllm-openai-1 | (VllmWorkerProcess pid=149) WARNING 04-28 18:00:58 [utils.py:168] The model class Qwen3MoeForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules
vllm-openai-1 | WARNING 04-28 18:00:58 [utils.py:168] The model class Qwen3MoeForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method load_model.
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238] Traceback (most recent call last):
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]     output = run_method(worker, method, args, kwargs)
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 203, in load_model
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]     self.model_runner.load_model()
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1111, in load_model
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]     self.model = get_model(vllm_config=self.vllm_config)
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
vllm-openai-1 | (VllmWorkerProcess pid=149) ERROR 04-28 18:00:58 [multiproc_worker_utils.py:238]     return loader.load_model(vllm_config=vllm_config)
```
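Since the title points at sm75, it may be worth confirming the compute capability of the GPUs in use. A minimal sketch with PyTorch (sm75 corresponds to Turing cards such as the T4 or RTX 20-series):

```python
import torch

# Print the CUDA compute capability of each visible GPU.
# An sm75 (Turing) card reports as (7, 5).
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm{major}{minor}")
```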

🐛 Describe the bug

```yaml
vllm-openai:
  runtime: nvidia
  restart: always
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['2', '3']
            capabilities: [gpu]
  volumes:
    - ~/.cache/huggingface:/root/.cache/huggingface
    - /home/hucd/models:/models
  environment:
    - HUGGING_FACE_HUB_TOKEN=
    - CUDA_VISIBLE_DEVICES=0,1
  ports:
    - 8001:8000
  ipc: host
  image: vllm/vllm-openai:v0.8.5
  command: --model /models/Qwen3-30B-A3B-bnb-4bit --served-model-name qwen3-a3b --tensor_parallel_size 2 --max_model_len 8192 --dtype half --max_num_seqs 1 --gpu_memory_utilization 0.9 --enable-reasoning --reasoning-parser deepseek_r1
```
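The same load failure should also be reproducible without Docker through vLLM's offline `LLM` API; a minimal sketch, assuming vllm 0.8.5 is installed locally and the checkpoint sits at the same path as above:

```python
from vllm import LLM

# Minimal repro sketch: constructing the engine triggers load_model,
# which is where the traceback above is raised for this checkpoint.
llm = LLM(
    model="/models/Qwen3-30B-A3B-bnb-4bit",
    tensor_parallel_size=2,
    max_model_len=8192,
    dtype="half",
    max_num_seqs=1,
    gpu_memory_utilization=0.9,
)
```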

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

HuChundong avatar Apr 29 '25 01:04 HuChundong

First, this model class does not define packed_modules_mapping; second, vLLM's MoE layers do not support BNB quantization, so this model cannot be supported at the moment.

jeejeelee avatar Apr 29 '25 11:04 jeejeelee
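For context, the attribute mentioned above is a class-level dict that tells vLLM's quantization-aware loaders how each fused module maps back to the per-projection weights in the checkpoint. A sketch of the pattern, borrowed from vLLM's dense model classes such as Qwen2; whether these exact entries are right for the MoE class is an assumption:

```python
from torch import nn

class Qwen3MoeForCausalLM(nn.Module):
    # Hypothetical mapping following the convention of vLLM's dense models:
    # fused module name -> the checkpoint projections it packs together.
    # BNB/GPTQ/AWQ loaders consult this to remap quantized or ignored
    # modules; at the time of this issue the Qwen3 MoE class lacked it.
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "gate_up_proj": ["gate_proj", "up_proj"],
    }
```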

Same issue with the GPTQ versions of Qwen3-30B-A3B.

benjamin-marie avatar Apr 29 '25 21:04 benjamin-marie

The AWQ version does not work either.

HuChundong avatar Apr 30 '25 16:04 HuChundong

+1 AWQ

seasoncool avatar May 01 '25 11:05 seasoncool

How do we run Qwen3 MoE quantized, then?

DaBossCoda avatar May 01 '25 14:05 DaBossCoda

+1 AWQ

rascazzione avatar May 04 '25 11:05 rascazzione

+1 bnb

zcfrank1st avatar May 07 '25 01:05 zcfrank1st

+1 AWQ

HelloCard avatar May 09 '25 13:05 HelloCard

?

DaBossCoda avatar May 12 '25 05:05 DaBossCoda

+1 gptq

anunknowperson avatar Aug 08 '25 10:08 anunknowperson