[Usage] Qwen3 Usage Guide
vLLM v0.8.4 and higher natively supports all Qwen3 and Qwen3MoE models. Example command:

vllm serve Qwen/... --enable-reasoning --reasoning-parser deepseek_r1

- All models should work with the command above. You can test the reasoning parser with this example script: https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py (a minimal client-side sketch also follows these notes).
- Some MoE models might not be divisible by TP 8. Either lower your TP size or use --enable-expert-parallel.
- If you are seeing the following error when running fp8 dense models, you are running on vLLM v0.8.4. Please upgrade to v0.8.5:
File ".../vllm/model_executor/parameter.py", line 149, in load_qkv_weight
param_data = param_data.narrow(self.output_dim, shard_offset,
IndexError: start out of range (expected to be in range of [-18, 18], but got 2048)
- If you are seeing the following error when running MoE models with fp8, your tensor parallel degree is too high and the weights are not evenly divisible. Consider --tensor-parallel-size 4 or --tensor-parallel-size 8 --enable-expert-parallel:
File ".../vllm/vllm/model_executor/layers/quantization/fp8.py", line 477, in create_weights
raise ValueError(
ValueError: The output_size of gate's and up's weight = 192 is not divisible by weight quantization block_n = 128.
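For a quick end-to-end check of the reasoning parser, here is a minimal client-side sketch (not the linked example script itself). It assumes a server started as above is reachable on localhost:8000 and that "Qwen/Qwen3-8B" is a placeholder for whatever model you serve:

```python
# Minimal sketch, assuming a server started with
# --enable-reasoning --reasoning-parser deepseek_r1 on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # placeholder: replace with the model you serve
    messages=[{"role": "user", "content": "What is 9 * 12?"}],
)

message = completion.choices[0].message
# With the reasoning parser enabled, the chain of thought is returned
# separately from the final answer.
print("reasoning:", message.reasoning_content)
print("answer:", message.content)
```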
How to use MCP with Qwen3?
Any plan for speculative decoding?
How can vLLM's launch arguments support enable_thinking=True?
See https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes
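For reference, the per-request switch described on that page can be exercised from the OpenAI Python client by passing chat_template_kwargs via extra_body. A minimal sketch, assuming a vLLM OpenAI-compatible server on localhost:8000 whose served model name is "Qwen3-32B" (placeholder):

```python
# Minimal sketch; server address and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3-32B",
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    # Forwarded to the chat template; switches Qwen3 to non-thinking mode
    # for this request only.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```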
Consider --tensor-parallel-size 4 or --tensor-parallel-size 8 --enable-expert-parallel.
I am running Qwen3-30B-A3B-FP8 with two A10 GPUs. tp=2 is enough to load the model; does vLLM support tp=2 in this case?
How can I disable reasoning in generative models, i.e. using LLM.chat?
So if I upgrade my vLLM version from 0.8.4 to 0.8.5, I don't need to make this fix?
Yes. Refer to the release notes of 0.8.5 (top line):
Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (https://github.com/vllm-project/vllm/pull/17318) and adds tuned MoE configs (https://github.com/vllm-project/vllm/pull/17328).
linear.py fixes already existed in #17318
I often fail to follow up on relevant PRs in a timely manner. Thanks for your answer.
How can vLLM's launch arguments support enable_thinking=True?
See https://qwen.readthedocs.io/en/latest/deployment/vllm.html#thinking-non-thinking-modes
The official documentation only describes how to turn off thinking mode when the API is called; it doesn't explain how to turn off thinking mode as soon as vLLM starts. I tried changing the generation_config.json file to turn off thinking, but it didn't work and the model is still in thinking mode. I also tried adding "chat_template_kwargs": {"enable_thinking": false} via the --override-generation-config parameter, but I don't know the correct usage of this parameter and it keeps giving me errors. Here are my generation_config.json and docker startup command:
generation_config.json:
{
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"temperature": 0.7,
"top_p": 0.8,
"top_k": 20,
"max_tokens": 8192,
"presence_penalty": 1.5,
"chat_template_kwargs": {"enable_thinking": false},
"transformers_version": "4.51.0"
}
docker run
docker run -d --name Qwen --runtime nvidia --gpus '"device=0,1,2,3"' \
-v /home/models/Qwen3-32B:/root/.cache/modelscope/hub/Qwen3-32B \
-p 18081:8081 \
--ipc=host \
vllm/vllm-openai:v0.8.5 \
--model /root/.cache/modelscope/hub/Qwen3-32B \
--served-model-name Qwen3-32B \
--enable-auto-tool-choice --tool-call-parser hermes \
--chat-template examples/tool_chat_template_hermes.jinja \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 4 \
--port 8081
ERROR:
--override-generation-config "{'temperature': 0.7,'top_p': 0.8,'top_k': 20,'max_tokens': 8192,'presence_penalty': 1.5,'chat_template_kwargs': {'enable_thinking': false}}"
api_server.py: error: argument --override-generation-config: invalid loads value: "{'temperature': 0.7,'top_p': 0.8,'top_k': 20,'max_tokens': 8192,'presence_penalty': 1.5,'chat_template_kwargs': {'enable_thinking': false}}"
How can I disable reasoning in generative models, i.e. using LLM.chat?
I have opened #17356 to support this, can you try it?
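A hedged sketch of what usage might look like if the interface proposed in #17356 is available in your build; the chat_template_kwargs keyword and the model name are assumptions, so check the PR for the exact API:

```python
# Assumes the chat_template_kwargs argument from #17356 exists in your
# vLLM build; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

outputs = llm.chat(
    [{"role": "user", "content": "What is the capital of France?"}],
    params,
    chat_template_kwargs={"enable_thinking": False},  # skip the <think> block
)
print(outputs[0].outputs[0].text)
```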
How to use tools?
Qwen3 supports MCP. Add:
--enable-reasoning --reasoning-parser deepseek_r1 \
--enable-auto-tool-choice --tool-call-parser hermes \
I'm using the 0.8.5 Docker image to run inference:
docker run --runtime nvidia --gpus all \
-d \
-v /root/models:/models \
-p 5536:8000 \
--env "HF_HUB_OFFLINE=1" \
--ipc=host \
vllm/vllm-openai:latest \
--model /models/Qwen/Qwen3-235B-A22B-FP8 \
--tokenizer /models/Qwen/Qwen3-235B-A22B-FP8 \
--generation-config /models/Qwen/Qwen3-235B-A22B-FP8 \
--served_model_name Qwen3-235B-A22B-FP8 \
--gpu_memory_utilization 0.9 \
--enable-reasoning --reasoning-parser deepseek_r1 \
--enable-auto-tool-choice --tool-call-parser hermes \
--host 0.0.0.0 \
--port 8000 \
--enable-expert-parallel \
--tensor-parallel-size 8
GPUs: 4090 48G x 8
Am I doing something wrong, or is support not released yet? I'm using vLLM 0.8.6dev, an RTX 5090, Torch 2.7, cu128, and bitsandbytes 0.45.5.
"vllm", "serve", "unsloth/Qwen3-30B-A3B-bnb-4bit",
"--max-model-len", "2048",
"--enable-reasoning",
"--reasoning-parser", "deepseek_r1",
"--download-dir", "./models",
"--gpu-memory-utilization", "0.7",
"--max-num-seqs", "5",
Error:
.../vllm/model_executor/layers/fused_moe/layer.py", line 499, in __init__
assert self.quant_method is not None
When I use vLLM as a Python library, how can I switch Qwen to no-thinking mode?
See #17356
I was using the latest version 0.8.5 of vLLM and running Qwen3-14B Q5_K_M (GGUF). An error was reported. What's the problem? Does vLLM currently not support the GGUF format of Qwen3?
INFO 04-30 00:22:41 [__init__.py:239] Automatically detected platform cuda. INFO 04-30 00:22:43 [api_server.py:1043] vLLM API server version 0.8.5 INFO 04-30 00:22:43 [api_server.py:1044] args: Namespace(host='wslkali', port=12345, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/home/kali/models/Qwen3-14B-GGUF/Qwen3-14B-Q5_K_M.gguf', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=4096, guided_decoding_backend='auto', reasoning_parser='deepseek_r1', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=1.0, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=['Qwen3-14B'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=True, disable_cascade_attn=False, disable_log_requests=False, 
max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False) Traceback (most recent call last): File "/root/miniconda3/envs/demo/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/root/miniconda3/envs/demo/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1130, in <module> uvloop.run(run_server(args)) File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server async with build_async_engine_client(args) as engine_client: File "/root/miniconda3/envs/demo/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/root/miniconda3/envs/demo/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in build_async_engine_client_from_engine_args vllm_config = engine_args.create_engine_config(usage_context=usage_context) File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1099, in create_engine_config model_config = self.create_model_config() File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 987, in create_model_config return ModelConfig( File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/vllm/config.py", line 451, in __init__ hf_config = get_config(self.hf_config_path or self.model, File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 303, in get_config config_dict, _ = PretrainedConfig.get_config_dict( File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/transformers/configuration_utils.py", line 590, in get_config_dict config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs) File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/transformers/configuration_utils.py", line 681, in _get_config_dict config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"] File "/root/miniconda3/envs/demo/lib/python3.10/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 401, in load_gguf_checkpoint raise ValueError(f"GGUF model with architecture {architecture} is not supported yet.") ValueError: GGUF model with architecture qwen3 is not supported yet.
Do we have a method yet to do decoding similar to what Qwen does in their demo via a "reasoning budget"? E.g. injecting /think after xyz tokens.
How to build 0.8.5 with CUDA 11.7?
My environment: Ubuntu 22.04, vLLM v0.8.5, PyTorch 2.6.0+cu124, 4x H20 96GB.
What launch arguments should I use to run Qwen3-235B-A22B? Running it with the default vllm serve ./Qwen3-235B-A22B --tensor-parallel-size 4 gives OOM (and the --gpu-memory-utilization parameter doesn't solve it). With default parameters it runs successfully on 8x H20, consuming 440GB of VRAM in total. Does this mean it can't run on 4x H20?
@Silencezjl Two options:
- Use Qwen3-235B-A22B-FP8, similar to my 4090 48GB x 8 setup. Parameters:
docker run --runtime nvidia --gpus all \
-d \
-v /root/models:/models \
-p 5536:8000 \
--env "HF_HUB_OFFLINE=1" \
--ipc=host \
vllm/vllm-openai:latest \
--model /models/Qwen/Qwen3-235B-A22B-FP8 \
--tokenizer /models/Qwen/Qwen3-235B-A22B-FP8 \
--generation-config /models/Qwen/Qwen3-235B-A22B-FP8 \
--served_model_name Qwen3-235B-A22B-FP8 \
--gpu_memory_utilization 0.9 \
--enable-reasoning --reasoning-parser deepseek_r1 \
--enable-auto-tool-choice --tool-call-parser hermes \
--host 0.0.0.0 \
--port 8000 \
--enable-expert-parallel \
--tensor-parallel-size 8
Change --tensor-parallel-size 8 to 4.
- Use the --cpu-offload-gb parameter to offload part of each GPU's share of the model weights from VRAM to CPU memory. Performance is limited by PCIe speed and memory bandwidth, because the offloaded weights are copied to GPU memory on every forward pass.
vLLM: v0.8.5. Model: Qwen/Qwen3-30B-A3B. Hardware: 4x A10, 96GB VRAM.
Gives OOM even if I set max-model-len to 1024 with max-num-seqs=1.
Works with enforce-eager, giving 20 tokens per second.
Logs:
qwen3-1 | INFO 04-29 00:57:42 [__init__.py:239] Automatically detected platform cuda.
qwen3-1 | INFO 04-29 00:57:50 [api_server.py:1043] vLLM API server version 0.8.5
qwen3-1 | INFO 04-29 00:57:50 [api_server.py:1044] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='sk-secret', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen3-30B-A3B', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=8192, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=64, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
qwen3-1 | INFO 04-29 00:58:00 [config.py:717] This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
qwen3-1 | INFO 04-29 00:58:01 [config.py:1770] Defaulting to use mp for distributed inference
qwen3-1 | INFO 04-29 00:58:01 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
qwen3-1 | INFO 04-29 00:58:07 [__init__.py:239] Automatically detected platform cuda.
qwen3-1 | INFO 04-29 00:58:10 [core.py:58] Initializing a V1 LLM engine (v0.8.5) with config: model='Qwen/Qwen3-30B-A3B', speculative_config=None, tokenizer='Qwen/Qwen3-30B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen3-30B-A3B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
qwen3-1 | WARNING 04-29 00:58:10 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 24 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
qwen3-1 | INFO 04-29 00:58:10 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 10485760, 10, 'psm_00050c59'), local_subscribe_addr='ipc:///tmp/48ef9229-a171-45ae-9528-eff23807c2cd', remote_subscribe_addr=None, remote_addr_ipv6=False)
qwen3-1 | INFO 04-29 00:58:14 [__init__.py:239] Automatically detected platform cuda.
qwen3-1 | INFO 04-29 00:58:14 [__init__.py:239] Automatically detected platform cuda.
qwen3-1 | INFO 04-29 00:58:14 [__init__.py:239] Automatically detected platform cuda.
qwen3-1 | INFO 04-29 00:58:14 [__init__.py:239] Automatically detected platform cuda.
qwen3-1 | WARNING 04-29 00:58:19 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fa1d4edf740>
qwen3-1 | WARNING 04-29 00:58:19 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fed8a34d8e0>
qwen3-1 | WARNING 04-29 00:58:19 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fef3bf9f470>
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 00:58:19 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e2d147db'), local_subscribe_addr='ipc:///tmp/00d5e6a3-8ea8-46e5-a707-3e899932b09c', remote_subscribe_addr=None, remote_addr_ipv6=False)
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 00:58:19 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_7bfd0441'), local_subscribe_addr='ipc:///tmp/ce06e2d4-7ad0-4c0c-8aac-495426086826', remote_subscribe_addr=None, remote_addr_ipv6=False)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 00:58:19 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_58f60516'), local_subscribe_addr='ipc:///tmp/370530fd-a341-43cd-8f1b-a5d04c639836', remote_subscribe_addr=None, remote_addr_ipv6=False)
qwen3-1 | WARNING 04-29 00:58:19 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f72c0ff0230>
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 00:58:19 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_a5ec45d6'), local_subscribe_addr='ipc:///tmp/dccddf7d-b44d-4f23-9a23-51ce8ef80739', remote_subscribe_addr=None, remote_addr_ipv6=False)
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 00:58:20 [utils.py:1055] Found nccl from library libnccl.so.2
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 00:58:20 [pynccl.py:69] vLLM is using nccl==2.21.5
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 00:58:20 [utils.py:1055] Found nccl from library libnccl.so.2
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 00:58:20 [pynccl.py:69] vLLM is using nccl==2.21.5
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 00:58:20 [utils.py:1055] Found nccl from library libnccl.so.2
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 00:58:20 [pynccl.py:69] vLLM is using nccl==2.21.5
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 00:58:20 [utils.py:1055] Found nccl from library libnccl.so.2
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 00:58:20 [pynccl.py:69] vLLM is using nccl==2.21.5
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m WARNING 04-29 00:58:21 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m WARNING 04-29 00:58:21 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m WARNING 04-29 00:58:21 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m WARNING 04-29 00:58:21 [custom_all_reduce.py:136] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 00:58:21 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_d0f9f695'), local_subscribe_addr='ipc:///tmp/22358ed2-dec6-4e1b-812d-bb6df0196376', remote_subscribe_addr=None, remote_addr_ipv6=False)
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 00:58:21 [parallel_state.py:1004] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 00:58:21 [cuda.py:221] Using Flash Attention backend on V1 engine.
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 00:58:21 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 00:58:21 [parallel_state.py:1004] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 00:58:21 [parallel_state.py:1004] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 00:58:21 [cuda.py:221] Using Flash Attention backend on V1 engine.
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 00:58:21 [cuda.py:221] Using Flash Attention backend on V1 engine.
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 00:58:21 [parallel_state.py:1004] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 00:58:21 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 00:58:21 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 00:58:21 [cuda.py:221] Using Flash Attention backend on V1 engine.
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 00:58:21 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 00:58:21 [gpu_model_runner.py:1329] Starting to load model Qwen/Qwen3-30B-A3B...
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 00:58:21 [gpu_model_runner.py:1329] Starting to load model Qwen/Qwen3-30B-A3B...
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 00:58:21 [gpu_model_runner.py:1329] Starting to load model Qwen/Qwen3-30B-A3B...
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 00:58:21 [gpu_model_runner.py:1329] Starting to load model Qwen/Qwen3-30B-A3B...
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 00:58:22 [weight_utils.py:265] Using model weights format ['*.safetensors', '*.bin']
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 00:58:22 [weight_utils.py:265] Using model weights format ['*.safetensors', '*.bin']
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 00:58:22 [weight_utils.py:265] Using model weights format ['*.safetensors', '*.bin']
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 00:58:22 [weight_utils.py:265] Using model weights format ['*.safetensors', '*.bin']
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 0% Completed | 0/16 [00:00<?, ?it/s]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 6% Completed | 1/16 [00:37<09:23, 37.54s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 12% Completed | 2/16 [00:59<06:33, 28.13s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 19% Completed | 3/16 [01:37<07:08, 32.97s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 25% Completed | 4/16 [02:15<06:56, 34.71s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 31% Completed | 5/16 [02:54<06:38, 36.22s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 38% Completed | 6/16 [03:32<06:08, 36.80s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 44% Completed | 7/16 [04:09<05:33, 37.05s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 50% Completed | 8/16 [04:48<05:00, 37.55s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 56% Completed | 9/16 [05:26<04:24, 37.72s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 62% Completed | 10/16 [06:04<03:47, 38.00s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 69% Completed | 11/16 [06:43<03:10, 38.16s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 75% Completed | 12/16 [07:21<02:32, 38.21s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 81% Completed | 13/16 [07:59<01:54, 38.03s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 88% Completed | 14/16 [08:06<00:57, 28.57s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 94% Completed | 15/16 [08:44<00:31, 31.60s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 100% Completed | 16/16 [09:18<00:00, 32.19s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
Loading safetensors checkpoint shards: 100% Completed | 16/16 [09:18<00:00, 34.89s/it]
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 01:07:40 [loader.py:458] Loading weights took 558.37 seconds
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 01:07:40 [loader.py:458] Loading weights took 558.63 seconds
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 01:07:40 [loader.py:458] Loading weights took 558.63 seconds
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 01:07:40 [loader.py:458] Loading weights took 558.57 seconds
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 01:07:41 [gpu_model_runner.py:1347] Model loading took 14.2474 GiB and 558.868858 seconds
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 01:07:41 [gpu_model_runner.py:1347] Model loading took 14.2474 GiB and 559.045517 seconds
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 01:07:41 [gpu_model_runner.py:1347] Model loading took 14.2474 GiB and 559.043662 seconds
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 01:07:41 [gpu_model_runner.py:1347] Model loading took 14.2474 GiB and 559.041898 seconds
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 01:08:02 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/5477567bed/rank_1_0 for vLLM's torch.compile
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 01:08:02 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/5477567bed/rank_2_0 for vLLM's torch.compile
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 01:08:02 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/5477567bed/rank_3_0 for vLLM's torch.compile
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 01:08:02 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/5477567bed/rank_0_0 for vLLM's torch.compile
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 01:08:02 [backends.py:430] Dynamo bytecode transform time: 21.06 s
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 01:08:02 [backends.py:430] Dynamo bytecode transform time: 21.05 s
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 01:08:02 [backends.py:430] Dynamo bytecode transform time: 21.06 s
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 01:08:02 [backends.py:430] Dynamo bytecode transform time: 21.05 s
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 01:08:11 [backends.py:136] Cache the graph of shape None for later use
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 01:08:11 [backends.py:136] Cache the graph of shape None for later use
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 01:08:11 [backends.py:136] Cache the graph of shape None for later use
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 01:08:11 [backends.py:136] Cache the graph of shape None for later use
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 01:09:00 [backends.py:148] Compiling a graph for general shape takes 56.80 s
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 01:09:01 [backends.py:148] Compiling a graph for general shape takes 57.65 s
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 01:09:02 [backends.py:148] Compiling a graph for general shape takes 58.92 s
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 01:09:02 [backends.py:148] Compiling a graph for general shape takes 59.09 s
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m WARNING 04-29 01:09:06 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=192,device_name=NVIDIA_A10.json
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m WARNING 04-29 01:09:06 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=192,device_name=NVIDIA_A10.json
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m WARNING 04-29 01:09:06 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=192,device_name=NVIDIA_A10.json
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m WARNING 04-29 01:09:06 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=192,device_name=NVIDIA_A10.json
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m INFO 04-29 01:09:47 [monitor.py:33] torch.compile takes 80.14 s in total
qwen3-1 | [1;36m(VllmWorker rank=2 pid=107)[0;0m INFO 04-29 01:09:47 [monitor.py:33] torch.compile takes 77.85 s in total
qwen3-1 | [1;36m(VllmWorker rank=0 pid=105)[0;0m INFO 04-29 01:09:47 [monitor.py:33] torch.compile takes 78.70 s in total
qwen3-1 | [1;36m(VllmWorker rank=3 pid=108)[0;0m INFO 04-29 01:09:47 [monitor.py:33] torch.compile takes 79.98 s in total
qwen3-1 | INFO 04-29 01:09:49 [kv_cache_utils.py:634] GPU KV cache size: 198,784 tokens
qwen3-1 | INFO 04-29 01:09:49 [kv_cache_utils.py:637] Maximum concurrency for 8,192 tokens per request: 24.27x
qwen3-1 | INFO 04-29 01:09:49 [kv_cache_utils.py:634] GPU KV cache size: 198,784 tokens
qwen3-1 | INFO 04-29 01:09:49 [kv_cache_utils.py:637] Maximum concurrency for 8,192 tokens per request: 24.27x
qwen3-1 | INFO 04-29 01:09:49 [kv_cache_utils.py:634] GPU KV cache size: 198,784 tokens
qwen3-1 | INFO 04-29 01:09:49 [kv_cache_utils.py:637] Maximum concurrency for 8,192 tokens per request: 24.27x
qwen3-1 | INFO 04-29 01:09:49 [kv_cache_utils.py:634] GPU KV cache size: 198,784 tokens
qwen3-1 | INFO 04-29 01:09:49 [kv_cache_utils.py:637] Maximum concurrency for 8,192 tokens per request: 24.27x
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] WorkerProc hit an exception.
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] Traceback (most recent call last):
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 465, in worker_busy_loop
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] output = func(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 242, in compile_or_warm_up_model
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] self.model_runner.capture_model()
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1678, in capture_model
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] self._dummy_run(num_tokens)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return func(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1497, in _dummy_run
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] outputs = model(
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return self._call_impl(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return forward_call(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 509, in forward
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] hidden_states = self.model(input_ids, positions, intermediate_tensors,
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 245, in __call__
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] model_output = self.forward(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 350, in forward
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] def forward(
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return self._call_impl(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return forward_call(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return fn(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 822, in call_wrapped
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return self._wrapped_call(self, *args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] raise e
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 387, in __call__
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return self._call_impl(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return forward_call(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "<eval_with_key>.98", line 449, in forward
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] submod_2 = self.submod_2(getitem_3, s0, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_, getitem_4, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_gate_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_experts_parameters_w13_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_experts_parameters_w2_weight_, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_k_norm_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); getitem_3 = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_ = getitem_4 = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_gate_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_experts_parameters_w13_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_experts_parameters_w2_weight_ = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_k_norm_parameters_weight_ = None
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 653, in __call__
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return entry.runnable(*args)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return fn(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1184, in forward
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return compiled_fn(full_args)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 323, in runtime_wrapper
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] all_outs = call_func_at_runtime_with_args(
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] out = normalize_as_list(f(args))
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 672, in inner_fn
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] outs = compiled_fn(args)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 490, in wrapper
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return compiled_fn(runtime_args)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 466, in __call__
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return self.current_callable(inputs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2128, in run
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return model(new_inputs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/root/.cache/vllm/torch_compile_cache/5477567bed/rank_1_0/inductor_cache/5x/c5xqlf36yrrwuc3hzszl5bgdec6wpaa5dackr5bj6hs27gla47b3.py", line 634, in call
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] torch.ops.vllm.inplace_fused_experts.default(buf7, arg6_1, arg7_1, buf15, buf1, 'silu', False, False, False, False, False, False, 128)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 723, in __call__
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] return self._op(*args, **kwargs)
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 986, in inplace_fused_experts
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] fused_experts_impl(hidden_states, w1, w2, topk_weights, topk_ids, True,
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 1295, in fused_experts_impl
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] cache13 = torch.empty(M * top_k_num * max(N, K),
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
qwen3-1 | [1;36m(VllmWorker rank=1 pid=106)[0;0m ERROR 04-29 01:10:12 [multiproc_executor.py:470] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 21.99 GiB of which 5.44 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 18.91 GiB is allocated by PyTorch, with 31.88 MiB allocated in private pools (e.g., CUDA Graphs), and 128.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
How can I disable reasoning in generative models, i.e. using LLM.chat?
I have opened #17356 to support this, can you try it?
How to disable thinking in the generate API?
The thinking switch is based on the chat template. So if you must use LLM.generate instead of LLM.chat, you can call tokenizer.apply_chat_template manually, just like in the HF repo, before passing the prompt to LLM.generate.
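A minimal sketch of that approach; the model name is a placeholder:

```python
# Render the prompt yourself with enable_thinking=False, then hand the raw
# string to LLM.generate. The model name is a placeholder.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

messages = [{"role": "user", "content": "Write a haiku about GPUs."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # thinking is switched off in the template itself
)

outputs = llm.generate([prompt], SamplingParams(temperature=0.7, max_tokens=256))
print(outputs[0].outputs[0].text)
```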
@GamePP Thanks for the reply. FP8 indeed runs successfully.
I have an issue with the thinking budget control in qwen3.
I noticed that Alibaba Cloud's API has this parameter called "thinking_budget" but I can't find anything like that in the open-source docs. When I try adding this parameter to my code, it doesn't seem to do anything. Does the open-source model have this parameter?
Could you share the token eval speed for the Qwen3-235B-A22B-FP8 setup on the 8x 4090 48G above?
Starting the quantized model via
vllm serve Qwen/Qwen3-235B-A22B-FP8 --download-dir /app/data/models --tensor-parallel-size 4 --enable-auto-tool-choice --tool-call-parser hermes
it reasons even though the --enable-reasoning flag is not set.
- Is that expected?
- Is there an easy way to use
tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
with OpenAIServingChat's main generation method? @DarkLight1337
You can create a ChatCompletionRequest object with chat_template_kwargs={"enable_thinking": False} and pass it to create_chat_completion. Then we will handle the apply_chat_template call for you.
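A rough sketch only, since the internal API differs between vLLM versions; it assumes serving_chat is an already-initialized OpenAIServingChat (as inside the API server) and that this version accepts the call without a raw Starlette request:

```python
# Rough sketch; `serving_chat` is assumed to be an existing OpenAIServingChat
# instance. Adjust to your vLLM version's create_chat_completion signature.
from vllm.entrypoints.openai.protocol import ChatCompletionRequest

async def non_thinking_completion(serving_chat):
    request = ChatCompletionRequest(
        model="Qwen/Qwen3-235B-A22B-FP8",
        messages=[{"role": "user", "content": "Hello"}],
        chat_template_kwargs={"enable_thinking": False},
    )
    # The server applies the chat template (including enable_thinking)
    # before generation.
    return await serving_chat.create_chat_completion(request)
```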
When I use this command: vllm serve Qwen/Qwen3-30B-A3B-FP8 --tensor-parallel-size 2 --enable-reasoning --reasoning-parser deepseek_r1 --host 0.0.0.0 --port 6060
I get this error:
[multiproc_executor.py:470] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
I am running it on two A4500
(If you want the full stack trace you can ask but it is quite long)
Can anyone share a 4bit quant of Qwen3-30B-A3B (MOE) that works with vLLM? I tried unsloth's Q4_K_M GGUF and it did not work.