[Bug]: Qwen-2-audio requires <|AUDIO|> tag in prompt
Your current environment
The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14) 12.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.36
Python version: 3.12.6 (main, Sep 27 2024, 06:10:12) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-4.4.0-x86_64-with-glibc2.36
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L40S
Nvidia driver version: 570.86.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 5
On-line CPU(s) list: 0-4
Vendor ID: AuthenticAMD
Model name: unknown
CPU family: 175
Model: 1
Thread(s) per core: 1
Core(s) per socket: 5
Socket(s): 1
Stepping: unknown
BogoMIPS: 3599.32
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clwb sha_ni xsaveopt xsavec xgetbv1 vaes vpclmulqdq rdpid
Hypervisor vendor: KVM
Virtualization type: full
Versions of relevant libraries:
[pip3] numpy==2.1.3
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] triton==3.2.0
[conda] Could not collect
🐛 Describe the bug
When prompting Qwen2-Audio with an audio file, the textual prompt must contain the <|AUDIO|> tag; otherwise the server throws a 500 error.
It seems potentially related to this issue, or at least that's the closest one I could find.
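For reference, a minimal sketch of the request shape involved (using the openai client; the base URL, API key, and audio URL are placeholders, and the `audio_url` content part follows vLLM's OpenAI-compatible multimodal extension as I understand it):

```python
# Sketch of the request against the vLLM OpenAI-compatible server.
# Base URL, API key, and audio URL below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2-Audio-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "audio_url", "audio_url": {"url": "https://example.com/clip.wav"}},
                # Only succeeds when <|AUDIO|> is spelled out in the text;
                # dropping the tag makes the server return a 500.
                {"type": "text", "text": "<|AUDIO|>\nDescribe this audio."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```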
Here are the logs when the <|AUDIO|> tag is included in the request:
INFO 04-12 14:07:46 [__init__.py:239] Automatically detected platform cuda.
INFO 04-12 14:07:48 [api_server.py:1034] vLLM API server version 0.8.3
INFO 04-12 14:07:48 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2-Audio-7B-Instruct', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='<REDACT>', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2-Audio-7B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision='0a095220c30b7b31434169c3086508ef3ea5bf0a', code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=16, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, 
dispatch_function=<function ServeSubcommand.cmd at 0x2b737ba3cae0>)
INFO 04-12 14:07:57 [config.py:600] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
WARNING 04-12 14:07:57 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-12 14:07:57 [api_server.py:246] Started engine process with PID 20
INFO 04-12 14:08:02 [__init__.py:239] Automatically detected platform cuda.
INFO 04-12 14:08:03 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.3) with config: model='Qwen/Qwen2-Audio-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-Audio-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=0a095220c30b7b31434169c3086508ef3ea5bf0a, override_neuron_config=None, tokenizer_revision=0a095220c30b7b31434169c3086508ef3ea5bf0a, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2-Audio-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
INFO 04-12 14:08:05 [cuda.py:292] Using Flash Attention backend.
INFO 04-12 14:08:06 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-12 14:08:06 [model_runner.py:1110] Starting to load model Qwen/Qwen2-Audio-7B-Instruct...
INFO 04-12 14:08:06 [config.py:3334] cudagraph sizes specified by model runner [] is overridden by config []
INFO 04-12 14:08:06 [weight_utils.py:265] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:01<00:05, 1.34s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:02<00:04, 1.42s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:04<00:02, 1.43s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:05<00:01, 1.39s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:06<00:00, 1.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:06<00:00, 1.21s/it]
INFO 04-12 14:08:12 [loader.py:447] Loading weights took 6.07 seconds
INFO 04-12 14:08:13 [model_runner.py:1146] Model loading took 15.6455 GiB and 6.682731 seconds
/usr/local/lib/python3.12/site-packages/vllm/inputs/registry.py:167: FutureWarning: `audios` is deprecated and will be removed in version 4.54.0 for `Qwen2AudioProcessor.__call__`. Use `audio` instead.
return hf_processor(**data, **merged_kwargs, return_tensors="pt")
INFO 04-12 14:08:17 [worker.py:267] Memory profiling takes 4.10 seconds
INFO 04-12 14:08:17 [worker.py:267] the current vLLM instance can use total_gpu_memory (47.38GiB) x gpu_memory_utilization (0.90) = 42.65GiB
INFO 04-12 14:08:17 [worker.py:267] model weights take 15.65GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 0.83GiB; the rest of the memory reserved for KV Cache is 26.08GiB.
INFO 04-12 14:08:17 [executor_base.py:112] # cuda blocks: 3338, # CPU blocks: 512
INFO 04-12 14:08:17 [executor_base.py:117] Maximum concurrency for 8192 tokens per request: 6.52x
INFO 04-12 14:08:20 [llm_engine.py:448] init engine (profile, create kv cache, warmup model) took 7.23 seconds
WARNING 04-12 14:08:20 [config.py:1088] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 04-12 14:08:20 [serving_chat.py:117] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.5}
INFO 04-12 14:08:20 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.5}
INFO 04-12 14:08:20 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-12 14:08:20 [launcher.py:26] Available routes are:
INFO 04-12 14:08:20 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
INFO 04-12 14:08:20 [launcher.py:34] Route: /docs, Methods: GET, HEAD
INFO 04-12 14:08:20 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 04-12 14:08:20 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
INFO 04-12 14:08:20 [launcher.py:34] Route: /health, Methods: GET
INFO 04-12 14:08:20 [launcher.py:34] Route: /load, Methods: GET
INFO 04-12 14:08:20 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-12 14:08:20 [launcher.py:34] Route: /version, Methods: GET
INFO 04-12 14:08:20 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /score, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /invocations, Methods: POST
INFO 04-12 14:08:20 [launcher.py:34] Route: /metrics, Methods: GET
INFO: Started server process [5]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO 04-12 14:08:22 [chat_utils.py:396] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
INFO 04-12 14:08:33 [logger.py:39] Received request chatcmpl-c24dd97d1a504d6d8c507bb0b4037d48: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGiven the following audio, return the fields as received. If one is missing, set it as null<|AUDIO|>\n\n\nUse the following response format{\n name: string or null,\n last_meal: string or null,\n age: int or null,\n currently_sitting: bool or null,\n favorite_color: "green" or "blue" or "red" or "Unknown",\n}<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=42, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=8093, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
/usr/local/lib/python3.12/site-packages/vllm/inputs/registry.py:167: FutureWarning: `audios` is deprecated and will be removed in version 4.54.0 for `Qwen2AudioProcessor.__call__`. Use `audio` instead.
return hf_processor(**data, **merged_kwargs, return_tensors="pt")
INFO 04-12 14:08:36 [engine.py:310] Added request chatcmpl-c24dd97d1a504d6d8c507bb0b4037d48.
INFO 04-12 14:08:36 [metrics.py:488] Avg prompt throughput: 121.0 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.3%, CPU KV cache usage: 0.0%.
INFO: 172.20.7.138:34773 - "POST /v1/chat/completions HTTP/1.1" 200 OK
POST /v1/chat/completions -> 200 OK (duration: 60.9 s, execution: 16.2 s)
INFO 04-12 14:08:46 [metrics.py:488] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.4 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 04-12 14:08:56 [metrics.py:488] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
I'm serving the model with the following params:
cmd = [
    "vllm",
    "serve",
    "--uvicorn-log-level=info",
    MODEL_NAME,
    "--revision",
    MODEL_REVISION,
    "--host",
    "0.0.0.0",
    "--port",
    str(VLLM_PORT),
    "--enforce-eager",
    "--max-num-seqs",
    "16",
    "--max-model-len",
    "8192",
    "--api-key",
    API_KEY,
]
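For completeness, this is roughly how that argument list gets launched (a sketch only; MODEL_NAME, MODEL_REVISION, VLLM_PORT, and API_KEY are defined elsewhere in our deployment code, and the values below are illustrative):

```python
import subprocess

# Illustrative values; the real ones live elsewhere in our deployment code.
MODEL_NAME = "Qwen/Qwen2-Audio-7B-Instruct"
MODEL_REVISION = "0a095220c30b7b31434169c3086508ef3ea5bf0a"
VLLM_PORT = 8000
API_KEY = "changeme"

cmd = [
    "vllm", "serve", "--uvicorn-log-level=info", MODEL_NAME,
    "--revision", MODEL_REVISION,
    "--host", "0.0.0.0", "--port", str(VLLM_PORT),
    "--enforce-eager", "--max-num-seqs", "16",
    "--max-model-len", "8192", "--api-key", API_KEY,
]

# Blocks until the server exits; raises if vLLM fails to start.
subprocess.run(cmd, check=True)
```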
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
cc @fyabc, can you get your team to update the HF repo with the correct chat template from https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_audio/processing_qwen2_audio.py#L211? This workaround doesn't work for us because we call the tokenizer's `apply_chat_template` instead of using the chat template from the processor class.
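To illustrate the mismatch described above, a minimal sketch (assuming a recent transformers release; depending on the version, the processor may pick up its bundled template automatically or need it passed explicitly, and the audio URL is a placeholder):

```python
from transformers import AutoProcessor, AutoTokenizer

model_id = "Qwen/Qwen2-Audio-7B-Instruct"

# One user turn with an audio part plus text, in the structured-content format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "https://example.com/clip.wav"},  # placeholder URL
            {"type": "text", "text": "Describe this audio."},
        ],
    }
]

# The processor-side template (the one linked above in processing_qwen2_audio.py)
# expands the audio part into "Audio 1: <|audio_bos|><|AUDIO|><|audio_eos|>".
processor = AutoProcessor.from_pretrained(model_id)
print(processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

# Code paths that only hold the tokenizer rely on the chat template stored in the HF
# repo's tokenizer config, which does not do that expansion, so the placeholder has
# to be written into the text by hand:
tokenizer = AutoTokenizer.from_pretrained(model_id)
manual = [{"role": "user", "content": "<|AUDIO|>\nDescribe this audio."}]
print(tokenizer.apply_chat_template(manual, tokenize=False, add_generation_prompt=True))
```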
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!