[Model][VLM] Add Qwen2.5-Omni model support (end-to-end full support)

Open fyabc opened this pull request 8 months ago • 4 comments

This draft PR adds support for the Qwen2.5-Omni model (end-to-end full support).

This PR is an extended version of #15130; it adds support for the talker, code2wav, and an OmniLLMEngine class that manages the end-to-end audio generation process. See #15130 for more details about the Qwen2.5-Omni model architecture.
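
For reviewers, here is a minimal conceptual sketch of how the end-to-end flow fits together: the thinker produces text plus hidden states, the talker turns those hidden states into discrete audio codec tokens, and code2wav decodes the tokens into a waveform. Apart from the OmniLLMEngine name, the method names and signatures below are illustrative assumptions, not the actual implementation in this PR.

# Conceptual sketch only; names and signatures other than OmniLLMEngine are assumptions.
import numpy as np

class OmniLLMEngine:
    def __init__(self, thinker, talker, code2wav):
        self.thinker = thinker    # multimodal LLM: prompt -> text + hidden states
        self.talker = talker      # hidden states -> discrete audio codec tokens
        self.code2wav = code2wav  # codec tokens -> waveform samples

    def generate(self, prompt, voice_type="m02"):
        # 1. Thinker: standard vLLM-style multimodal decoding.
        text, hidden_states = self.thinker.generate(prompt)
        # 2. Talker: emit codec tokens conditioned on the thinker's hidden
        #    states and the requested voice.
        codec_tokens = self.talker.generate(hidden_states, voice_type=voice_type)
        # 3. code2wav: vocoder step turning codec tokens into audio samples.
        waveform = self.code2wav.decode(codec_tokens)
        return text, np.asarray(waveform, dtype=np.float32)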

NOTE: Since this PR makes significant changes to vLLM, it is a draft and will not be merged in the short term.

Requirements

This PR requires https://github.com/huggingface/transformers/pull/36752.

pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8

Note: transformers must be installed from source, using that branch (the commit pinned above).
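
As a quick sanity check (illustrative only, not part of this PR), you can verify that the development build is active and that the Qwen2.5-Omni processor resolves:

# Illustrative sanity check; requires network access to fetch the processor config.
import transformers
from transformers import AutoProcessor

# A source install from that branch reports a ".dev0" version string.
print(transformers.__version__)

# This only resolves if the installed branch ships the Qwen2.5-Omni processing code.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
print(type(processor).__name__)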

Example Usage

python examples/offline_inference/qwen2_5_omni/end2end.py --model Qwen/Qwen2.5-Omni-7B --prompt audio-in-video-v2 --enforce-eager --do-wave --voice-type m02 --warmup-voice-type m02

This command prints the text output and generates .wav output files under the current folder.
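
To inspect the generated audio, any standard reader works; a small illustration (the output filename below is a placeholder, since the actual names depend on the prompt):

# Illustrative check of one generated file; the real filename depends on the prompt.
import soundfile as sf

audio, sample_rate = sf.read("output_audio.wav")  # placeholder name
print(f"{audio.shape[0] / sample_rate:.2f} s of audio at {sample_rate} Hz")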

fyabc avatar Apr 09 '25 13:04 fyabc

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Apr 09 '25 13:04 github-actions[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @fyabc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Apr 09 '25 13:04 mergify[bot]

I think we can split this PR further, with the first piece (after the Qwen2.5-Omni thinker-only PR) adding prompt_embeds support to vLLM. For reference, here are some previous/ongoing efforts to add this feature:

  • #6869
  • #11684
  • #15428
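
For context, the goal of those efforts is to let callers pass precomputed embeddings instead of token IDs, which is exactly what a thinker-to-talker handoff needs. A rough sketch of that kind of interface, where the input key, engine flags, and shapes are assumptions rather than any merged vLLM API:

# Sketch only; the prompt_embeds input key and shapes are assumptions.
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Omni-7B")

# Embeddings of shape (num_prompt_tokens, hidden_size), e.g. produced by the
# thinker; the hidden size would come from the model config (3584 is only an example).
prompt_embeds = torch.randn(16, 3584, dtype=torch.bfloat16)

outputs = llm.generate(
    {"prompt_embeds": prompt_embeds},  # assumed input format
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)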

DarkLight1337 avatar Apr 09 '25 14:04 DarkLight1337

Thanks for this contribution! As we discussed offline, we'll be carefully reviewing this PR/design and think about how to enable end-to-end support for models like this with vLLM!

ywang96 avatar Apr 09 '25 19:04 ywang96

Is this fork still usable? After cloning and building it, I got the following error:

root@ubuntu:/workspace# python examples/offline_inference/qwen2_5_omni/end2end.py --model Qwen/Qwen2.5-Omni-7B --prompt audio-in-video-v2 --enforce-eager --do-wave --voice-type m02 --warmup-voice-type m02
INFO 06-01 00:40:02 [__init__.py:239] Automatically detected platform cuda.
You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
/workspace/examples/offline_inference/qwen2_5_omni/end2end.py:258: UserWarning: PySoundFile failed. Trying audioread instead.
  librosa.load(temp_video_file_path, sr=16000)[0])
/opt/venv/lib/python3.11/site-packages/librosa/core/audio.py:184: FutureWarning: librosa.core.audio.__audioread_load
        Deprecated as of librosa version 0.10.0.
        It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)
Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/librosa/core/audio.py", line 176, in load
    y, sr_native = __soundfile_load(path, offset, duration, dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/librosa/core/audio.py", line 209, in __soundfile_load
    context = sf.SoundFile(path)
              ^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/soundfile.py", line 690, in __init__
    self._file = self._open(file, mode_int, closefd)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/soundfile.py", line 1265, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening '/tmp/tmp3_ttt320': Format not recognised.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/examples/offline_inference/qwen2_5_omni/end2end.py", line 677, in <module>
    main()
  File "/workspace/examples/offline_inference/qwen2_5_omni/end2end.py", line 651, in main
    prompt = make_omni_prompt()
             ^^^^^^^^^^^^^^^^^^
  File "/workspace/examples/offline_inference/qwen2_5_omni/end2end.py", line 480, in make_omni_prompt
    prompt = make_audio_in_video_v2_prompt()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/examples/offline_inference/qwen2_5_omni/end2end.py", line 400, in make_audio_in_video_v2_prompt
    prompt = make_inputs_qwen2_omni(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/examples/offline_inference/qwen2_5_omni/end2end.py", line 258, in make_inputs_qwen2_omni
    librosa.load(temp_video_file_path, sr=16000)[0])
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/librosa/core/audio.py", line 184, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/librosa/util/decorators.py", line 63, in __wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/librosa/core/audio.py", line 240, in __audioread_load
    reader = audioread.audio_open(path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/audioread/__init__.py", line 132, in audio_open
    raise NoBackendError()
audioread.exceptions.NoBackendError

majunze2001 avatar Jun 01 '25 00:06 majunze2001

watching ...

liaoweiguo avatar Jun 23 '25 17:06 liaoweiguo

@majunze2001 librosa needs a filename suffix to detect the file format in some cases; add a suffix to your temp file and try again.
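
A minimal sketch of that workaround, with illustrative file names (the real code lives in end2end.py):

# Workaround sketch: give the temp file an extension so librosa can recognise
# the container format from the path. File names here are illustrative.
import tempfile
import librosa

video_bytes = open("input_video.mp4", "rb").read()  # hypothetical source file

with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
    f.write(video_bytes)
    temp_video_file_path = f.name

# Note: decoding audio out of a video container still needs an audioread
# backend such as ffmpeg installed, otherwise NoBackendError is raised.
audio = librosa.load(temp_video_file_path, sr=16000)[0]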

BakerBunker avatar Jul 01 '25 08:07 BakerBunker

looking forward to this feature!

SamitHuang avatar Jul 16 '25 07:07 SamitHuang

Looking forward to this feature! Is it still in progress?

hashiting avatar Nov 03 '25 11:11 hashiting