[Model] Extend Ultravox to accept audio longer than 30s
Currently the Ultravox model input is capped at 30 seconds and any extra audio is truncated (AFAIK). Also, each sample is fed to Whisper individually (without being batched).
This PR supports longer audio by first chunking it into 30-second segments, running the Whisper encoder in batch mode, and then concatenating the resulting embeddings.
TODO:
- [x] processors on HF still need to be updated in tandem with this PR.
- [x] run evaluations with the updated model to verify the changes.
FYI @NickLucche for the usage of Whisper
Thanks for the contrib! What is the chunking logic for tiling the audio? Feel free to link the hf processor PR.
re @NickLucche: Here's the processor link: https://huggingface.co/fixie-ai/ultravox-v0_3-llama-3_2-1b/blob/main/ultravox_processing.py#L209
The logic: each audio is split into 30-second chunks (the last chunk is not padded to 30s, same as before).
Then we flatten and batch everything and run Whisper as if the chunks were separate audios. We use audio_lens to compute an attention_mask for the last chunk of each audio. The final embeddings are then concatenated.
There are other ways we could've done this, but it matches what we do on the Ultravox side for both some fine-tuning that we do and evals. If we end up updating those, I'll update vLLM as well. A rough sketch of the flow is below.
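For illustration only, here is a minimal sketch of that chunk-batch-concatenate flow. It assumes a 16 kHz sample rate, and `whisper_encoder` is a stand-in for the feature extraction + encoder step; the helper names are hypothetical, not the actual processor code:

```python
import torch

SAMPLE_RATE = 16_000              # assumption: Whisper's standard 16 kHz input
CHUNK_SAMPLES = 30 * SAMPLE_RATE  # 30-second chunks

def chunk_audio(audio: torch.Tensor) -> tuple[list[torch.Tensor], list[int]]:
    """Split one waveform into 30s chunks; the last chunk is left unpadded."""
    chunks = list(torch.split(audio, CHUNK_SAMPLES))
    lens = [c.shape[-1] for c in chunks]
    return chunks, lens

def encode_batched(audios: list[torch.Tensor], whisper_encoder) -> list[torch.Tensor]:
    """Flatten chunks from all audios, run the encoder once in batch mode,
    then regroup and concatenate the embeddings per original audio."""
    all_chunks, chunk_counts = [], []
    for audio in audios:
        chunks, _ = chunk_audio(audio)
        all_chunks.extend(chunks)
        chunk_counts.append(len(chunks))

    # Pad chunks to a common length only so they can be stacked into one batch;
    # the real processor derives an attention mask from the true lengths
    # (audio_lens) so the padded tail of the last chunk is ignored.
    max_len = max(c.shape[-1] for c in all_chunks)
    batch = torch.stack(
        [torch.nn.functional.pad(c, (0, max_len - c.shape[-1])) for c in all_chunks])
    features = whisper_encoder(batch)  # one batched forward pass

    # Re-split per audio and concatenate the chunk embeddings along time.
    outputs, start = [], 0
    for n in chunk_counts:
        outputs.append(torch.cat(list(features[start:start + n]), dim=0))
        start += n
    return outputs
```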
Also, note that since we don't pad the last chunk, and since in most cases the audio is shorter than 30s, the number of frames does not match across samples. ~~I didn't see a collator anywhere that I could update. I suspect I'll have to update _process_audio_input further to handle that.~~ Updated _process_audio_input.
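To illustrate how a mask over the valid frames can be derived from the per-chunk lengths, here is a small sketch; `make_attention_mask` is a hypothetical helper and the frame counts are illustrative, not the actual _process_audio_input code:

```python
import torch

def make_attention_mask(num_frames: list[int], max_frames: int) -> torch.Tensor:
    """Build a (batch, max_frames) boolean mask that is True for valid frames.

    num_frames would be derived from audio_lens (samples -> encoder frames);
    only the last chunk of each audio is shorter than max_frames.
    """
    lens = torch.tensor(num_frames)
    positions = torch.arange(max_frames)
    return positions.unsqueeze(0) < lens.unsqueeze(1)

# Example: three chunks where the last one covers only 120 of 1500 frames.
mask = make_attention_mask([1500, 1500, 120], max_frames=1500)
print(mask.shape, mask.sum(dim=1))  # torch.Size([3, 1500]) tensor([1500, 1500, 120])
```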
Ok, I see, so that's naive chunking where you don't account for splitting mid-word, and there's no overlap or prompt carried over from the previous chunk.
This case seems much easier to handle on the vLLM side, given the changes are already in HF. Let's just make sure the batched Whisper forward is accounted for by the initial profiler run to avoid OOM.
Thanks for the comments. This PR is now ready for review.
For reference, I can confirm that the evals have improved.
before (8B model):
| # | eval | subset | model | samples | score | tokens |
|---|------|--------|-------|---------|-------|--------|
| 0 | audio-bigbench-30s | - | vllm | None | 66.67 | None |
| 1 | audio-bigbench-nolimit | - | vllm | None | 62.60 | None |
| 2 | audio-translate-covost-en_de | en_de | vllm | None | 28.60 | None |
| 3 | audiobench-dream-tts-mcq-30s | - | vllm | None | 85.41 | None |
| 4 | audiobench-dream-tts-mcq-nolimit | - | vllm | None | 76.79 | None |
after (8B model):
| # | eval | subset | model | samples | score | tokens |
|---|------|--------|-------|---------|-------|--------|
| 0 | audio-bigbench-30s | - | vllm | None | 67.42 | None |
| 1 | audio-bigbench-nolimit | - | vllm | None | 65.10 | None |
| 2 | audio-translate-covost-en_de | en_de | vllm | None | 28.66 | None |
| 3 | audiobench-dream-tts-mcq-30s | - | vllm | None | 84.92 | None |
| 4 | audiobench-dream-tts-mcq-nolimit | - | vllm | None | 84.89 | None |
Rows 0, 2, and 3 are there as a sanity check; a difference of less than 1 point is usually not significant (especially on model-as-judge evals). "30s" means the subset of samples that are under 30 seconds long; "nolimit" means the full set, and those are the sets where we see 3 and 8 points of improvement.
A similar trend is seen on 70B, which reaches 90.30 on audio-bigbench-nolimit compared to the 82.9 we had reported before.
Thanks!
Surprised to see how big a leap a simple chunking strategy can achieve!
Just to clarify, the difference in metrics is not because of a "better" chunking strategy. It's just that, before this, we used to throw away any audio past 30 seconds. Any chunking strategy is probably better than no strategy 😅
Can you update tests/models/decoder_only/audio_language/test_ultravox.py back to using v0.5 as well?
The tests are finally passing, yay! The PR is ready to merge.
Nice, let's merge this then.
Hey guys, it looks like this PR broke the Ultravox LoRA tests. Both the V0 and V1 tests/lora/test_ultravox.py tests were failing.
I can't seem to repro the V0 test failure locally.
I can repro the V1 test failure: the test hits the assert at https://github.com/vllm-project/vllm/blob/53be4a863486d02bd96a59c674bbec23eec508f6/vllm/v1/worker/gpu_model_runner.py#L1380. If I remove the assert, the test works fine. @ywang96 @DarkLight1337 can you please take a look when you get a chance? Thanks!
cc @jeejeelee @robertgshaw2-redhat