[Model] Extend Ultravox to accept audio longer than 30s
Currently the Ultravox model input is capped at 30 seconds and any extra audio is truncated (AFAIK). Also, each sample is fed to Whisper individually (without being batched).
This PR supports longer audio by first chunking it into 30-second segments, running the Whisper encoder in batch mode, and then concatenating the resulting embeddings.
TODO:
- [x] processors on HF still need to be updated in tandem with this PR.
- [x] run evaluations with the updated model to verify the changes.
FYI @NickLucche for the usage of Whisper
Thanks for the contrib! What is the chunking logic for tiling the audio? Feel free to link the hf processor PR.
re @NickLucche: Here's the processor link: https://huggingface.co/fixie-ai/ultravox-v0_3-llama-3_2-1b/blob/main/ultravox_processing.py#L209
The logic: each audio is split into 30-second chunks (the last chunk is not padded to 30s, same as before).
Then we flatten and batch everything and run Whisper as if the chunks were separate audios. We use audio_lens to compute an attention_mask for the last chunk of each audio. The final embeddings are then concatenated.
There are other ways we could've done this, but it matches what we do on the Ultravox side for both some fine-tuning that we do and evals. If we end up updating those, I'll update vLLM as well. A rough sketch of the flow is below.
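For illustration only, here is a minimal sketch of that chunk-batch-concatenate flow. It assumes a 16 kHz sample rate, and `whisper_encoder` is a stand-in for the feature extraction + encoder step; the helper names are hypothetical, not the actual processor code:

```python
import torch

SAMPLE_RATE = 16_000              # assumption: Whisper's standard 16 kHz input
CHUNK_SAMPLES = 30 * SAMPLE_RATE  # 30-second chunks

def chunk_audio(audio: torch.Tensor) -> tuple[list[torch.Tensor], list[int]]:
    """Split one waveform into 30s chunks; the last chunk is left unpadded."""
    chunks = list(torch.split(audio, CHUNK_SAMPLES))
    lens = [c.shape[-1] for c in chunks]
    return chunks, lens

def encode_batched(audios: list[torch.Tensor], whisper_encoder) -> list[torch.Tensor]:
    """Flatten chunks from all audios, run the encoder once in batch mode,
    then regroup and concatenate the embeddings per original audio."""
    all_chunks, chunk_counts = [], []
    for audio in audios:
        chunks, _ = chunk_audio(audio)
        all_chunks.extend(chunks)
        chunk_counts.append(len(chunks))

    # Pad chunks to a common length only so they can be stacked into one batch;
    # the real processor derives an attention mask from the true lengths
    # (audio_lens) so the padded tail of the last chunk is ignored.
    max_len = max(c.shape[-1] for c in all_chunks)
    batch = torch.stack(
        [torch.nn.functional.pad(c, (0, max_len - c.shape[-1])) for c in all_chunks])
    features = whisper_encoder(batch)  # one batched forward pass

    # Re-split per audio and concatenate the chunk embeddings along time.
    outputs, start = [], 0
    for n in chunk_counts:
        outputs.append(torch.cat(list(features[start:start + n]), dim=0))
        start += n
    return outputs
```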
Also, note that since we don't pad the last chunk, and since in most cases the audio is shorter than 30s, the number of frames does not match across samples. ~~I didn't see a collator anywhere that I could update. I suspect I'll have to update _process_audio_input further to handle that.~~ Updated _process_audio_input.
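To illustrate how a mask over the valid frames can be derived from the per-chunk lengths, here is a small sketch; `make_attention_mask` is a hypothetical helper and the frame counts are illustrative, not the actual _process_audio_input code:

```python
import torch

def make_attention_mask(num_frames: list[int], max_frames: int) -> torch.Tensor:
    """Build a (batch, max_frames) boolean mask that is True for valid frames.

    num_frames would be derived from audio_lens (samples -> encoder frames);
    only the last chunk of each audio is shorter than max_frames.
    """
    lens = torch.tensor(num_frames)
    positions = torch.arange(max_frames)
    return positions.unsqueeze(0) < lens.unsqueeze(1)

# Example: three chunks where the last one covers only 120 of 1500 frames.
mask = make_attention_mask([1500, 1500, 120], max_frames=1500)
print(mask.shape, mask.sum(dim=1))  # torch.Size([3, 1500]) tensor([1500, 1500, 120])
```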
Ok, I see, so that's naive chunking where you don't account for splitting mid-word, and there's no overlap or prompt carried over from the previous chunk.
This case seems much easier to handle on the vLLM side, given the changes are already in HF. Let's just make sure the batched Whisper forward is accounted for by the initial profiler run to avoid OOM.
Thanks for the comments. This PR is now ready for review.
For reference, I can confirm that the evals have improved.
before (8B model):
| # | eval | subset | model | samples | score | tokens |
|---|------|--------|-------|---------|-------|--------|
| 0 | audio-bigbench-30s | - | vllm | None | 66.67 | None |
| 1 | audio-bigbench-nolimit | - | vllm | None | 62.60 | None |
| 2 | audio-translate-covost-en_de | en_de | vllm | None | 28.60 | None |
| 3 | audiobench-dream-tts-mcq-30s | - | vllm | None | 85.41 | None |
| 4 | audiobench-dream-tts-mcq-nolimit | - | vllm | None | 76.79 | None |
after (8B model):
| # | eval | subset | model | samples | score | tokens |
|---|------|--------|-------|---------|-------|--------|
| 0 | audio-bigbench-30s | - | vllm | None | 67.42 | None |
| 1 | audio-bigbench-nolimit | - | vllm | None | 65.10 | None |
| 2 | audio-translate-covost-en_de | en_de | vllm | None | 28.66 | None |
| 3 | audiobench-dream-tts-mcq-30s | - | vllm | None | 84.92 | None |
| 4 | audiobench-dream-tts-mcq-nolimit | - | vllm | None | 84.89 | None |
Rows 0, 2, and 3 are there as a sanity check; a difference of less than 1 point is usually not significant (especially on model-as-judge evals). "30s" means the subset of samples that are under 30 seconds long; "nolimit" means the full set, and those are the sets where we see 3 and 8 points of improvement.
A similar trend is seen on 70B, which reaches 90.30 on audio-bigbench-nolimit compared to the 82.9 we had reported before.
Thanks!
Surprised to see how big a leap a simple chunking strategy can achieve!
Just to clarify, the difference in metrics is not because of a "better" chunking strategy. It's just that, before this, we used to throw away any audio past 30 seconds. Any chunking strategy is probably better than no strategy 😅
Can you update tests/models/decoder_only/audio_language/test_ultravox.py back to using v0.5 as well?
The tests are finally passing, yay! The PR is ready to merge.
Nice, let's merge this then.
Hey guys, it looks like this PR broke the Ultravox LoRA tests. Both the V0 and V1 tests/lora/test_ultravox.py tests were failing.
I can't seem to repro the V0 test failure locally.
I can repro the V1 test failure: the test hits the assert at https://github.com/vllm-project/vllm/blob/53be4a863486d02bd96a59c674bbec23eec508f6/vllm/v1/worker/gpu_model_runner.py#L1380. If I remove the assert, the test works fine. @ywang96 @DarkLight1337 can you please take a look when you get a chance? Thanks!
cc @jeejeelee @robertgshaw2-redhat