[Model] Add Granite Speech Support
This PR adds support for Granite Speech models and is a port of the corresponding PR in Transformers. The model uses a conformer-based encoder with a Blip2 QFormer-based projector to encode the audio, and merges the resulting audio embeddings into a Granite LLM with a masked scatter. The model also uses an audio-specific LoRA adapter, which should only be enabled when the model is processing audio inputs; currently, this means the user needs to pass a LoRARequest every time they send audio.
It is probably a good idea to wait for the Transformers PR to be merged so that everything is aligned, but I am opening this PR in case anyone has feedback 🙂 Unfortunately, a model compatible with this PR is not publicly available yet; I am happy to submit a follow-up PR adding an example / docs + tests once one is out.
Some quirks that are good to be aware of, plus some slightly gross edge cases that I am actively looking into:
- The (rank 64) LoRA is bundled in the same directory as the model. At least in offline mode, the LoRA appears to load, but the LoRA layers add zero tensors, which results in unchanged outputs; I am still looking into this.
- The model is very sensitive. I haven't optimized the conformer implementation yet, and if possible it would be great to hold off on optimizing the conformer layers until we also have tests for alignment with HF once the model is released; the optimizations in the Granite LLM already seem to shift things a bit, and I still need to run a quality benchmark (after figuring out whatever is going on with the LoRA first!).
- Batching is a bit quirky because the HF processor does not use a feature attention mask and zero-pads prior to calculating the Mel features (i.e., the padding indices end up at small negative numbers that depend on the batch, though they are masked out in Transformers with a masked scatter, which is the most important thing). Since vLLM submits the static batch to the processor one instance at a time, the features come back unpadded; this PR handles that after the fact by zero-padding the 3D Mel features and torch-splitting the result, though maybe there is a better small negative value to use here.
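The workaround in the last bullet can be sketched as follows. This is a simplified illustration with made-up shapes (the helper name and dimensions are not from the PR), assuming each clip yields a `(1, n_frames, n_mels)` Mel feature tensor:

```python
import torch

def pad_and_stack(feats):
    """Zero-pad per-instance 3D Mel features to a common frame length,
    then stack them into one batch tensor; keep the true lengths so the
    batch can later be torch.split back into per-instance tensors."""
    lengths = [f.shape[1] for f in feats]
    max_len = max(lengths)
    padded = torch.cat(
        # Pad only the frame dimension (dim 1) on the right with zeros.
        [torch.nn.functional.pad(f, (0, 0, 0, max_len - f.shape[1])) for f in feats],
        dim=0,
    )
    return padded, lengths

# Two clips of different lengths, 80 Mel bins each.
feats = [torch.randn(1, 10, 80), torch.randn(1, 7, 80)]
batch, lengths = pad_and_stack(feats)
# Recover per-instance tensors after batched processing.
chunks = torch.split(batch, 1, dim=0)
```

Zero is used as the pad value here for simplicity; as noted above, a different small negative value might match the HF processor's behavior more closely.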
CC @DarkLight1337 @njhill @tlrmchlsmth
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
Thanks for opening this!
This model also uses an audio-specific LoRA adapter, which should only be enabled when the model is processing audio inputs. Currently, this means that the user needs to make a LoRARequest every time they send audio.
This is fine, it's somewhat like how Phi-4-multimodal is handled. Can you add this model to the examples so users know how to use it?
Also please update the supported models page, processor tests (tests/models/multimodal/processing/test_common.py) and test registry (tests/models/registry.py).
Sorry for the ping before finishing the first round of requested changes; I think force-pushing may have automatically re-requested code owner review! That all sounds good to me, and I will work on it ASAP now that the Transformers PR is merged.
Hey @DarkLight1337, I think this should be ready for another look when you have a moment!
The bug fix for the LoRA name parsing in https://github.com/vllm-project/vllm/pull/17196 is needed for this model to work properly, but things look aligned with the Transformers PR when this PR is rebased on top of it 🙂
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @alex-jw-brooks.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Otherwise LGTM. Have you verified that the model works correctly on your end?
Yup, things look right on my end! I pulled the audio asset fixtures from the ultravox tests into conftest and added a generation test under tests/models/decoder_only/audio_language/test_granite_speech.py. It won't currently run because Transformers hasn't cut 4.52 yet, but it does pass on my machine using the tip of Transformers.
Also realized the audio placeholder was missing for running online, so added that too 🙂