
[RFC]: Multi-modality Support on vLLM

Open ywang96 opened this issue 1 year ago • 98 comments

[Open issues - help wanted!]

Update [11/18] - In the upcoming months, we will focus on performance optimization for multimodal models as part of the vLLM V1 engine re-arch effort

P0 (We will definitely work on them):

  • [ ] V1 re-arch for multimodal models - See high-level design (Slides, Doc)
    • [ ] Core
      • [x] [1/N] #9871
      • [x] [2/N] #10374
      • [x] [3/N] #10570
      • [x] [4/N] #10699
      • [x] [5/N] #11210
      • [x] [6/N] #12128
      • [x] [7/N] Enable rest of single-modality LMMs on V1
        • [x] #11632 (Aria, BLIP-2, Chameleon, Fuyu)
        • [x] #14275
        • [x] #11685
        • [x] #11733
        • [x] #12069
        • [x] #12504
        • [x] #12660
      • [x] [8/N] Enable mixed-modality inference on V1
        • [x] #11685
        • [x] #12259
    • [x] Multimodal prefix caching
      • [x] #10507
      • [x] #11187
      • [x] #11646
    • [ ] Multimodal input & embedding caching
      • [x] #11020
      • [x] #11396
      • [ ] Reuse multimodal embeddings from encoder cache
  • [ ] #10114

P1 (We should be aware of these and spend some time if possible):

  • [ ] More efficient multimodal input data processing
  • [ ] Quantization for LMMs
  • [ ] LoRA for LMMs
    • [ ] #8802
    • [x] #9495
    • [ ] LoRA for VLM2Vec
  • [ ] Consolidate ViT attention backend
  • [ ] V1 spec decode for VLMs
  • [ ] Update developer facing documentation for V1 re-arch multimodal models.
    • [x] #11998

P2 (We should work on these when they become more important/frequently requested):

  • [x] Update OpenAI-compatible server to use OpenAI Audio API
    • [x] #11027
  • [ ] Support HuggingFace multimodal chat format for llm.chat() API
  • [ ] Next steps for Multimodal Llama
  • [ ] Better encoder cache & compute budget strategy
    • [x] #11895
  • [ ] Better profiling strategy
  • [ ] Prototype separating vision encoder to its own worker (fully disaggregated from decoder)

Update [9/8] - We have finished the majority of the refactoring and made extensive progress on supporting multimodal models. See details here.

Roadmap for Q3 2024

In the upcoming months, we will focus on enabling multimodal models to be compatible with other performance-related features on vLLM as well as collaborating with model vendors to directly onboard new multimodal models.

P0 (We will definitely work on them):

  • #10114
    • #10040
    • #10044
    • [2/N] Convert LLaVA-1.5, Phi-3-Vision, Qwen2-VL and Ultravox to multi-modal processor as POC and add tests
    • [3/N] Deprecate the old code for input processor/mapper so external developers have time to convert
    • [4/N] Convert the rest of the built-in vLLM models to multi-modal processor
    • [5/N] Remove the old code for input processor/mapper
  • Proper chunked prefill with multimodal input
    • #8346
    • #9950
  • Prefix caching with multimodal input
    • #8348
  • Enable Flamingo-style multimodal models (e.g., Multimodal Llama)
    • #8811
    • #8822
  • Fully enable video input, and therefore, mixed multi-modal input
    • #7559
    • #10020
  • Update OpenAI-compatible server to use OpenAI Audio API
  • Multimodal embedding models
    • #9303
    • #9576
    • #9759
    • #9912
    • #9944
    • #9919
  • Shepherd model support directly from model vendor

P1 (We should be aware of these and spend some time if possible):

  • Better profiling strategy for multimodal models
  • Multi-input support for more compatible models
    • Chameleon
    • #8201
    • LLaVA-NeXT-Video
    • #8905
  • Better developer facing documentation for adding new models
  • Add more multimodal models, and shepherd model support from community contributions
  • Misc bug fixes

P2 (We should work on these when they become more important/frequently requested):

  • Multimodal models with LoRA
    • #7585
    • #7199
    • #8943
    • #9622
    • #10022
    • #10281
    • #8802
    • #9495
    • LoRA for VLM2Vec
  • Quantized multimodal models
    • #9217
    • #9772
    • #9720
    • #9812
    • #9891
    • #9921
  • Refactor currently supported multimodal models for dynamic ViT&LM loading
    • #7153
    • #8407
  • Enable LM-only loading for multimodal models that support embeddings as input
  • Multimodal benchmarking (Online & Offline)
    • #8495
    • #9851
    • #10287
  • PP for multimodal models
    • #8696
    • #7168
  • Extra input mapper/processor kwargs
    • #8657
    • #8658
    • #8946
    • #8856
    • #9131
  • OOT multimodal models
    • #8717

Update [7/3] - We have finished our 2nd refactoring milestone - see details here.

Roadmap for 3rd Milestone

In the upcoming months, we will focus on wrapping up the main goal of this refactoring RFC and supporting more models and modalities.

P0 (We will definitely work on these):

  • Support image embeddings as input
    • #6613
    • Support image embeddings for Fuyu and MiniCPM-V
  • Support multiple multi-modal inputs whenever the model supports it (detailed plan)
    • #7126
    • #7230
    • #7783
    • #7902
    • #7963
    • Multi-image support for Chameleon & InternVL
    • #8049
  • Merge at least 3 VLMs from the currently opened PRs
    • #5770, #6633
    • #5920
    • #4087
    • #3924
    • #5817
    • #6514
  • Better documentation
    • #8181

P1 (We should be aware of these and spend some time if possible):

  • Aid support for Whisper with multimodal interface
    • #5964
  • Custom vision prompt template in OpenAI-compatible server
  • Sharding Vision Encoder & MultiModalProjector
    • #7186
  • Bug Fixes
  • Add more VLMs - See full List of vision models to implement
  • Better error handling
    • #7998
    • #8028
    • Follow-up to #8028

P2 (We should work on these when they become more frequently requested) Help wanted!:

  • Port over more vision encoders
    • #6942
    • #7020 (Idefics2VisionTransformer)
  • Dynamic vision encoder and LM backbone
    • #7067
    • #7153
    • BLIP-2 w/ FLAN-T5
      • #8407
      • #3117
  • VLM with Lora
    • #7199
  • Quantized VLMs
    • #7187
  • Add/aid support for models with other modalities
    • #7446
    • #7615
    • #7559
  • Enable other features in vLLM with multi-modal models (e.g., chunked prefill, automatic prefix caching)
    • #8098

Update [6/11] - We have finished our 1st refactoring milestone - see details here.

Roadmap for 2nd Milestone

Some of the items @DarkLight1337, @xwjiang2010 and I are looking to work on as part of the next milestone are tentatively:

API Changes: A list of user-facing breaking changes can be found here

  • Completely remove the need for specifying image related arguments when launching the server, and infer configs from the model repo or a configmap in vLLM.
    • #5852
    • #6089
    • #6121
  • Support dynamic image shape - This means the scheduler will need to know in advance the final shape of multi-modal embeddings that are processed right before being passed to the language model.
    • #5214
    • #5276

Performance related

  • Port CLIPVisionModel
    • #5591
    • #5717
  • Optimize CLIPAttention
  • Optimize MultiModalProjector
  • Blocks: #5481

Model support - Add more vision-language models, and better developer-facing documentation

Some of the ideas that we should work on in the future:

  • Make VLMs work with chunked prefill
  • Unify tokenizer & multi-modal processor (so that we can leverage AutoProcessor from transformers)
  • Prefix caching for images
  • Streaming inputs of multi-modal data

As always, please provide feedback and feature requests in this issue. Suggestions and contributions are very welcome!


Original RFC

Multi-modality support was brought to vLLM recently, thanks in large part to https://github.com/vllm-project/vllm/pull/3042 from @xwjiang2010. Since then, we have seen an increasing amount of interest in such models (judging from the number of related pull requests and issues). However, there are a few issues we should address with the current design before we bring in more features around multi-modality.
  1. VisionLanguageConfig and MultiModalData

    • Currently, the multimodal input can be either pixel_values or image_features for simplicity. While this works well with LLaVA-1.5, where pixel_values are the only output from its CLIPImageProcessor, it does not work well when it comes to supporting models with more complicated preprocessing that returns multiple outputs (e.g., LLaVA-1.6, Fuyu, etc.). Developers could add additional preprocessing inside the model implementation as a workaround, but this will be unmaintainable over time.

    • The overhead of requiring image_feature_size, image_token_id and image_input_shape is pushed to the user, when these can and should be inferred from the model & processor config and should not be required at inference time.

  2. The current design assumes multi-modal inputs are already processed to be consumed by the model executable, but vLLM does not have a processor util. This blocks the vision model support on the OpenAI API server for end-to-end inference.

  3. The current prompt format "<Image>" * 576 + prompt makes the underlying implementation easier (especially when it comes to profiling), but complicates the user experience compared to the HuggingFace format "<Image>\n" + prompt, and that has caused some confusion about what's needed to make multi-modal models work on vLLM.

Proposal

Most items in the above issues have been discussed and addressed in the original LLaVA-1.5 PR as well as https://github.com/vllm-project/vllm/pull/3978. We propose a few high-level design decisions for the refactoring and welcome any feedback!

  1. Adding a processor util - We can leverage the out-of-the-box AutoProcessor from transformers the same way we have been using the tokenizer, as an attribute of LLMEngine (e.g., self.multi_modal_processor = AutoProcessor(model)). This allows us to support end-to-end inference with the API server as well as the LLM object.
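For illustration, loading such a processor could look like the following (a sketch only; the LLaVA repo ID is just an example model):

from transformers import AutoProcessor

# The processor bundles the tokenizer and the image processor for the model.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
# Inside LLMEngine this would be stored as e.g. self.multi_modal_processor.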

  2. Frontend input format: Because of 1, we can keep the same format as HuggingFace, since that's how users usually discover new models and it makes end-to-end integration tests easier. Preprocessing should be hidden away from the interface and the user. For example, this preprocessing step can be done inside LLMEngine.add_request() around the same place as https://github.com/vllm-project/vllm/blob/a134ef6f5e6c24d3cd459c63557e5db276db25b2/vllm/engine/llm_engine.py#L385-L391. Here's some pseudocode:

if multi_modal_input is None:
    prompt_token_ids = self.encode_request(
        request_id=request_id,
        prompt=prompt,
        prompt_token_ids=prompt_token_ids,
        lora_request=lora_request)
else:
    # preprocessed_inputs is a dictionary of key (str) -> value (tensor)
    # produced by self.multi_modal_processor
    preprocessed_inputs = self.preprocess_request(
        request_id=request_id,
        prompt=prompt,
        prompt_token_ids=prompt_token_ids,
        lora_request=lora_request,
        multi_modal_input=images)
    prompt_token_ids = preprocessed_inputs.pop("input_ids")
    multi_modal_data = MultiModalData(data=preprocessed_inputs)
...

and thus, at the LLM level, only image tensors will be required.

  3. Refactor MultiModalData: Now this object simply holds the multi-modal data dictionary that we need for the model_executable. At inference time, data is unpacked in the forward pass - this approach is similar to the transformers implementation of multi-modal models. (See the sketch after this list.)
  4. Refactor VisionLanguageConfig: This config is a lot simpler now. One caveat is that sometimes, when the image features can be dynamic, users may specify an optional max_feature_size to help the engine run profiling for the worst-case scenario as well as to potentially abort certain requests.
  5. Regarding the original image_feature as input type design: IMO LLaVA is a special case among multi-modal models since its vision encoder is detached from the language model and can be initialized separately, but one could argue the same for the MultiModalProjector, and perhaps passing image_feature (the outputs of CLIP) is a design decision that is not generalizable to all other models. Instead, passing multi-modal embeddings (outputs of CLIP -> Projector) at inference time is more flexible and should work nicely with other models. (One follow-up question: does it make sense to define a separate LLaVA-no-CLIP module, since this is so specific to LLaVA, to make our life easier?)
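For concreteness, here is a minimal sketch (an assumption about the eventual API shape, not final vLLM code) of what the refactored MultiModalData in item 3 could look like:

from dataclasses import dataclass
from typing import Dict

import torch


@dataclass
class MultiModalData:
    """Thin container for the already-preprocessed multi-modal tensors."""
    data: Dict[str, torch.Tensor]

    def unpack(self) -> Dict[str, torch.Tensor]:
        # The model's forward() receives these entries as keyword arguments,
        # e.g. pixel_values=..., image_sizes=...
        return dict(self.data)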

With the above changes, as an end user, you should ideally be able to do something like the following:

import requests
from PIL import Image

from vllm import LLM
from vllm.config import VisionLanguageConfig

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
llm = LLM(model=model_id, multi_modal_input_type=VisionLanguageConfig.IMAGE_INPUT_TYPE.IMAGE)  # This can also be EMBEDDINGS

prompt = "<image>\nUSER: What's the content of the image?\nASSISTANT:"

url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

llm.generate(prompt, ..., multi_modal_input=image)

Under the hood, the pipeline is:

prompt, image
-> prompt_token_ids, MultiModalData(data=preprocessed_inputs)  # through preprocessing within engine.add_request()
-> prompt_token_ids, pixel_values, image_sizes  # through unpacking in the implementation of the model's `forward`

I will follow up with a series of PRs for the refactoring, but please leave any feedback since this is a pretty significant interface change.

ywang96 avatar Apr 19 '24 07:04 ywang96

cc @DarkLight1337 @Isotr0py @alsichcan

ywang96 avatar Apr 19 '24 07:04 ywang96

Thank you for kickstarting this conversation!

Re: Issues

I fully agree with the issues you have pointed out. I would like to add that the current prompt format is hardly extensible for multi-image input if we plan to pursue that further down the line. In #3978, I proposed some ways of tackling the issue at the level of the OpenAI-compatible server. I have thought about them more and decided that they alone cannot provide the required flexibility, as explained below:

If there are only a small number of standard methods, we can provide a config option to choose which method to apply. I have added the image_openai attribute to VisionLanguageConfig to facilitate this.

I am not confident that this assumption would hold for very long, given the fast-changing pace of the field.

A more flexible option would be to pass the image(s) to the chat template (e.g. by setting the images attribute alongside role and content). This transfers the burden of implementation to the maintainers of the model on HuggingFace, making it more likely that vLLM users have to implement their own template. I have created ConversationMessage class to represent the dictionary for each message.

I feel that this should be limited to cases where we only have to pass a single <image> token. The requirement of duplicating image tokens according to feature size should not be a concern of the chat template.

This is not to mention that you still have to manually duplicate the <image> tokens when using vLLM engine directly.

Re: Proposals

Here are my own thoughts on each proposal:

1. Adding a processor util

I think that we should move this responsibility outside of the Engine class. This is because multi-modal input isn't necessarily limited to image data, so we should expect more data types to be added in the future. To avoid having to modify the core Engine logic each time, we can wrap the data with processor objects (with a common interface to process the data) before passing them into the Engine. This way, we can easily add new data types by simply defining a new processor class. For your reference, I have implemented this pattern in #4197.
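To illustrate (a rough sketch only; these class names are hypothetical and not the actual code in #4197, and the wrapped processor is assumed to be an image processor such as CLIPImageProcessor):

from abc import ABC, abstractmethod
from typing import Dict

import torch


class MultiModalDataWrapper(ABC):
    """Common interface: each data type knows how to turn raw input into tensors."""

    @abstractmethod
    def process(self) -> Dict[str, torch.Tensor]:
        ...


class ImageData(MultiModalDataWrapper):
    def __init__(self, image, hf_image_processor):
        self.image = image
        self.hf_image_processor = hf_image_processor  # e.g. a CLIPImageProcessor instance

    def process(self) -> Dict[str, torch.Tensor]:
        # Keys of the returned dict should match the model's forward() parameters.
        return dict(self.hf_image_processor(images=self.image, return_tensors="pt"))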

2. Frontend input format

My comments on this are similar for Proposal 1. However, #4197 only refactors MultiModalData to define data processing logic. To avoid excessive duplication of the logic of encode_request, we should find a way to let MultiModalData control only parts of the process. Also, in my idea of MultiModalData, the processing logic should remain independent of the model architecture. I guess this is where Proposal 3 comes in: HuggingFace processors should output dictionaries with keys that match the parameter names of model.forward().

3. Refactor MultiModalData

I have refactored this class in #4197 according to this description, and it works well enough to support the image_size parameter of LLaVA-NeXT as shown in #4199.

4. Refactor VisionLanguageConfig

Currently in #4197, MultiModalData has to accept ModelConfig and VisionLanguageConfig separately. Perhaps we can make VisionLanguageConfig an attribute of ModelConfig so we do not have to pass in multiple parameters. Using this approach, we only have to add more attributes to ModelConfig instead of having to pass more config objects around in order to support additional multi-modal data types.

Regarding max_feature_size, refer to my comments on Proposal 5.

5. Regarding the original image_feature as input type design

Instead of indirectly specifying the input shapes through the config, we can have each model implement a method to return a dictionary (the required input shape for each keyword argument). For LLaVA, the feature size can be inferred from the HuggingFace config.json if we consider image size and patch size. To support profiling, we can slightly extend this to have the model define the maximum possible input shapes.
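For example, a hedged sketch of what a model could expose (the class and method names are hypothetical; the numbers are the LLaVA-1.5 defaults):

from typing import Dict, Tuple


class LlavaShapeInfo:
    """Illustration only: shape hints a model could expose for profiling."""
    image_size = 336
    patch_size = 14

    @classmethod
    def max_image_feature_size(cls) -> int:
        # (336 // 14) ** 2 = 576 image tokens per image for LLaVA-1.5.
        return (cls.image_size // cls.patch_size) ** 2

    @classmethod
    def max_multimodal_input_shapes(cls) -> Dict[str, Tuple[int, ...]]:
        # Worst-case shape per keyword argument of forward(), used by the profiler.
        return {"pixel_values": (1, 3, cls.image_size, cls.image_size)}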

Is the unconventional prompt format "<image>" * image_feature_size + prompt mainly to support profiling? While implementing LLaVA-NeXT, I was under the impression that this is used to simplify the generation of the attention masks. Perhaps @xwjiang2010 would have a better idea.

DarkLight1337 avatar Apr 19 '24 13:04 DarkLight1337

@ywang96 Thanks for driving the integration of more MM models into VLLM. :heart_eyes:

It seems that there is no plan to refactor the vision encoder (a TODO in LLaVA).

In my view, we should prioritize this, with performance being my main consideration.

By refactoring the vision encoder, we can establish an integration standard for MM models, similar to our LLM model integration. This will not only ensure inference performance but also provide integration guidelines for the community.

If I misunderstand, please correct me. Thanks for your work again!

jeejeelee avatar Apr 19 '24 16:04 jeejeelee

Generally, I agree with @DarkLight1337's opinion about moving processing logic out of the Engine to prevent modifying core code frequently. However, I think it's difficult to keep the processing logic fully independent of the model architecture.

For example, FuyuProcessor and Idefics2Processor will pad input_ids with image_feature_size during preprocessing, while LlavaProcessor won't (I guess this is also why "<image>" * image_feature_size + prompt is used for LLaVA). This means that we need to pad input_ids for LLaVA manually. (Maybe there is a better way to handle this? 🤔)

Isotr0py avatar Apr 19 '24 17:04 Isotr0py

cc @robertgshaw2-neuralmagic @mgoin (since NM's planned to work on whisper)

Thank you all for the feedback so far! I plan to address feedback altogether after meeting up with the core devs as well as getting more perspectives from other community members who are working/plan to work on multi-modal models.

Some quick ones that I can answer now:

It seems that there is no plan to refactor the vision encoder (a TODO in LLaVA).

@jeejeelee This will need to be done regardless since it's inside the model implementation, and this RFC is more around how we want to support multi-modal models in general, and thus focuses on the interface and component pattern.

However, I think it's difficult to keep the processing logic fully independent of the model architecture.

@DarkLight1337 @Isotr0py If this is just about where the processor should live, I'm indifferent between having it live inside LLMEngine or not. The tricky part IMO is that we would then need to rework the interface of LLMEngine to consume the outputs of AutoProcessor as-is.

I was under the impression that this is used to simplify the generation of the attention masks.

@DarkLight1337 That's correct too, but I'm worried that as the model gets more and more complicated, this approach might not be generalizable.

ywang96 avatar Apr 19 '24 19:04 ywang96

Since LLMEngine has support for an output processor interface (e.g. SequenceGroupOutputProcessor), would it be reasonable to also add an InputProcessor interface within the engine?

This way the engine can check for the existence of an input processor, but the implementation (in this case, LLaVA's single-image processing) can live outside of the engine. Its implementation could be as suggested, based on AutoProcessor.

As for supporting processing of something other than an image tag, or varying formats: the engine could have only a generic input processor executor, and within the model executor's code it would be up to the model implementation to define an input processor and pass it on to the engine.

imarcin-rbx avatar Apr 19 '24 19:04 imarcin-rbx

Generally, I agree with @DarkLight1337's opinion about moving processing logic out of the Engine to prevent modifying core code frequently. However, I think it's difficult to keep the processing logic fully independent of the model architecture.

For example, FuyuProcessor and Idefics2Processor will pad input_ids with image_feature_size during preprocessing, while LlavaProcessor won't (I guess this is also why "<image>" * image_feature_size + prompt is used for LLaVA). This means that we need to pad input_ids for LLaVA manually. (Maybe there is a better way to handle this? 🤔)

@Isotr0py Perhaps we could follow a registry pattern and have each model separately register how to preprocess the inputs? If the model does not do so, then the default implementation would be to pass the data to HuggingFace processors.
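For example (a simplified sketch; the names INPUT_PROCESSOR_REGISTRY and register_input_processor are hypothetical, not existing vLLM code):

from typing import Any, Callable, Dict

# model_type -> function that turns (prompt, image, hf_processor) into model inputs
INPUT_PROCESSOR_REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {}


def register_input_processor(model_type: str):
    def wrapper(fn):
        INPUT_PROCESSOR_REGISTRY[model_type] = fn
        return fn
    return wrapper


@register_input_processor("llava")
def process_llava_inputs(prompt: str, image, hf_processor) -> Dict[str, Any]:
    # Model-specific tweaks (e.g. expanding/padding image tokens) would go here.
    return dict(hf_processor(text=prompt, images=image, return_tensors="pt"))


def process_inputs(model_type: str, prompt: str, image, hf_processor) -> Dict[str, Any]:
    fn = INPUT_PROCESSOR_REGISTRY.get(model_type)
    if fn is None:
        # Default: pass the data straight to the HuggingFace processor.
        return dict(hf_processor(text=prompt, images=image, return_tensors="pt"))
    return fn(prompt, image, hf_processor)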

DarkLight1337 avatar Apr 20 '24 02:04 DarkLight1337

@Isotr0py Perhaps we could follow a registry pattern and have each model separately register how to preprocess the inputs? If the model does not do so, then the default implementation would be to pass the data to HuggingFace processors.

Yes, I agree that we can use processor registry to solve this. And it seems that transformers_utils/configs could be a good reference for this.

Isotr0py avatar Apr 20 '24 03:04 Isotr0py

@Isotr0py Perhaps we could follow a registry pattern and have each model separately register how to preprocess the inputs? If the model does not do so, then the default implementation would be to pass the data to HuggingFace processors.

Yes, I agree that we can use processor registry to solve this. And it seems that transformers_utils/configs could be a good reference for this.

I have added an implementation of the processor registry to #4197.

Edit: I have also moved the specification of dummy data (for profiling) to the top-level registry. Each model can define its own dummy data by registering a factory function.

DarkLight1337 avatar Apr 22 '24 08:04 DarkLight1337

2. Frontend input format

My comments on this are similar for Proposal 1. However, #4197 only refactors MultiModalData to define data processing logic. To avoid excessive duplication of the logic of encode_request, we should find a way to let MultiModalData control only parts of the process. Also, in my idea of MultiModalData, the processing logic should remain independent of the model architecture. I guess this is where Proposal 3 comes in: HuggingFace processors should output dictionaries with keys that match the parameter names of model.forward().

To solve the prompt format problem for LLaVA, I think we have to also deal with generating the attention masks in the processing framework. That would mean abstracting some of the logic of ModelRunner._prepare_prompt.

DarkLight1337 avatar Apr 22 '24 10:04 DarkLight1337

Just a heads up that #4228 will introduce another vision language model to vLLM, so our discussion should take that into account as well.

DarkLight1337 avatar Apr 22 '24 10:04 DarkLight1337

I discussed with @zhuohan123 offline about this - in particular regarding this comment

To avoid having to modify the core Engine logic each time, we can wrap the data with processor objects (with a common interface to process the data) before passing them into the Engine.

If vLLM's going to use out-of-box AutoProcessor (which includes tokenizer) anyways, then it's logical to make it an attribute of the engine (similar to what we did with tokenizer). As of now for the sake of simplicity, we could add something like self.processor = AutoProcessor(model_id) to this section if the model is an MM model. https://github.com/vllm-project/vllm/blob/15436806912d7ad9371c8bcf6a46857590c107d2/vllm/engine/llm_engine.py#L136-L139

Then at inference time, depending on whether the request has multi-modal data or not, we process it with either self.tokenizer or self.processor.

(IMO eventually, there really shouldn't be a separation between how we preprocess text data and multi-modal data as they should all go through one InputProcessor class, but that is probably a bigger engineering refactoring that we can leave for later.)

We can also add an additional parameter on the engine level to indicate that we're feeding the engine an already processed dictionary of tensors, so the preprocessing step with self.processor will be skipped. (Very similar to prompt vs prompt_token_ids)
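To make this concrete, a minimal sketch of that branch (assumed function and parameter names, not the actual LLMEngine code):

from typing import Any, Dict, Optional


def preprocess_request(prompt: str,
                       multi_modal_input: Optional[Any],
                       tokenizer,
                       processor) -> Dict[str, Any]:
    if multi_modal_input is None:
        # Text-only request: the plain tokenizer is enough.
        return {"prompt_token_ids": tokenizer(prompt).input_ids}
    # Multi-modal request: AutoProcessor bundles the tokenizer, so one call
    # produces input_ids plus the model-specific tensors (e.g. pixel_values).
    return dict(processor(text=prompt, images=multi_modal_input,
                          return_tensors="pt"))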

@DarkLight1337 @Isotr0py WDYT? Do you see any issue with this design?

ywang96 avatar Apr 22 '24 17:04 ywang96

I discussed with @zhuohan123 offline about this - in particular regarding this comment

To avoid having to modify the core Engine logic each time, we can wrap the data with processor objects (with a common interface to process the data) before passing them into the Engine.

If vLLM's going to use out-of-box AutoProcessor (which includes tokenizer) anyways, then it's logical to make it an attribute of the engine (similar to what we did with tokenizer). As of now for the sake of simplicity, we could add something like self.processor = AutoProcessor(model_id) to this section if the model is an MM model.

https://github.com/vllm-project/vllm/blob/15436806912d7ad9371c8bcf6a46857590c107d2/vllm/engine/llm_engine.py#L136-L139

Then at inference time, depending on whether the request has multi-modal data or not, we process it with either self.tokenizer or self.processor.

(IMO eventually, there really shouldn't be a separation between how we preprocess text data and multi-modal data as they should all go through one InputProcessor class, but that is probably a bigger engineering refactoring that we can leave for later.)

We can also add an additional parameter on the engine level to indicate that we're feeding the engine an already processed dictionary of tensors, so the preprocessing step with self.processor will be skipped. (Very similar to prompt vs prompt_token_ids)

@DarkLight1337 @Isotr0py WDYT? Do you see any issue with this design?

This is somewhat similar to #4166 where I load the processing logic using AutoProcessor instead of AutoTokenizer for testing the HuggingFace implementation.

I think one potential issue of this design is that the direct dependency on HuggingFace (which we have no control over) would complicate efforts to apply additional preprocessing specific to certain HuggingFace processors (e.g. to adapt to our interface).

Since @Isotr0py 's comment, I have refactored the code in #4197 into using a registry pattern to apply the preprocessor, so that MultiModalData class itself no longer has any preprocessing logic.

DarkLight1337 avatar Apr 23 '24 00:04 DarkLight1337

@DarkLight1337 Thanks for sharing the thoughts! @zhuohan123 and I actually discussed the use of AutoProcessor.

I think the point is that vLLM already relies on AutoTokenizer today, and most of the model implementations we have in vLLM are based on the implementations of those models in transformers, so I don't really think having this dependency is a big issue. Using AutoProcessor also allows us to abstract away from images in particular, so that the same interface will work for other modalities (e.g., Whisper) as well.

The original design of the prompt interface isn't very clean and is very specific to LLaVA-1.5. I would like to emphasize that not every MM model has a "vision tower + projector + LM" architecture, so IMO the input format should really be one of raw inputs (images), processed inputs (outputs of the AutoProcessor), or embeddings (prompt embeddings + MM embeddings).

I will also be working on a PR so we can cross review each other's work.

ywang96 avatar Apr 23 '24 06:04 ywang96

One thing to add is that we would like to keep vLLM's end-user API easy to use. Having AutoProcessor outside of vLLM requires the user to create and pick the correct processor for the specific model they are using, which can be error-prone. So I lean towards having AutoProcessor in vLLM, so that an end user can directly feed the raw image (e.g. a JPEG image) to vLLM.

zhuohan123 avatar Apr 23 '24 06:04 zhuohan123

@DarkLight1337 Thanks for sharing the thoughts! @zhuohan123 and I actually discussed the use of AutoProcessor.

I think the point is that vLLM already relies on AutoTokenizer today, and most of the model implementations we have in vLLM are based on the implementations of those models in transformers, so I don't really think having this dependency is a big issue. Using AutoProcessor also allows us to abstract away from images in particular, so that the same interface will work for other modalities (e.g., Whisper) as well.

The original design of the prompt interface isn't very clean and is very specific to LLaVA-1.5. I would like to emphasize that not every MM model has a "vision tower + projector + LM" architecture, so IMO the input format should really be one of raw inputs (images), processed inputs (outputs of the AutoProcessor), or embeddings (prompt embeddings + MM embeddings).

I will also be working on a PR so we can cross review each other's work.

In this case, we would have to refactor the computation of attention masks so that it can accept a single <image> token for LLaVA, since that is what its HuggingFace processor expects. How can we integrate this into vLLM's computation of the attention masks?

DarkLight1337 avatar Apr 23 '24 06:04 DarkLight1337

Regarding #4228, I think there may be situations where some MM models don't have a Processor implemented.

In this case, we would have to refactor the computation of attention masks so that it can accept a single <image> token for LLaVA, since that is what its HuggingFace processor expects.

@DarkLight1337 IMO, one solution may be to inherit from and modify the LLaVA processor to handle the num_features calculation, input_ids padding, etc., so that it produces the right attention masks with the current attention-mask computation code.
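A rough sketch of that idea (for illustration only; the class name and hard-coded feature size are assumptions, and newer transformers releases may already expand the image tokens themselves):

from transformers import LlavaProcessor


class PaddedLlavaProcessor(LlavaProcessor):
    # LLaVA-1.5 at 336x336 with 14x14 patches -> (336 // 14) ** 2 = 576 image tokens.
    image_feature_size = 576

    def __call__(self, text=None, images=None, **kwargs):
        if isinstance(text, str) and images is not None:
            # Expand the single "<image>" placeholder so that input_ids already
            # contain one token per image feature, matching the current
            # attention-mask computation.
            text = text.replace("<image>", "<image>" * self.image_feature_size)
        return super().__call__(text=text, images=images, **kwargs)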

Isotr0py avatar Apr 23 '24 07:04 Isotr0py

Regarding #4228, I think there may be situations where some MM models don't have a Processor implemented.

In this case, we would have to refactor the computation of attention masks so that it can accept a single <image> token for LLaVA, since that is what its HuggingFace processor expects.

@DarkLight1337 IMO, one solution may be to inherit from and modify the LLaVA processor to handle the num_features calculation, input_ids padding, etc., so that it produces the right attention masks with the current attention-mask computation code.

I like the idea of simply inheriting from the existing HuggingFace processor. How should we ensure that our implementation is loaded instead of the HuggingFace one?

DarkLight1337 avatar Apr 23 '24 07:04 DarkLight1337

Also, I think that we should wrap the input prompt to LLM.generate in order to better distinguish the kwargs to pass to the HF processor from the other arguments to LLM.generate. It is rather awkward right now that we have to pass a list of multi-modal data whose length equals the number of input prompts. If we use the HF processor directly, the multi-modal inputs would become part of those kwargs instead of a separate MultiModalData instance.

Edit: Opened #4328

DarkLight1337 avatar Apr 23 '24 07:04 DarkLight1337

~~I have noticed when using distributed inference on LLaVA-NeXT (#4199), there is a bug where the image tokens are not sent to the workers, resulting in an error when trying to merge the vision embeddings. This doesn't happen with LLaVA-1.5 because the model can be loaded inside a single GPU. Does anyone have a setup where LLaVA-1.5 is loaded across multiple GPUs to check whether this issue occurs in the existing vLLM code as well?~~

Edit: Nevermind, it's just a typo in the chat template I passed to the command for running the OpenAI-compatible server. To avoid such confusion in the future, I have opened #4292 to detect whether the string looks like a file path.

DarkLight1337 avatar Apr 23 '24 09:04 DarkLight1337

How should we ensure that our implementation is loaded instead of the HuggingFace one?

I think we can refer to get_config() in transformers_utils/config.py, but search the registered processors first and then fall back to AutoProcessor, so that get_processor() could be:

from typing import Optional

from transformers import AutoProcessor
from transformers.processing_utils import ProcessorMixin

_PROCESSOR_REGISTRY = {}  # model_type -> custom processor class


def get_processor(model: str,
                  model_type: str,
                  trust_remote_code: bool,
                  revision: Optional[str] = None,
                  code_revision: Optional[str] = None) -> ProcessorMixin:
    # Prefer a vLLM-registered processor for this model type, if any.
    if model_type in _PROCESSOR_REGISTRY:
        processor_class = _PROCESSOR_REGISTRY[model_type]
        processor = processor_class.from_pretrained(model,
                                                    revision=revision,
                                                    code_revision=code_revision)
        return processor
    try:
        processor = AutoProcessor.from_pretrained(
            model,
            trust_remote_code=trust_remote_code,
            revision=revision,
            code_revision=code_revision)
        return processor
    except ValueError as e:
        # do something else (e.g. raise a clearer error), mirroring get_config()
        raise e

Isotr0py avatar Apr 23 '24 11:04 Isotr0py

I think we can refer to get_config() in transformers_utils/config.py, but search the registered processors first and then fall back to AutoProcessor, so that get_processor() could be:

def get_processor(model: str,
               model_type: str,
               trust_remote_code: bool,
               revision: Optional[str] = None,
               code_revision: Optional[str] = None) -> ProcessorMixin:
    if model_type in _PROCESSOR_REGISTRY:
        processor_class = _PROCESSOR_REGISTRY[model_type]
        processor = processor_class.from_pretrained(model,
                                              revision=revision,
                                              code_revision=code_revision)
        return processor
    try:
        processor = AutoProcessor.from_pretrained(
            model,
            trust_remote_code=trust_remote_code,
            revision=revision,
            code_revision=code_revision)
    except ValueError as e:
        # do something else

To be honest, I'm not a big fan of having to potentially add multiple files in different places* for each new model, but I guess that would work for now. Further down the line, we could consider adopting a more explicit interface for adding new models to vLLM.

*Currently, we have to add a new file in model_executor/models and possibly transformers_utils/configs. After adding multi-modal support, we also have to worry about transformers_utils/processors.

DarkLight1337 avatar Apr 23 '24 13:04 DarkLight1337

Hi guys,

It seems the current prefix caching does not work on multimodal models like LLaVA, because the block hash only takes the previous token IDs into account, and the image patch tokens are always the same even though the corresponding images might differ.

Do you have any ideas about:

  1. How can we make the KV blocks of a multimodal request have correct hash IDs?
  2. How can we reuse KV cache blocks that contain images? An image might span several blocks, but it is impossible to reuse only part of it because the image encoder needs to take the whole image.

Thanks!
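For (1), one direction could be to mix a digest of the raw image content into the block hash, so that identical placeholder tokens backed by different images do not collide. A sketch of the idea only, not existing vLLM code:

import hashlib
from typing import Optional, Sequence


def block_hash(token_ids: Sequence[int],
               prev_block_hash: Optional[bytes],
               image_bytes: Optional[bytes] = None) -> bytes:
    h = hashlib.sha256()
    if prev_block_hash is not None:
        h.update(prev_block_hash)
    h.update(repr(tuple(token_ids)).encode())
    if image_bytes is not None:
        # The placeholder tokens are identical across requests, so the image
        # content itself must contribute to the hash.
        h.update(hashlib.sha256(image_bytes).digest())
    return h.digest()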

Oliver-ss avatar Apr 24 '24 06:04 Oliver-ss

For reference, I have compiled a list of GH issues that are related to this topic (updated periodically):

Multi-modal core:

  • Encoder-decoder support (cross-attention across text only)
    • #5934
      • #9555
      • #13320
  • Image embedding model
    • #8195
    • #10197
    • #13663
  • Beyond text generation
    • #9873
    • #10404
    • #11964
    • #11968
    • #12479
    • #12658

Multi-modal features:

  • Quantization
    • #8463
    • #9324

Multi-modal performance:

  • #9190
  • #9283
  • #9483

Multi-modal models:

  • #3519 & #6265
  • #6805
  • #7863
  • #3356 & #4982
    • #5817
  • #4124 [HF Transformers: Idefics2Model]
    • #4937
  • Moondream
    • #4228
  • #8153 [HF Transformers: VisionEncoderDecoderModel]
  • #8972 & #9638 & #13251 & #13441
    • #11240
  • #9606
  • #9707
  • #11008
  • #11887
  • #12073
  • #12108
  • #12444
  • #13172
  • #13190

Update: To reduce clutter, I will be removing finished items from this list from now onwards.

DarkLight1337 avatar May 09 '24 11:05 DarkLight1337

Hi folks, just stumbled across this issue. It's great to see the proposed steps here as these were all things I ran into when implementing support for our multimodal audio model - in particular, not requiring pre-padding of the input with the special token is a huge simplification when the input (audio in our case) can be of arbitrary size.

The one other thing that might make sense here would be a rename of VisionLanguageConfig to MultimodalLanguageConfig or something similar, rather than having to add separate AudioLanguageConfig and other types in the future throughout all the layers of vLLM. This should be fairly straightforward with your refactoring as I think the only thing remaining in VisionLanguageConfig that we will care about will be the special token id.

juberti avatar May 09 '24 20:05 juberti

The current prompt format "<Image>" * 576 + prompt makes the underlying implementation easier (especially when it comes to profiling), but complicates the user experience compared to the HuggingFace format "<Image>\n" + prompt, and that has caused some confusion about what's needed to make multi-modal models work on vLLM.

While messing with the OpenAI GPT-4V API, I found that this is quite vulnerable to token injection where the user includes <image> tokens in the text prompt. This causes the model to crash due to mismatched shapes when filling in the image token positions with the embeddings.

DarkLight1337 avatar May 16 '24 01:05 DarkLight1337

For folks who came across this RFC, I have been working closely with @DarkLight1337 on several PRs:

  • [x] #4910
  • [x] #4328
  • [x] #4197
  • [x] #5237

The goal is to support end-to-end GPT4V-compatible inference by the upcoming major release/meetup.

  • In parallel, I have also been helping with #4937, and hopefully we can get it merged before the upcoming major release too.

ywang96 avatar May 23 '24 07:05 ywang96

Hi folks, I think when we refactor the code, we should consider not only multi-modal input but also multi-modal output.

The current vLLM seems to only support text output; what about image/audio output?
Take audio output as an example: there is an audio codec decoder (the equivalent of the text tokenizer's decoder) to decode the audio tokens back into an audio signal. How can we fuse the audio codec model into vLLM to provide a seamless user experience?

Think about GPT-4o: can vLLM host a GPT-4o-level model?

nukes avatar May 30 '24 02:05 nukes

Hi folks, I think when we refactor the code, we should consider not only multi-modal input but also multi-modal output.

Hey @nukes! I personally totally agree with this, but for now multi-modal output isn't yet in the scope of the project until there's a reasonable amount of open source model support and interest in it.

As usual, this is an open-source project, so any contribution/suggestion is welcome!

ywang96 avatar May 30 '24 02:05 ywang96

Can vLLM support direct input of inputs_emb now? If so, we can leverage vLLM's inference capabilities with minimal changes to the model inference code. Moreover, since model architectures are diverse, having vLLM support direct input of inputs_emb would greatly enhance its applicability. Otherwise, each new model would require redevelopment, which is time-consuming and labor-intensive.

Can the generate method support embeddings as input? What is the progress on this?

AmazDeng avatar May 30 '24 03:05 AmazDeng