[Model] Adding support for MiniCPM-V
Adding support for MiniCPM-V-2; please review. HuggingFace page: https://huggingface.co/openbmb/MiniCPM-V-2
There's an incompatible pip dependency error; my questions are listed as follows:
- MiniCPM-V needs the timm package. Where should I add this dependency requirement? I can see several different requirements files in the root directory of vllm.
- The timm package needs torch==2.1.2, nvidia-nccl-cu12==2.18.1, and triton==2.1.0, but vllm pins torch==2.2.1, nvidia-nccl-cu12==2.19.3, and triton==2.2.0. How can I solve this problem?
This seems to be related to @ywang96's RFC https://github.com/vllm-project/vllm/issues/4194 on multi-modality models.
We can't do anything until timm has the same dependencies as vllm. Alternatively, you can try to remove the timm dependency.
Sorry, we were confused by this situation. Actually, timm only requires torch >= 1.7, and we've added this dependency to requirements-common.txt. Please review. @youkaichao @ywang96
@HwwwwwwwH Thanks for your excellent work. May I ask what is blocking the progress of this PR?
Very sorry for the late reply!! We've been working on the new VLM, MiniCPM-V-2.5, over the last few days.
I've pushed a new commit addressing the reviews. I also see some new VLM-related features; are there any requirements for adapting to them?
Really sorry~
ping @ywang96
We should probably get this merged - I will review it, thanks for the ping @jeejeelee
@HwwwwwwwH Thanks for this great work. By the way, are there any plans to integrate MiniCPM-V-2.5 into vLLM?
We're working on this. It will likely be integrated into the MiniCPMV implementation in vLLM and distinguished via config files.
Is MiniCPM-Llama3-V 2.5 supported? I tried it and it doesn't seem to work. I used https://github.com/OpenBMB/vllm.
Not yet, @DarkLight1337 and I have been working on the refactoring and we're about to merge a pretty big PR that would require slightly more work for developers to add their new models. See #4197
@ywang96 @HwwwwwwwH Sorry to bother you again, but could you please let me know the current progress of this PR?
Sorry for this... I thought I needed to wait for the last commit to be merged, and then I would start a new PR for MiniCPM-Llama3-V 2.5. I don't know if things are as I thought.
@ywang96 Sorry to bother you. Is there anything left I need to do to get MiniCPMV merged?
Can you add a test to check the model's consistency with its HuggingFace counterpart? That would be the most straightforward way to verify the correctness of your implementation.
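Something along these lines would work, roughly mirroring the existing multimodal tests (the `hf_runner`/`vllm_runner` fixtures are the shared helpers from vLLM's test suite; the prompt, image handling, and output comparison below are illustrative and would need adapting):
```python
# Rough sketch of an HF-vs-vLLM consistency test, modeled on the existing
# multimodal model tests. The hf_runner/vllm_runner fixtures are vLLM's shared
# test helpers; the prompt and the comparison below are placeholders.
import pytest

MODEL_NAME = "openbmb/MiniCPM-V-2"


@pytest.mark.parametrize("dtype", ["bfloat16"])
@pytest.mark.parametrize("max_tokens", [128])
def test_minicpmv_matches_hf(hf_runner, vllm_runner, dtype, max_tokens):
    # The real test would pair each prompt with its image input.
    prompts = ["<image>\nWhat is shown in this image?"]

    hf_model = hf_runner(MODEL_NAME, dtype=dtype)
    hf_outputs = hf_model.generate_greedy(prompts, max_tokens)
    del hf_model

    vllm_model = vllm_runner(MODEL_NAME, dtype=dtype)
    vllm_outputs = vllm_model.generate_greedy(prompts, max_tokens)
    del vllm_model

    # With greedy decoding, the two backends should agree (or be nearly
    # identical; see the discussion about bfloat16 differences below).
    for hf_output, vllm_output in zip(hf_outputs, vllm_outputs):
        assert hf_output == vllm_output
```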
Afterwards, please edit `test-pipeline.yaml` and the multimodal tests in the same way as #5189 to include them in the CI.
By the way, there have been a few changes to multimodal models in the past few weeks, so you should merge `main` into your branch first.
Almost forgot - please also update `vllm.multimodal.get_full_image_text_prompt` with your model; otherwise, it won't be available via the OpenAI API.
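From what I can see, the change there is just another branch on the model type, roughly like this (the exact signature of that helper and the prompt format MiniCPM-V expects are assumptions here; please check the real function):
```python
# Hypothetical sketch of extending get_full_image_text_prompt. The llava branch
# is paraphrased from memory; the MiniCPM-V placeholder format is a TODO, not
# the real template.
def get_full_image_text_prompt(image_prompt: str, text_prompt: str,
                               config) -> str:
    """Combine image and text prompts for a vision-language model."""
    model_type = config.hf_config.model_type

    if model_type in ("llava", "llava_next"):
        return f"{image_prompt}\n{text_prompt}"
    if model_type == "minicpmv":
        # TODO: use the placeholder layout / chat template MiniCPM-V expects.
        return f"{image_prompt}\n{text_prompt}"

    raise ValueError(f"Unsupported model type: {model_type}")
```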
Got it! Thank you!
By the way, may I ask one more question? In MiniCPM-V, there's a mechanism that slices the input image. In our previous implementation, we put the slicing inside the model; in other words, there's a calculation to get the number of `<image>` tokens, and we wished it could be done in the model. But we found that the length of `input_ids` cannot be changed during the forward pass in vLLM. So we have to duplicate the slicing code in the example file (`examples/example_minicpmv.py`) to calculate the number of `<image>` tokens, which makes it awkward to use. I've checked the new VLM-support features, but I still don't know how to solve this.
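To make the pain point concrete, this is roughly the shape of the workaround (names and numbers below are illustrative placeholders, not the actual code in `examples/example_minicpmv.py`):
```python
# Illustrative sketch only: the slicing arithmetic has to live outside vLLM so
# that the prompt already contains the right number of placeholder tokens.
# get_slice_count and TOKENS_PER_SLICE are stand-ins for the real MiniCPM-V
# slicing rule, not the actual implementation.
from PIL import Image

IMAGE_TOKEN = "<image>"
TOKENS_PER_SLICE = 64  # placeholder value


def get_slice_count(image: Image.Image) -> int:
    # Stand-in for the real rule, which depends on the image's size/aspect ratio.
    return max(1, min(9, (image.width * image.height) // (448 * 448)))


def build_prompt(question: str, image: Image.Image) -> str:
    # The prompt must be expanded before it reaches the engine, because
    # input_ids cannot grow inside the model's forward pass.
    num_placeholders = get_slice_count(image) * TOKENS_PER_SLICE
    return IMAGE_TOKEN * num_placeholders + "\n" + question
```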
any update?
Sorry, I didn't read your edit. Does your model have a dynamic number of image tokens based on the input size? That is not supported yet. For now, you'll have to assume a fixed size, just like the case for LLaVA-NeXT.
This is exactly what I want; I hope it comes soon. For now, we pre-calculate the number of image tokens in the example file, but that may be hard to adapt for the OpenAI API.
any update?
Sorry... I'll push the modifications as soon as possible.
@DarkLight1337 @ywang96 Here's what we've done over the past few days:
- Merged the new main branch.
- Fixed a bug in MiniCPM-V's image slicing.
- Added a test for MiniCPM-V in tests.
- Verified consistency between vLLM and the HuggingFace version.
- Added MiniCPM-V to `vllm.multimodal.get_full_image_text_prompt`.
- Added MiniCPM-Llama3-V-2_5 [almost done]
There are still some questions that should be mentioned:
- MiniCPM-V slices each image that is sent to it, so it needs a mechanism to change the length of the input tokens. As this is not supported in vLLM for now, we achieve it by calculating the placeholders outside vLLM and slicing the image inside the model (see `examples/minicpmv_example.py`). However, this is difficult and complicated to move into `vllm.multimodal.get_full_image_text_prompt`. Since there's a TODO to move this function into the model registry, can we wait for that to be done?
- There are some small differences between MiniCPM-V and its HF version. We looked into this and found that the differences occur in MiniCPM (not MiniCPM-V), and they come from the MLP/Linear implementations. For example, when computing logits, vLLM multiplies `hidden_states` by the linear layer's weight, while the other calls `forward` directly. The results of these two methods differ by roughly $10^{-3}$ to $10^{-4}$ for `bfloat16` (see the small numerics sketch after this list), which makes the outputs of the two versions differ. To validate the correctness of MiniCPM-V in vLLM, we ran it on some benchmarks and it gets almost the same scores as the original HF version. Therefore, I didn't add the test file (`test/test_minicpmv.py`) to `test-pipeline.yaml`. Please check this and modify it as you see fit.
- The integration of MiniCPM-Llama3-V-2_5 is almost done. Before that, could we merge this into the main branch? It will be easier to update the code that you've already reviewed.
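As a rough, generic illustration of the magnitude involved (plain PyTorch numerics, not the actual vLLM or HF code paths): rounding an lm_head-sized matmul to `bfloat16` already introduces errors at about this scale, so two mathematically equivalent logit computations can easily disagree by $10^{-3}$ to $10^{-4}$.
```python
# Generic bfloat16 numerics demo (not the actual vLLM/HF code): bfloat16 keeps
# ~8 mantissa bits, so an lm_head-sized matmul evaluated in bfloat16 drifts
# from the float32 result at roughly the 1e-3..1e-4 level, which is the same
# order as the divergence reported above.
import torch

torch.manual_seed(0)

hidden = torch.randn(4, 4096)      # stand-in hidden states
weight = torch.randn(32000, 4096)  # stand-in lm_head weight

ref = hidden @ weight.t()  # float32 reference logits

# The same product evaluated two (mathematically equivalent) ways in bfloat16.
out_a = (hidden.bfloat16() @ weight.bfloat16().t()).float()
out_b = torch.nn.functional.linear(hidden.bfloat16(), weight.bfloat16()).float()

norm_err = (out_a - ref).abs().max().item() / ref.abs().max().item()
path_diff = (out_a - out_b).abs().max().item()
print(f"normalized max error vs fp32: {norm_err:.1e}")
print(f"max abs diff between the two bf16 paths: {path_diff:.1e}")
```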
Lastly, it has been a long time since I started this PR. I'm very sorry for the delays I've caused. I hope the remaining work goes smoothly~
Thanks for the update!
MiniCPM-V slices each image that is sent to it, so it needs a mechanism to change the length of the input tokens. As this is not supported in vLLM for now, we achieve it by calculating the placeholders outside vLLM and slicing the image inside the model (see `examples/minicpmv_example.py`). However, this is difficult and complicated to move into `vllm.multimodal.get_full_image_text_prompt`. Since there's a TODO to move this function into the model registry, can we wait for that to be done?
Can you convert this code into a HuggingFace `ImageProcessor`, similar to how it's being handled in LLaVA-NeXT, and register it to `AutoImageProcessor`? This would remove the need to manually preprocess the data beforehand, since the image processor is automatically invoked inside vLLM.
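Roughly what I have in mind (all class and field names here are illustrative, not the actual MiniCPM-V code; `AutoImageProcessor.register` is the real transformers registration hook):
```python
# Sketch: wrap the slicing/preprocessing in a HuggingFace image processor and
# register it so it gets picked up automatically. Class names, fields and the
# config stand-in are illustrative.
from transformers import AutoImageProcessor, BatchFeature, PretrainedConfig
from transformers.image_processing_utils import BaseImageProcessor


class MiniCPMVConfig(PretrainedConfig):  # stand-in for the model's real config class
    model_type = "minicpmv"


class MiniCPMVImageProcessor(BaseImageProcessor):
    model_input_names = ["pixel_values"]

    def __init__(self, max_slice_nums: int = 9, **kwargs):
        super().__init__(**kwargs)
        self.max_slice_nums = max_slice_nums

    def preprocess(self, images, return_tensors=None, **kwargs) -> BatchFeature:
        # The image slicing / resizing / normalization would go here,
        # producing one pixel tensor per slice.
        pixel_values = []  # placeholder
        return BatchFeature(data={"pixel_values": pixel_values},
                            tensor_type=return_tensors)


# Registering against the config class lets AutoImageProcessor.from_pretrained()
# (and hence vLLM) resolve the processor automatically.
AutoImageProcessor.register(MiniCPMVConfig, MiniCPMVImageProcessor)
```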
To avoid having to dynamically compute the number of image tokens, for now you can set `image_feature_size` to be consistent with the image input shape, which is also a fixed value according to the config. This `image_feature_size` determines how many `<image>` tokens should be included in the input.
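For instance, something along these lines (the values are LLaVA-style placeholders rather than MiniCPM-V's real numbers, and the flags reflect the multimodal interface as it was around the time of this thread):
```python
# Rough sketch of the fixed-size approach: image_feature_size fixes how many
# image-placeholder tokens every prompt must contain. All numeric values are
# placeholders, not MiniCPM-V's real configuration.
from vllm import LLM

llm = LLM(
    model="openbmb/MiniCPM-V-2",
    trust_remote_code=True,
    image_input_type="pixel_values",
    image_token_id=101,               # placeholder: id of the <image> token
    image_input_shape="1,3,448,448",  # placeholder: fixed input resolution
    image_feature_size=576,           # placeholder: tokens produced per image
)

# With a fixed image_feature_size, the prompt always contains exactly that many
# image placeholder tokens.
prompt = "<image>" * 576 + "\nWhat is shown in this image?"
```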
There are some small differences between MiniCPM-V and its HF version. We looked into this and found that the differences occur in MiniCPM (not MiniCPM-V), and they come from the MLP/Linear implementations. For example, when computing logits, vLLM multiplies `hidden_states` by the linear layer's weight, while the other calls `forward` directly. The results of these two methods differ by roughly $10^{-3}$ to $10^{-4}$ for `bfloat16`, which makes the outputs of the two versions differ. To validate the correctness of MiniCPM-V in vLLM, we ran it on some benchmarks and it gets almost the same scores as the original HF version. Therefore, I didn't add the test file (`test/test_minicpmv.py`) to `test-pipeline.yaml`. Please check this and modify it as you see fit.
Can you post an example output for the two implementations? If there is only a little divergence between the outputs, you can set `max_tokens` to a smaller value, like in the Phi-3V case, so that the test still passes.
The integration of MiniCPM-Llama3-V-2_5 is almost done. Before that, could we merge this into the main branch? It will be easier to update the code that you've already reviewed.
Let's address the above two concerns first.
To use the HF `ImageProcessor` and `Processor`, I need a few more days to check our repo. (Working on this right now....)
Thanks for your excellent work. May I ask when it will be completed?
#5276 has been merged, so you can start to make the changes needed to support a dynamic number of image tokens. You may refer to this guide for more details.
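For what it's worth, the shape of the change follows the LLaVA-NeXT pattern from that PR; the import paths below reflect that code, while the MiniCPM-V-specific math and token id are assumptions to be checked against the guide:
```python
# Sketch of a per-request input processor, following the pattern used for
# LLaVA-NeXT after #5276. The MiniCPM-V slicing logic and token id below are
# placeholders.
from vllm.inputs import INPUT_REGISTRY, LLMInputs
from vllm.inputs.registry import InputContext

IMAGE_TOKEN_ID = 101  # placeholder for the model's real <image> token id


def input_processor_for_minicpmv(ctx: InputContext, llm_inputs: LLMInputs):
    multi_modal_data = llm_inputs.get("multi_modal_data")
    if multi_modal_data is None:
        return llm_inputs

    # Run MiniCPM-V's slicing rule here to decide how many placeholder tokens
    # this particular image needs (fixed placeholder value below).
    num_image_tokens = 64

    # Repeat the image placeholder in the tokenized prompt accordingly.
    new_token_ids = []
    for token_id in llm_inputs["prompt_token_ids"]:
        if token_id == IMAGE_TOKEN_ID:
            new_token_ids.extend([IMAGE_TOKEN_ID] * num_image_tokens)
        else:
            new_token_ids.append(token_id)

    return LLMInputs(prompt_token_ids=new_token_ids,
                     prompt=llm_inputs.get("prompt"),
                     multi_modal_data=multi_modal_data)


# The processor is then attached to the model class, e.g.:
#
#   @INPUT_REGISTRY.register_input_processor(input_processor_for_minicpmv)
#   class MiniCPMVForCausalLM(nn.Module):
#       ...
```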
To use the HF `ImageProcessor` and `Processor`, I need a few more days to check our repo. (Working on this right now....)
Thanks for your excellent work. May I ask when it will be completed?
Before next Tuesday? I need a weekend to finish this...
Hi, I've completed the `image_processor` for MiniCPMV, and I thought it was invoked in `model_runner._prepare_model_input_tensors` (https://github.com/vllm-project/vllm/blob/main/vllm/worker/model_runner.py#L532).
The new multimodal features also add `register_input_processor` (see https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py#L169 and https://github.com/vllm-project/vllm/blob/main/vllm/inputs/registry.py#L184). What's the relationship and difference between these two processing steps?