[Model] Adding support for MiniCPM-V
Adding support for MiniCPM-V-2; please review. HuggingFace page: https://huggingface.co/openbmb/MiniCPM-V-2
There's an incompatible pip dependency error; my questions are listed as follows:
- MiniCPM-V needs the timm package. Where should I add this dependency requirement? I can see several different requirements files in the root directory of vllm.
- The timm package needs torch==2.1.2, nvidia-nccl-cu12==2.18.1, and triton==2.1.0, but vllm pins torch==2.2.1, nvidia-nccl-cu12==2.19.3, and triton==2.2.0. How can I solve this problem?
This seems to be related to @ywang96's RFC https://github.com/vllm-project/vllm/issues/4194 on multi-modality models.
We can't do anything until timm has the same dependencies as vllm. Alternatively, you can try to remove the timm dependency.
Sorry, we were confused by this situation. Actually, timm only requires torch >= 1.7, and we've added this dependency to requirements-common.txt. Please review. @youkaichao @ywang96
@HwwwwwwwH Thanks for your excellent work. May I ask what is blocking the progress of this PR?
Very sorry for the late reply!! We've been working on the new VLM, MiniCPM-V-2.5, over the last few days.
I've pushed a new commit addressing the reviews. I also see some new VLM-related features; are there any requirements for adapting to them?
Really sorry~
ping @ywang96
We should probably get this merged - I will review it, thanks for the ping @jeejeelee
@HwwwwwwwH Thanks for this great work. By the way, are there any plans to integrate MiniCPM-V-2.5 into vLLM?
We're working on this. It will likely be integrated into the MiniCPMV implementation in vLLM and distinguished via config files.
Is MiniCPM-Llama3-V 2.5 supported? I tried it and it doesn't seem to work. I used https://github.com/OpenBMB/vllm.
Not yet, @DarkLight1337 and I have been working on the refactoring and we're about to merge a pretty big PR that would require slightly more work for developers to add their new models. See #4197
@ywang96 @HwwwwwwwH Sorry to bother you again, but could you please let me know the current progress of this PR?
Sorry for this... I thought I needed to wait for the last commit to be merged, and then I would start a new PR for MiniCPM-Llama3-V 2.5. I don't know if things are as I thought.
@ywang96 Sorry to bother you. Is there anything left I need to do to get MiniCPMV merged?
Can you add a test to check the model's consistency with its HuggingFace counterpart? That would be the most straightforward way to verify the correctness of your implementation.
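Something along these lines would work, roughly mirroring the existing multimodal tests (the `hf_runner`/`vllm_runner` fixtures are the shared helpers from vLLM's test suite; the prompt, image handling, and output comparison below are illustrative and would need adapting):
```python
# Rough sketch of an HF-vs-vLLM consistency test, modeled on the existing
# multimodal model tests. The hf_runner/vllm_runner fixtures are vLLM's shared
# test helpers; the prompt and the comparison below are placeholders.
import pytest

MODEL_NAME = "openbmb/MiniCPM-V-2"


@pytest.mark.parametrize("dtype", ["bfloat16"])
@pytest.mark.parametrize("max_tokens", [128])
def test_minicpmv_matches_hf(hf_runner, vllm_runner, dtype, max_tokens):
    # The real test would pair each prompt with its image input.
    prompts = ["<image>\nWhat is shown in this image?"]

    hf_model = hf_runner(MODEL_NAME, dtype=dtype)
    hf_outputs = hf_model.generate_greedy(prompts, max_tokens)
    del hf_model

    vllm_model = vllm_runner(MODEL_NAME, dtype=dtype)
    vllm_outputs = vllm_model.generate_greedy(prompts, max_tokens)
    del vllm_model

    # With greedy decoding, the two backends should agree (or be nearly
    # identical; see the discussion about bfloat16 differences below).
    for hf_output, vllm_output in zip(hf_outputs, vllm_outputs):
        assert hf_output == vllm_output
```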
Afterwards, please edit `test-pipeline.yaml` and the multimodal tests in the same way as #5189 to include them in the CI.
By the way, there have been a few changes to multimodal models in the past few weeks, so you should merge `main` into your branch first.
Almost forgot - please also update `vllm.multimodal.get_full_image_text_prompt` with your model; otherwise, it won't be available via the OpenAI API.
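From what I can see, the change there is just another branch on the model type, roughly like this (the exact signature of that helper and the prompt format MiniCPM-V expects are assumptions here; please check the real function):
```python
# Hypothetical sketch of extending get_full_image_text_prompt. The llava branch
# is paraphrased from memory; the MiniCPM-V placeholder format is a TODO, not
# the real template.
def get_full_image_text_prompt(image_prompt: str, text_prompt: str,
                               config) -> str:
    """Combine image and text prompts for a vision-language model."""
    model_type = config.hf_config.model_type

    if model_type in ("llava", "llava_next"):
        return f"{image_prompt}\n{text_prompt}"
    if model_type == "minicpmv":
        # TODO: use the placeholder layout / chat template MiniCPM-V expects.
        return f"{image_prompt}\n{text_prompt}"

    raise ValueError(f"Unsupported model type: {model_type}")
```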
Got it! Thank you!
By the way, may I ask one more question? In MiniCPM-V, there's a mechanism that slices the input image. In our previous implementation, we put the slicing inside the model; in other words, there's a calculation to get the number of `<image>` tokens, and we wished it could be done in the model. But we found that the length of `input_ids` cannot be changed during the forward pass in vLLM. So we have to duplicate the slicing code in the example file (`examples/example_minicpmv.py`) to calculate the number of `<image>` tokens, which makes it awkward to use. I've checked the new VLM-support features, but I still don't know how to solve this.
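To make the pain point concrete, this is roughly the shape of the workaround (names and numbers below are illustrative placeholders, not the actual code in `examples/example_minicpmv.py`):
```python
# Illustrative sketch only: the slicing arithmetic has to live outside vLLM so
# that the prompt already contains the right number of placeholder tokens.
# get_slice_count and TOKENS_PER_SLICE are stand-ins for the real MiniCPM-V
# slicing rule, not the actual implementation.
from PIL import Image

IMAGE_TOKEN = "<image>"
TOKENS_PER_SLICE = 64  # placeholder value


def get_slice_count(image: Image.Image) -> int:
    # Stand-in for the real rule, which depends on the image's size/aspect ratio.
    return max(1, min(9, (image.width * image.height) // (448 * 448)))


def build_prompt(question: str, image: Image.Image) -> str:
    # The prompt must be expanded before it reaches the engine, because
    # input_ids cannot grow inside the model's forward pass.
    num_placeholders = get_slice_count(image) * TOKENS_PER_SLICE
    return IMAGE_TOKEN * num_placeholders + "\n" + question
```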
any update?
Sorry, I didn't read your edit. Does your model have a dynamic number of image tokens based on the input size? That is not supported yet. For now, you'll have to assume a fixed size, just like the case for LLaVA-NeXT.
This is exactly what I want; I hope it comes soon. For now, we pre-calculate the number of image tokens in the example file, but that may be hard to adapt for the OpenAI API.
any update?
Sorry... I'll push the modifications as soon as possible.
@DarkLight1337 @ywang96 Here's what we've done over the past few days:
- Merged the new main branch.
- Fixed a bug in MiniCPM-V's image slicing.
- Added a test for MiniCPM-V in tests.
- Verified consistency between vLLM and the HuggingFace version.
- Added MiniCPM-V to `vllm.multimodal.get_full_image_text_prompt`.
- Added MiniCPM-Llama3-V-2_5 [almost done]
There are still some questions that should be mentioned:
- MiniCPM-V slices each image that is sent to it, so it needs a mechanism to change the length of the input tokens. As this is not supported in vLLM for now, we achieve it by calculating the placeholders outside vLLM and slicing the image inside the model (see `examples/minicpmv_example.py`). However, this is difficult and complicated to move into `vllm.multimodal.get_full_image_text_prompt`. Since there's a TODO to move this function into the model registry, can we wait for that to be done?
- There are some small differences between MiniCPM-V and its HF version. We looked into this and found that the differences occur in MiniCPM (not MiniCPM-V), and they come from the MLP/Linear implementations. For example, when computing logits, vLLM multiplies `hidden_states` by the linear layer's weight, while the other calls `forward` directly. The results of these two methods differ by roughly $10^{-3}$ to $10^{-4}$ for `bfloat16` (see the small numerics sketch after this list), which makes the outputs of the two versions differ. To validate the correctness of MiniCPM-V in vLLM, we ran it on some benchmarks and it gets almost the same scores as the original HF version. Therefore, I didn't add the test file (`test/test_minicpmv.py`) to `test-pipeline.yaml`. Please check this and modify it as you see fit.
- The integration of MiniCPM-Llama3-V-2_5 is almost done. Before that, could we merge this into the main branch? It will be easier to update the code that you've already reviewed.
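As a rough, generic illustration of the magnitude involved (plain PyTorch numerics, not the actual vLLM or HF code paths): rounding an lm_head-sized matmul to `bfloat16` already introduces errors at about this scale, so two mathematically equivalent logit computations can easily disagree by $10^{-3}$ to $10^{-4}$.
```python
# Generic bfloat16 numerics demo (not the actual vLLM/HF code): bfloat16 keeps
# ~8 mantissa bits, so an lm_head-sized matmul evaluated in bfloat16 drifts
# from the float32 result at roughly the 1e-3..1e-4 level, which is the same
# order as the divergence reported above.
import torch

torch.manual_seed(0)

hidden = torch.randn(4, 4096)      # stand-in hidden states
weight = torch.randn(32000, 4096)  # stand-in lm_head weight

ref = hidden @ weight.t()  # float32 reference logits

# The same product evaluated two (mathematically equivalent) ways in bfloat16.
out_a = (hidden.bfloat16() @ weight.bfloat16().t()).float()
out_b = torch.nn.functional.linear(hidden.bfloat16(), weight.bfloat16()).float()

norm_err = (out_a - ref).abs().max().item() / ref.abs().max().item()
path_diff = (out_a - out_b).abs().max().item()
print(f"normalized max error vs fp32: {norm_err:.1e}")
print(f"max abs diff between the two bf16 paths: {path_diff:.1e}")
```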
Lastly, it has been a long time since I started this PR. I'm very sorry for the delays I've caused. I hope the remaining work goes smoothly~
Thanks for the update!
MiniCPM-V slices each image that is sent to it, so it needs a mechanism to change the length of the input tokens. As this is not supported in vLLM for now, we achieve it by calculating the placeholders outside vLLM and slicing the image inside the model (see `examples/minicpmv_example.py`). However, this is difficult and complicated to move into `vllm.multimodal.get_full_image_text_prompt`. Since there's a TODO to move this function into the model registry, can we wait for that to be done?
Can you convert this code into a HuggingFace `ImageProcessor`, similar to how it's being handled in LLaVA-NeXT, and register it to `AutoImageProcessor`? This would remove the need to manually preprocess the data beforehand, since the image processor is automatically invoked inside vLLM.
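Roughly what I have in mind (all class and field names here are illustrative, not the actual MiniCPM-V code; `AutoImageProcessor.register` is the real transformers registration hook):
```python
# Sketch: wrap the slicing/preprocessing in a HuggingFace image processor and
# register it so it gets picked up automatically. Class names, fields and the
# config stand-in are illustrative.
from transformers import AutoImageProcessor, BatchFeature, PretrainedConfig
from transformers.image_processing_utils import BaseImageProcessor


class MiniCPMVConfig(PretrainedConfig):  # stand-in for the model's real config class
    model_type = "minicpmv"


class MiniCPMVImageProcessor(BaseImageProcessor):
    model_input_names = ["pixel_values"]

    def __init__(self, max_slice_nums: int = 9, **kwargs):
        super().__init__(**kwargs)
        self.max_slice_nums = max_slice_nums

    def preprocess(self, images, return_tensors=None, **kwargs) -> BatchFeature:
        # The image slicing / resizing / normalization would go here,
        # producing one pixel tensor per slice.
        pixel_values = []  # placeholder
        return BatchFeature(data={"pixel_values": pixel_values},
                            tensor_type=return_tensors)


# Registering against the config class lets AutoImageProcessor.from_pretrained()
# (and hence vLLM) resolve the processor automatically.
AutoImageProcessor.register(MiniCPMVConfig, MiniCPMVImageProcessor)
```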
To avoid having to dynamically compute the number of image tokens, for now you can set `image_feature_size` to be consistent with the image input shape, which is also a fixed value according to the config. This `image_feature_size` determines how many `<image>` tokens should be included in the input.
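For instance, something along these lines (the values are LLaVA-style placeholders rather than MiniCPM-V's real numbers, and the flags reflect the multimodal interface as it was around the time of this thread):
```python
# Rough sketch of the fixed-size approach: image_feature_size fixes how many
# image-placeholder tokens every prompt must contain. All numeric values are
# placeholders, not MiniCPM-V's real configuration.
from vllm import LLM

llm = LLM(
    model="openbmb/MiniCPM-V-2",
    trust_remote_code=True,
    image_input_type="pixel_values",
    image_token_id=101,               # placeholder: id of the <image> token
    image_input_shape="1,3,448,448",  # placeholder: fixed input resolution
    image_feature_size=576,           # placeholder: tokens produced per image
)

# With a fixed image_feature_size, the prompt always contains exactly that many
# image placeholder tokens.
prompt = "<image>" * 576 + "\nWhat is shown in this image?"
```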
There are some small differences between MiniCPM-V and its HF version. We looked into this and found that the differences occur in MiniCPM (not MiniCPM-V), and they come from the MLP/Linear implementations. For example, when computing logits, vLLM multiplies `hidden_states` by the linear layer's weight, while the other calls `forward` directly. The results of these two methods differ by roughly $10^{-3}$ to $10^{-4}$ for `bfloat16`, which makes the outputs of the two versions differ. To validate the correctness of MiniCPM-V in vLLM, we ran it on some benchmarks and it gets almost the same scores as the original HF version. Therefore, I didn't add the test file (`test/test_minicpmv.py`) to `test-pipeline.yaml`. Please check this and modify it as you see fit.
Can you post an example output for the two implementations? If there is only a little divergence between the outputs, you can set `max_tokens` to a smaller value, like in the Phi-3V case, so that the test still passes.
The integration of MiniCPM-Llama3-V-2_5 is almost done. Before that, could we merge this into the main branch? It will be easier to update the code that you've already reviewed.
Let's address the above two concerns first.
To use the HF `ImageProcessor` and `Processor`, I need a few more days to check our repo. (Working on this right now....)
Thanks for your excellent work. May I ask when it will be completed?
#5276 has been merged, so you can start to make the changes needed to support a dynamic number of image tokens. You may refer to this guide for more details.
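For what it's worth, the shape of the change follows the LLaVA-NeXT pattern from that PR; the import paths below reflect that code, while the MiniCPM-V-specific math and token id are assumptions to be checked against the guide:
```python
# Sketch of a per-request input processor, following the pattern used for
# LLaVA-NeXT after #5276. The MiniCPM-V slicing logic and token id below are
# placeholders.
from vllm.inputs import INPUT_REGISTRY, LLMInputs
from vllm.inputs.registry import InputContext

IMAGE_TOKEN_ID = 101  # placeholder for the model's real <image> token id


def input_processor_for_minicpmv(ctx: InputContext, llm_inputs: LLMInputs):
    multi_modal_data = llm_inputs.get("multi_modal_data")
    if multi_modal_data is None:
        return llm_inputs

    # Run MiniCPM-V's slicing rule here to decide how many placeholder tokens
    # this particular image needs (fixed placeholder value below).
    num_image_tokens = 64

    # Repeat the image placeholder in the tokenized prompt accordingly.
    new_token_ids = []
    for token_id in llm_inputs["prompt_token_ids"]:
        if token_id == IMAGE_TOKEN_ID:
            new_token_ids.extend([IMAGE_TOKEN_ID] * num_image_tokens)
        else:
            new_token_ids.append(token_id)

    return LLMInputs(prompt_token_ids=new_token_ids,
                     prompt=llm_inputs.get("prompt"),
                     multi_modal_data=multi_modal_data)


# The processor is then attached to the model class, e.g.:
#
#   @INPUT_REGISTRY.register_input_processor(input_processor_for_minicpmv)
#   class MiniCPMVForCausalLM(nn.Module):
#       ...
```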
To use the HF `ImageProcessor` and `Processor`, I need a few more days to check our repo. (Working on this right now....)
Thanks for your excellent work. May I ask when it will be completed?
Before next Tuesday? I need a weekend to finish this...
Hi, I've completed the `image_processor` for MiniCPMV, and I thought it was invoked in `model_runner._prepare_model_input_tensors` (https://github.com/vllm-project/vllm/blob/main/vllm/worker/model_runner.py#L532).
The new multimodal features also add `register_input_processor` (see https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py#L169 and https://github.com/vllm-project/vllm/blob/main/vllm/inputs/registry.py#L184). What's the relationship and difference between these two processing steps?