
How to deploy InternVL?

Iven2132 opened this issue 1 year ago · 8 comments

Great work building InternVL! I'm looking to deploy its inference as an endpoint, and I wonder if anyone could help me with that. vLLM and TGI don't support it. What's your suggestion? Please let me know.

Iven2132 avatar Apr 27 '24 05:04 Iven2132

Check https://github.com/InternLM/lmdeploy/pull/1490

utkarsh995 avatar Apr 30 '24 11:04 utkarsh995


Check https://github.com/InternLM/lmdeploy/issues/1495#:~:text=Also%2C%20I%20got,the%20chat%20template.

Iven2132 avatar Apr 30 '24 11:04 Iven2132

LMDeploy v0.4.1 can help deploy InternVL. Here is a guide: https://github.com/OpenGVLab/InternVL/pull/152
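
For reference, a minimal sketch of running it with the lmdeploy pipeline (the model ID and image URL below are illustrative; see the guide above for the exact, up-to-date steps):

```python
# Minimal sketch: offline inference with the lmdeploy VLM pipeline.
# For an OpenAI-compatible endpoint, `lmdeploy serve api_server <model>`
# serves the same model over HTTP.
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5')

# load_image accepts a local path or a URL
image = load_image('https://example.com/sample.jpg')
response = pipe(('describe this image', image))
print(response.text)
```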

lvhan028 avatar May 07 '24 13:05 lvhan028


Hi @lvhan028, can you give us some tips on how to use it on V100 GPUs?

BIGBALLON avatar May 07 '24 14:05 BIGBALLON

When deploying VLMs on GPUs with limited memory, such as the V100 and T4, it is typically essential to use multiple GPUs because of the large size of the model.

Although LMDeploy provides robust support for running the LLM part of a VLM on multiple GPUs, it places the entire vision part on GPU 0 by default. This can leave insufficient memory for the LLM part on GPU 0, potentially impacting the model's functionality.

So, to deploy VLMs on GPUs with constrained memory capacities, we need a way to split the vision model into smaller parts and dispatch them across multiple GPUs.

We are working on this feature and will release it by the end of this month. Stay tuned.
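
In the meantime, the LLM part can already be spread across GPUs with tensor parallelism via the engine config. A rough sketch (the model ID is illustrative; as described above, the vision part still lands entirely on GPU 0 for now):

```python
# Sketch: tp=2 shards the LLM weights across two GPUs.
# The vision encoder is still placed on GPU 0 by default, so GPU 0
# needs headroom for both the vision part and its LLM shard.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'OpenGVLab/InternVL-Chat-V1-5',
    backend_config=TurbomindEngineConfig(tp=2),
)
```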

lvhan028 avatar May 07 '24 15:05 lvhan028

@lvhan028 If we only consider model inference, which one is faster, LMDeploy or swift?

BIGBALLON avatar May 08 '24 08:05 BIGBALLON

I didn't find an inference performance benchmark in the swift repo.

If we only consider the LLM part, LMDeploy can achieve 25 RPS on the ShareGPT dataset, which is nearly 2x faster than vLLM.

But LMDeploy hasn't optimized the inference of the vision model. Optimizing the vision model is beyond the scope of LMDeploy, and we don't have plans to do that.

lvhan028 avatar May 08 '24 09:05 lvhan028


So, is it possible to use a single V100 (32GB) GPU to deploy InternVL-Chat-V1-5-Int8? If so, how should we set tp and the k/v cache?
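
For concreteness, would it be something along these lines? (A rough sketch; the cache_max_entry_count value is just a guess to shrink the k/v cache, and I'm not sure whether the Int8 weights are supported by the default backend.)

```python
# Sketch, not tested: single GPU (tp=1) with a smaller k/v cache ratio
# to leave more memory for the model weights.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    'OpenGVLab/InternVL-Chat-V1-5-Int8',
    backend_config=TurbomindEngineConfig(
        tp=1,
        cache_max_entry_count=0.3,  # fraction of GPU memory reserved for the k/v cache
    ),
)
```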

BIGBALLON avatar May 09 '24 05:05 BIGBALLON