Support Multiple Models
- Allow the user to specify multiple models to download when launching the server
- Allow the user to switch between models
- Allow the user to load multiple models on the cluster (nice to have)
For the first and second feature requests, why don't you just kill the old server and start a new one with the new model?
That's what we are doing now, but it takes a long time to download and load large models (33B/65B). For a train-to-deploy pipeline, it is not possible to have zero downtime without multiple servers and a blue/green deployment strategy.
Downloading should only happen the very first time a model is used. However, the loading cost is unavoidable. Are you looking for something that can swap models with zero downtime?
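For reference, a minimal sketch of the blue/green swap idea, done entirely outside vLLM with today's OpenAI-compatible server and its `/health` endpoint; the model name, ports, and the final traffic switch are illustrative assumptions, not a vLLM feature:

```python
# Sketch only: bring up a "green" server with the new model, wait until it is
# healthy, then switch traffic and retire the "blue" one. Assumes vLLM's
# OpenAI-compatible api_server; model name and ports are placeholders.
import subprocess
import time

import requests

OLD_PORT, NEW_PORT = 8000, 8001

green = subprocess.Popen([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "mistralai/Mistral-7B-Instruct-v0.2",  # the new model
    "--port", str(NEW_PORT),
])

# Poll the health endpoint until the new server has finished loading weights.
while True:
    try:
        if requests.get(f"http://localhost:{NEW_PORT}/health", timeout=1).ok:
            break
    except requests.RequestException:
        pass
    time.sleep(5)

# Here a reverse proxy / load balancer would flip traffic from OLD_PORT to
# NEW_PORT; only after that is the old server process stopped. The traffic
# switch itself is outside vLLM and not shown.
```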
I think multi-model support is important for business logic such as ensemble models and LangChain applications. Do you have any ideas I can reference? Then I will try to implement it in vLLM.
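As a rough illustration of the ensemble / LangChain use case with what exists today, assuming two separate vLLM OpenAI-compatible servers are already running on their own ports (the ports, model names, and aggregation are assumptions, not a vLLM API):

```python
# Sketch: fan a prompt out to two independently served models and collect both
# answers for downstream ensembling. Each model is assumed to run behind its
# own vLLM OpenAI-compatible server; everything here is illustrative.
from openai import OpenAI

ENDPOINTS = {
    "mistral-7b": "http://localhost:8000/v1",
    "yi-34b": "http://localhost:8001/v1",
}

def ensemble(prompt: str) -> dict[str, str]:
    answers = {}
    for name, base_url in ENDPOINTS.items():
        client = OpenAI(base_url=base_url, api_key="EMPTY")  # vLLM ignores the key
        resp = client.chat.completions.create(
            model=name,  # must match the model name that server was started with
            messages=[{"role": "user", "content": prompt}],
        )
        answers[name] = resp.choices[0].message.content
    return answers
```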
Is there any progress on this feature?
+1
+1
+1 For enterprises, instead of one monolithic, API-based LLM like GPT-4, the strategy may be a collection of SLMs dedicated/fine-tuned to specific tasks. This is why they will want to serve multiple models simultaneously, e.g. phi-2/Mistral-7B/Yi-34B. Could we have an update on this feature request, please?
+1
+1 How is the progress on this feature?
+1
+1
+1
+1
+1
+1
+1
+1
+1
+1 desperately need multiple models
+1
+1
+1
What is the general thought process or strategy for implementing something like this, if it is open for contributions? Does the vLLM team, or anyone else, have a roadmap or idea for its implementation that someone can pick up?
This would be highly useful. All I can quickly think of is how TensorFlow lets you cap the GPU memory a process uses, so something similar to the gpu_memory_utilization setting here in vLLM, or using subprocesses within Python (although I must admit I do not understand much about this topic in general).
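To make that concrete, here is a minimal sketch that launches two independent vLLM servers as subprocesses, each capped with `--gpu-memory-utilization` so they can share one GPU; the model names, ports, and memory fractions are assumptions, and whether both fit depends on the hardware:

```python
# Sketch only: run two OpenAI-compatible vLLM servers side by side, each
# limited to a fraction of GPU memory. Model names, ports, and fractions are
# illustrative.
import subprocess

SERVERS = [
    ("mistralai/Mistral-7B-Instruct-v0.2", 8000, 0.45),
    ("microsoft/phi-2", 8001, 0.45),
]

procs = []
for model, port, mem_fraction in SERVERS:
    procs.append(subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(mem_fraction),
    ]))

for p in procs:
    p.wait()  # keep both servers running until interrupted
```

The Docker Compose workaround mentioned below follows the same pattern, with each container getting its own port and memory budget.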
+1
One workaround that vLLM already allows is Docker containers: use Docker Compose to serve multiple models, so that two or more models can be served at a time, depending on your memory capacity.
+1
+1