
Support Multiple Models

Open aldrinc opened this issue 1 year ago • 22 comments

  • Allow user to specify multiple models to download when loading server
  • Allow user to switch between models
  • Allow user to load multiple models on the cluster (nice to have)

aldrinc avatar Jun 28 '23 19:06 aldrinc

For the first and second feature requests, why don't you just kill the old server and start a new one with the new model?

zhuohan123 avatar Jun 28 '23 20:06 zhuohan123

That's what we are doing now, but it takes a long time to download and load large models (33B/65B). For a train-to-deploy pipeline, it isn't possible to have zero downtime without multiple servers and a blue-green deployment strategy.

aldrinc avatar Jun 28 '23 21:06 aldrinc

Downloading should only happen the very first time a model is used. However, the loading cost is unavoidable. Are you looking for something that can swap the model with zero downtime?

zhuohan123 avatar Jun 28 '23 21:06 zhuohan123
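For reference, here is a minimal sketch of the blue-green workaround discussed above, assuming two OpenAI-compatible vLLM servers are already running on ports 8001 and 8002; the ports, route names, and the `/admin/switch` endpoint are illustrative, not part of vLLM:

```python
# Minimal blue-green swap proxy (sketch). Assumes two vLLM OpenAI-compatible
# servers on ports 8001/8002; all names and ports here are hypothetical.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import httpx

app = FastAPI()
backends = {"blue": "http://localhost:8001", "green": "http://localhost:8002"}
active = "blue"  # traffic goes here; load the new model on the idle server

@app.post("/admin/switch")
async def switch():
    """Flip traffic to the other backend once its model has finished loading."""
    global active
    active = "green" if active == "blue" else "blue"
    return {"active": active}

@app.post("/v1/chat/completions")
async def chat(request: Request):
    """Forward the request body to the currently active vLLM server.
    (Streaming responses are not handled in this sketch.)"""
    body = await request.json()
    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(f"{backends[active]}/v1/chat/completions", json=body)
    return JSONResponse(resp.json(), status_code=resp.status_code)
```

Run the proxy with `uvicorn proxy:app --port 8000`, load the new model on the idle server, then call `/admin/switch` to flip traffic without dropping requests.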

I think multi-model support is important for business logic like ensemble models and LangChain applications. Do you have any ideas I can reference? Then I will try to implement it in vLLM.

gesanqiu avatar Jun 30 '23 12:06 gesanqiu

Is there any progress on this feature?

ft-algo avatar Sep 27 '23 13:09 ft-algo

+1

shixianc avatar Oct 16 '23 18:10 shixianc

+1

Ki6an avatar Oct 26 '23 11:10 Ki6an

+1 For enterprises, instead of one monolithic, API-based LLM like GPT-4, the strategy may be a collection of SLMs dedicated/fine-tuned to specific tasks. This is why they will want to serve multiple models simultaneously, e.g. Phi-2 / Mistral-7B / Yi-34B. Could we have an update on this feature request, please?

corticalstack avatar Jan 25 '24 12:01 corticalstack
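For reference, the "collection of SLMs" pattern can be approximated today by running one single-model vLLM server per model and dispatching on the client side. A minimal sketch using the OpenAI-compatible API; the model names, ports, and the task-to-model table are illustrative assumptions:

```python
# Client-side dispatch across several single-model vLLM servers (sketch).
# Model names, ports, and the task->model table are illustrative only.
from openai import OpenAI

SERVERS = {
    "microsoft/phi-2": "http://localhost:8001/v1",
    "mistralai/Mistral-7B-Instruct-v0.2": "http://localhost:8002/v1",
}
TASK_TO_MODEL = {
    "classification": "microsoft/phi-2",
    "summarization": "mistralai/Mistral-7B-Instruct-v0.2",
}

def run(task: str, prompt: str) -> str:
    """Route a request to the vLLM server that hosts the model for this task."""
    model = TASK_TO_MODEL[task]
    client = OpenAI(base_url=SERVERS[model], api_key="EMPTY")  # vLLM ignores the key
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(run("summarization", "Summarize: vLLM is a fast LLM inference engine."))
```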

+1

capybarahero avatar Feb 28 '24 17:02 capybarahero

+1 How is this feature progressing?

chanchimin avatar May 01 '24 09:05 chanchimin

+1

lenartgolob avatar May 15 '24 11:05 lenartgolob

+1

Shamepoo avatar Jun 05 '24 03:06 Shamepoo

+1

mjtechguy avatar Jun 06 '24 16:06 mjtechguy

+1

jvlinsta avatar Jun 18 '24 11:06 jvlinsta

+1

mohittalele avatar Jun 24 '24 07:06 mohittalele

+1

ptrmayer avatar Jun 24 '24 08:06 ptrmayer

++++1

tarson96 avatar Jun 24 '24 19:06 tarson96

+1

ptrmayer avatar Jun 25 '24 05:06 ptrmayer

+1

Luffyzm3D2Y avatar Jun 27 '24 09:06 Luffyzm3D2Y

+1 desperately need multiple models

srzer avatar Jul 02 '24 23:07 srzer

+1

lizhipengpeng avatar Jul 09 '24 06:07 lizhipengpeng

+1

amitm02 avatar Jul 09 '24 07:07 amitm02

+1

naturomics avatar Jul 16 '24 02:07 naturomics

What is the general thought process or strategy for implementing something like this, if it is open for contributions? Does the vLLM team (or anyone) have a roadmap or an idea of its implementation that someone could pick up?

This would be highly useful. All I can quickly think of is how TensorFlow lets you cap the memory a process uses (similar to the GPU utilization setting here in vLLM), or using subprocesses within Python (although I must admit I do not understand much about this topic in general).

servient-ashwin avatar Jul 18 '24 17:07 servient-ashwin
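For reference, the subprocess idea above works today: start one OpenAI-compatible API server per model and split the GPU with `--gpu-memory-utilization`. A minimal sketch; the models, ports, and the 0.45/0.45 split are illustrative assumptions, and both models must fit on the GPU together:

```python
# Launch two single-model vLLM API servers on one GPU by splitting memory (sketch).
# Models, ports, and the 0.45/0.45 split are illustrative assumptions.
import subprocess
import sys

MODELS = [
    ("microsoft/phi-2", 8001),
    ("mistralai/Mistral-7B-Instruct-v0.2", 8002),
]

procs = []
for model, port in MODELS:
    procs.append(subprocess.Popen([
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--gpu-memory-utilization", "0.45",  # leave headroom for the other server
    ]))

for p in procs:
    p.wait()  # block until the servers exit; terminate the processes to stop them
```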

+1

zr-idol avatar Aug 07 '24 04:08 zr-idol

One workaround that vLLM already enables is using Docker containers. Use Docker Compose to serve multiple models; that way two or more models can be served at a time, depending on your memory capacity.

Sk4467 avatar Aug 07 '24 07:08 Sk4467

+++1

ws-chenc avatar Aug 09 '24 03:08 ws-chenc

+1+1

chen-j-ing avatar Aug 09 '24 03:08 chen-j-ing