Support Multiple Models
- Allow the user to specify multiple models to download when launching the server
- Allow the user to switch between models
- Allow the user to load multiple models on the cluster (nice to have)
For the first and second feature requests, why don't you just kill the old server and start a new one with the new model?
That's what we are doing now, but it takes a long time to download and load large models (33B/65B). For a train-to-deploy pipeline, it is not possible to have zero downtime without multiple servers and a blue/green deployment strategy.
Downloading should only happen the very first time a model is used. However, the loading cost is unavoidable. Are you looking for something that can swap models with zero downtime?
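For reference, a minimal sketch of the blue/green swap idea, done entirely outside vLLM with today's OpenAI-compatible server and its `/health` endpoint; the model name, ports, and the final traffic switch are illustrative assumptions, not a vLLM feature:

```python
# Sketch only: bring up a "green" server with the new model, wait until it is
# healthy, then switch traffic and retire the "blue" one. Assumes vLLM's
# OpenAI-compatible api_server; model name and ports are placeholders.
import subprocess
import time

import requests

OLD_PORT, NEW_PORT = 8000, 8001

green = subprocess.Popen([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "mistralai/Mistral-7B-Instruct-v0.2",  # the new model
    "--port", str(NEW_PORT),
])

# Poll the health endpoint until the new server has finished loading weights.
while True:
    try:
        if requests.get(f"http://localhost:{NEW_PORT}/health", timeout=1).ok:
            break
    except requests.RequestException:
        pass
    time.sleep(5)

# Here a reverse proxy / load balancer would flip traffic from OLD_PORT to
# NEW_PORT; only after that is the old server process stopped. The traffic
# switch itself is outside vLLM and not shown.
```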
I think multi-model support is important for business logic such as ensemble models and LangChain applications. Do you have any ideas I can reference? Then I will try to implement it in vLLM.
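As a rough illustration of the ensemble / LangChain use case with what exists today, assuming two separate vLLM OpenAI-compatible servers are already running on their own ports (the ports, model names, and aggregation are assumptions, not a vLLM API):

```python
# Sketch: fan a prompt out to two independently served models and collect both
# answers for downstream ensembling. Each model is assumed to run behind its
# own vLLM OpenAI-compatible server; everything here is illustrative.
from openai import OpenAI

ENDPOINTS = {
    "mistral-7b": "http://localhost:8000/v1",
    "yi-34b": "http://localhost:8001/v1",
}

def ensemble(prompt: str) -> dict[str, str]:
    answers = {}
    for name, base_url in ENDPOINTS.items():
        client = OpenAI(base_url=base_url, api_key="EMPTY")  # vLLM ignores the key
        resp = client.chat.completions.create(
            model=name,  # must match the model name that server was started with
            messages=[{"role": "user", "content": prompt}],
        )
        answers[name] = resp.choices[0].message.content
    return answers
```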
Is there any progress on this feature?
+1
+1
+1 For enterprises, instead of one monolithic, API-based LLM like GPT-4, the strategy may be a collection of SLMs dedicated/fine-tuned to specific tasks. This is why they will want to serve multiple models simultaneously, e.g. phi-2/Mistral-7B/Yi-34B. Could we have an update on this feature request, please?
+1
+1 How is the progress on this feature?
+1
+1
+1
+1
+1
+1
+1
+1
+1
+1 desperately need multiple models
+1
+1
+1
What is the general thought process or strategy for implementing something like this, if it is open for contributions? Does the vLLM team, or anyone else, have a roadmap or idea for its implementation that someone can pick up?
This would be highly useful. All I can quickly think of is how TensorFlow lets you cap the GPU memory a process uses, so something similar to the gpu_memory_utilization setting here in vLLM, or using subprocesses within Python (although I must admit I do not understand much about this topic in general).
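To make that concrete, here is a minimal sketch that launches two independent vLLM servers as subprocesses, each capped with `--gpu-memory-utilization` so they can share one GPU; the model names, ports, and memory fractions are assumptions, and whether both fit depends on the hardware:

```python
# Sketch only: run two OpenAI-compatible vLLM servers side by side, each
# limited to a fraction of GPU memory. Model names, ports, and fractions are
# illustrative.
import subprocess

SERVERS = [
    ("mistralai/Mistral-7B-Instruct-v0.2", 8000, 0.45),
    ("microsoft/phi-2", 8001, 0.45),
]

procs = []
for model, port, mem_fraction in SERVERS:
    procs.append(subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(mem_fraction),
    ]))

for p in procs:
    p.wait()  # keep both servers running until interrupted
```

The Docker Compose workaround mentioned below follows the same pattern, with each container getting its own port and memory budget.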
+1
One workaround that vLLM already allows is Docker containers: use Docker Compose to serve multiple models, so that two or more models can be served at a time, depending on your memory capacity.
+1
+1