Can vLLM serve clients using multiple model instances?
Based on the examples, vLLM launches a server with a single model instance. Can vLLM serve clients using multiple model instances? With multiple model instances, the server could dispatch requests to different instances to reduce the overhead.
Right now vLLM is a serving engine for a single model. You can start multiple vLLM server replicas and use a custom load balancer (e.g., an nginx load balancer). Also feel free to check out FastChat and other multi-model frontends (e.g., aviary). vLLM can be a model worker for these libraries to support multi-replica serving.
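For illustration, here is a minimal nginx load-balancing sketch, assuming two independent vLLM replicas are already running locally on ports 8000 and 8001 (the ports, listen address, and model name are placeholders, not anything vLLM requires):

```nginx
# Assumes two vLLM replicas were started separately, e.g.:
#   python -m vllm.entrypoints.api_server --model facebook/opt-125m --port 8000
#   python -m vllm.entrypoints.api_server --model facebook/opt-125m --port 8001

events {}

http {
    upstream vllm_replicas {
        # nginx distributes requests round-robin across the replicas by default
        server 127.0.0.1:8000;
        server 127.0.0.1:8001;
    }

    server {
        listen 8080;

        location / {
            # Forward each client request to one of the vLLM replicas
            proxy_pass http://vllm_replicas;
        }
    }
}
```

Clients then send requests to port 8080 and nginx spreads them across the replicas; each replica is a completely separate vLLM process with its own copy of the model weights.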