Can vLLM serve clients using multiple model instances?
Based on the examples, vLLM launches a server with a single model instance. Can vLLM serve clients using multiple model instances? With multiple model instances, the server could dispatch requests to different instances to reduce the overhead.
Right now vLLM is a serving engine for a single model. You can start multiple vLLM server replicas and use a custom load balancer (e.g., an nginx load balancer). Also feel free to check out FastChat and other multi-model frontends (e.g., aviary). vLLM can be a model worker for these libraries to support multi-replica serving.
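For illustration, here is a minimal nginx load-balancing sketch, assuming two independent vLLM replicas are already running locally on ports 8000 and 8001 (the ports, listen address, and model name are placeholders, not anything vLLM requires):

```nginx
# Assumes two vLLM replicas were started separately, e.g.:
#   python -m vllm.entrypoints.api_server --model facebook/opt-125m --port 8000
#   python -m vllm.entrypoints.api_server --model facebook/opt-125m --port 8001

events {}

http {
    upstream vllm_replicas {
        # nginx distributes requests round-robin across the replicas by default
        server 127.0.0.1:8000;
        server 127.0.0.1:8001;
    }

    server {
        listen 8080;

        location / {
            # Forward each client request to one of the vLLM replicas
            proxy_pass http://vllm_replicas;
        }
    }
}
```

Clients then send requests to port 8080 and nginx spreads them across the replicas; each replica is a completely separate vLLM process with its own copy of the model weights.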