
What's the difference between vllm and triton-inference-server?

Open gesanqiu opened this issue 2 years ago • 5 comments

Can vLLM achieve performance like FasterTransformer on the inference side? Just curious about the detailed optimizations you've done and the goals you want to achieve. BTW, vLLM really accelerates our deployment work, thanks.

gesanqiu · Jun 21 '23 06:06

Thanks for your interest! vLLM is an inference and serving engine/backend like FasterTransformer, but is highly optimized for serving throughput. We provide FastAPI and OpenAI API-compatible servers for convenience, but plan to add an integration layer with serving systems such as NVIDIA Triton and Ray Serve for those who want to scale out the system.
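
For illustration, here is a minimal sketch of the offline generation API that the engine exposes; the model name and sampling values below are placeholders, not something discussed in this thread:

```python
# Minimal vLLM offline-generation sketch; the model and sampling values
# here are illustrative placeholders only.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any causal LM supported by vLLM
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "The capital of France is",
    "Continuous batching improves throughput because",
]

# vLLM batches and schedules these prompts internally for throughput.
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```

The FastAPI and OpenAI-compatible servers mentioned above wrap this same engine behind an HTTP frontend, so existing OpenAI client code can be pointed at it instead.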

WoosukKwon · Jun 21 '23 06:06

Thanks for your response. So can I assume vLLM will serve as a backend in NVIDIA Triton? I'm wondering whether the serving part will overlap with NVIDIA Triton's capabilities.

gesanqiu · Jun 21 '23 07:06

PagedAttention requires batching multiple requests together to achieve high throughput, and we need to keep this batching logic within vLLM. A typical NVIDIA Triton backend does not include such logic; it only handles inference on a single, already-formed batch. From this perspective, vLLM is more than a typical NVIDIA Triton backend.

However, we will mostly focus on building the core LLM serving engine and leave most frontend functionality (e.g., fault tolerance, auto-scaling, additional frontends such as gRPC, ...) to other serving systems (e.g., NVIDIA Triton, Ray Serve, ...). Our focus is on making LLM inference and serving lightning fast and cheap.
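
To illustrate why the batching logic has to live inside the engine rather than in a Triton-style frontend, here is a toy scheduler loop. It is not vLLM's actual code; the class and method names are hypothetical, and the "model" is faked by counting down tokens.

```python
# Illustrative-only sketch: callers submit individual requests; the engine
# decides, at every decoding step, which sequences run together in one batch.
from collections import deque

class ContinuousBatchingEngine:
    def __init__(self, max_batch_size=8):
        self.waiting = deque()   # requests not yet scheduled
        self.running = {}        # request_id -> tokens still to generate
        self.max_batch_size = max_batch_size

    def add_request(self, request_id, num_new_tokens):
        self.waiting.append((request_id, num_new_tokens))

    def step(self):
        # New sequences can join a batch mid-flight, unlike a frontend that
        # forms a fixed batch and runs it to completion.
        while self.waiting and len(self.running) < self.max_batch_size:
            rid, n = self.waiting.popleft()
            self.running[rid] = n

        # One forward pass would generate one token for every running sequence.
        finished = []
        for rid in list(self.running):
            self.running[rid] -= 1
            if self.running[rid] == 0:
                finished.append(rid)
                del self.running[rid]  # freed slot is reused next step
        return finished

engine = ContinuousBatchingEngine(max_batch_size=2)
engine.add_request("a", num_new_tokens=3)
engine.add_request("b", num_new_tokens=1)
engine.add_request("c", num_new_tokens=2)
while engine.running or engine.waiting:
    print("finished this step:", engine.step())
```

A frontend like Triton or Ray Serve would only route individual requests to such an engine; the per-iteration scheduling decisions stay inside it.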

zhuohan123 · Jun 21 '23 15:06

There is dynamic batching in NVIDIA Triton. Is this somehow different from what vLLM does?

eyusupov · Jun 23 '23 04:06

> There is dynamic batching in NVIDIA Triton. Is this somehow different from what vLLM does?

This blog post from Anyscale explains in detail the difference between "dynamic batching" in Triton and "continuous batching" in vLLM. In a nutshell, "dynamic batching" is designed mainly for traditional NNs (e.g., CNNs), where the NN receives fixed-size inputs and the system decides how many inputs to batch for each iteration. "Continuous batching", by contrast, is specifically designed for LLMs and language sequences: it batches individual tokens from different sequences at each iteration, so sequences can join and leave the batch as they start and finish.
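
To make the contrast concrete, here is a toy simulation; the output lengths and batch size are made up purely for illustration, and both functions are deliberate oversimplifications of the real systems:

```python
# Toy comparison: "dynamic batching" keeps a batch fixed until every sequence
# in it finishes, while "continuous batching" replaces finished sequences at
# every decoding iteration. All numbers are illustrative only.

output_lengths = [5, 40, 12, 3, 60, 7, 25, 9]  # tokens each request will generate
BATCH = 4

def dynamic_batching_steps(lengths, batch_size):
    # Group requests in arrival order, then run each batch until its longest
    # sequence finishes; short sequences hold their slots while waiting.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    # Each iteration generates one token per active sequence; as soon as a
    # sequence finishes, a waiting request takes its slot.
    pending = list(lengths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        active = [n - 1 for n in active]
        active = [n for n in active if n > 0]
        steps += 1
    return steps

print("dynamic batching iterations:   ", dynamic_batching_steps(output_lengths, BATCH))
print("continuous batching iterations:", continuous_batching_steps(output_lengths, BATCH))
```

With these made-up lengths, the batch containing the 60-token request holds all of its slots until that request completes under dynamic batching, whereas continuous batching backfills freed slots immediately, so it needs noticeably fewer decoding iterations for the same work.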

zhuohan123 · Jun 23 '23 08:06