[RFC] Support different inference engines like vLLM, SGLang, TensorRT-LLM
Summary
We aim to enhance the serving control plane by supporting multiple inference engines, including vLLM, SGLang, and TensorRT-LLM. These engines have distinct performance optimizations and use cases, and integrating them will allow users to leverage the strengths of each engine based on their specific needs. This feature is crucial for a diverse and scalable LLM infrastructure that caters to both low-latency and high-throughput requirements.
By supporting multiple engines, the control plane will provide flexibility for users to choose the best engine for their workload, optimizing performance, cost, and scalability. It will also increase the robustness of the system by allowing seamless integration of different engine capabilities under a unified serving framework.
Motivation
No single inference engine can cover all use cases, and we have already run into several issues in practice.
Proposed Change
Proposed Solution
- Implement engine-agnostic APIs within the control plane to abstract the underlying inference engine (a minimal sketch follows this list).
- Provide a flexible engine selection mechanism in the control plane configuration, allowing users to switch between vLLM, SGLang, TensorRT-LLM, or other engines as needed.
- Ensure the control plane (AI runtime) handles engine-specific resource requirements, inference optimizations, and result consistency across engines.
- Develop a plugin-based architecture that allows easy integration of new engines in the future.
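To make the proposal concrete, here is a rough Go sketch of how the engine-agnostic API, the configuration-driven engine selection, and the plugin registry could fit together. All names here (`InferenceEngine`, `Register`, `New`, and the request/response types) are hypothetical illustrations of the shape of the abstraction, not the actual AIBrix API.

```go
// Package engine is a hypothetical sketch of an engine-agnostic
// abstraction over backends such as vLLM, SGLang, or TensorRT-LLM.
package engine

import (
	"context"
	"fmt"
)

// InferenceRequest and InferenceResponse are simplified placeholders
// for whatever request/response schema the control plane settles on.
type InferenceRequest struct {
	Model  string
	Prompt string
}

type InferenceResponse struct {
	Text string
}

// InferenceEngine abstracts a single inference backend.
type InferenceEngine interface {
	// Name returns the engine identifier used in configuration, e.g. "vllm".
	Name() string
	// Infer forwards a request to the underlying engine.
	Infer(ctx context.Context, req InferenceRequest) (InferenceResponse, error)
}

// registry maps engine names to constructors, so new engines can be
// added as plugins without touching the selection logic.
var registry = map[string]func() InferenceEngine{}

// Register is called by each engine plugin (typically from init)
// to make itself selectable by name.
func Register(name string, factory func() InferenceEngine) {
	registry[name] = factory
}

// New resolves the engine named in the control plane configuration
// (e.g. engine: "sglang") to a concrete backend.
func New(name string) (InferenceEngine, error) {
	factory, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown inference engine %q", name)
	}
	return factory(), nil
}
```

Under this shape, supporting a new engine reduces to implementing the interface and calling `Register` from the plugin's `init`, while the configured engine name selects the backend at startup.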
Alternatives Considered
No response
This is a pretty cool feature. Let's discuss!
@xieus @Jeffwan Hi 👋, I hope to support SGLang through LWS: https://github.com/vllm-project/aibrix/issues/843. If no one else has done this, please assign the task to me.
@Belyenochi We built StormService to support advanced orchestration scenarios and to provide production-grade rolling-upgrade and rollout strategies. We plan not to use LWS for such cases. If you are interested in the new solution, I can help give some context.