Support vLLM and deepspeed-fastgen for LLM inference

Open jinchihe opened this issue 2 years ago • 2 comments

/kind feature

Describe the solution you'd like: I suggest supporting vLLM or DeepSpeed-FastGen for LLM inference; both are popular in inference projects right now.

Anything else you would like to add: This would give KServe the fastest LLM inference support.

Jan 07 '24 06:01 jinchihe

Check out https://github.com/kserve/kserve/pull/3334, which provides a high-level abstraction based on Hugging Face (HF). A vLLM backend will also be implemented.

Jan 08 '24 03:01 terrytangyuan

@jinchihe vLLM support is being added through this change: #3415

Feb 08 '24 08:02 gavrissh