
Are there plans to support DeepSpeed-Inference?

Open jinchihe opened this issue 2 years ago • 3 comments

/kind feature

Describe the solution you'd like

DeepSpeed-Inference introduces several features to efficiently serve transformer-based PyTorch models. It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory. Even for smaller models, MP can be used to reduce inference latency.

Personally, I think it would be great to support DeepSpeed-Inference, or to build examples/flows for it in kserve. Comments?

Anything else you would like to add:

jinchihe avatar Jul 10 '23 09:07 jinchihe

@jinchihe you can run DeepSpeed inference with TorchServe on KServe: https://pytorch.org/serve/large_model_inference.html#deepspeed. @cmaddalozzo is also working on native support on the KServe side.
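For anyone following the TorchServe route above: per the linked large-model-inference doc, DeepSpeed is enabled through a `model-config.yaml` packaged into the model archive. A minimal sketch is below — the file names, GPU count, and handler details are hypothetical, and the exact field schema should be checked against the linked doc for your TorchServe version:

```yaml
# model-config.yaml, packaged with torch-model-archiver (sketch, not authoritative)
parallelType: "custom"      # let the handler manage model parallelism
deviceType: "gpu"
torchrun:
  nproc-per-node: 2         # one process per GPU used for model parallelism
deepspeed:
  config: ds-config.json    # DeepSpeed inference config (dtype, tensor parallel size, ...)
  checkpoint: checkpoints.json  # optional checkpoint description file
```

The archive produced this way can then be served through KServe's TorchServe (pytorch) predictor.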

yuzisun avatar Jul 11 '23 13:07 yuzisun

Wow! Great news! Thanks @yuzisun @cmaddalozzo

jinchihe avatar Jul 16 '23 00:07 jinchihe

@yuzisun May I ask if there is a complete example of deploying a DeepSpeed model as an inference service using kserve?

FlyAIBox avatar Dec 26 '23 09:12 FlyAIBox