modelmesh-serving
Specify minimum number of copies for InferenceService
Is your feature request related to a problem? If so, please describe.
We have 3 GPU instances and lots of models. A few of them are frequently used and need low latency. Currently, when requests come in, ModelMesh takes about 10 seconds to scale a model up to 3 copies, and the latency during that window is poor. We want these models to always have 3 copies so that latency stays low. The rest of the models can keep 1 copy as usual.
In the future, we also want to scale up to many more GPUs. Many models will need to have multiple copies always available to handle the high load.
Describe your proposed solution
Could you add an option `minimumCopies` to InferenceService? It would set the minimum number of copies of a model, so ModelMesh would never scale the model down to fewer than `minimumCopies` copies. A sketch of how this might look is shown below.
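A minimal sketch of how the proposed option could appear on an InferenceService, assuming it is exposed as a field on the predictor's model spec. The `minimumCopies` field is the hypothetical option requested in this issue (it is not part of the current API), and the name and storage URI are placeholders; the apiVersion, kind, and ModelMesh deployment-mode annotation follow the existing conventions.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: hot-model                     # placeholder name
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/hot-model   # placeholder location
      # Hypothetical field proposed in this issue (not in the current API):
      # keep at least 3 copies of this model loaded at all times.
      minimumCopies: 3
```

Whether this is better expressed as a spec field or as an annotation, and how it should interact with ModelMesh's existing scale-up heuristics, is left open here.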
Describe alternatives you have considered
Additional context