modelmesh-serving
Specify minimum number of copies for InferenceService
Is your feature request related to a problem? If so, please describe.
We have 3 GPU instances and lots of models. A few of them are frequently used and need low latency. Currently, when requests come in, ModelMesh takes about 10 seconds to scale a model up to 3 copies, and the latency during that window is poor. We want these models to always have 3 copies so that latency stays low. The rest of the models can keep 1 copy as usual.
In the future, we also want to scale up to many more GPUs. Many models will need to have multiple copies always available to handle the high load.
Describe your proposed solution
Could you add an option `minimumCopies` to InferenceService? It would set the minimum number of copies of a model, so ModelMesh would never scale the model down to fewer than `minimumCopies` copies. A sketch of how this might look is shown below.
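A minimal sketch of how the proposed option could appear on an InferenceService, assuming it is exposed as a field on the predictor's model spec. The `minimumCopies` field is the hypothetical option requested in this issue (it is not part of the current API), and the name and storage URI are placeholders; the apiVersion, kind, and ModelMesh deployment-mode annotation follow the existing conventions.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: hot-model                     # placeholder name
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/hot-model   # placeholder location
      # Hypothetical field proposed in this issue (not in the current API):
      # keep at least 3 copies of this model loaded at all times.
      minimumCopies: 3
```

Whether this is better expressed as a spec field or as an annotation, and how it should interact with ModelMesh's existing scale-up heuristics, is left open here.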
Describe alternatives you have considered
Additional context