[Discussion] Distribute multiple models across different MLServer instances
Hi! I would like to open a discussion to gather some advice, in case anyone has had similar experiences or has better ideas than mine.
I am currently facing a use case where we need to deploy multiple computer vision models that perform different tasks for different types of products. For example, we will run a different anomaly detection model for each side of a product (front, back, left, right, top, bottom).
The idea is that we will develop an API gateway which will receive the image from the client, together with some parameters such as the side (e.g. front) and the type of product under inspection. The API gateway will then retrieve a model ID according to the parameters received: for example, if we receive front as the side and product_a as the product type, we will look up the ID of the model that has to be used for that side/product combination. The gateway will then forward the inference request to a custom MLServer instance, which will run the inference and return the result to the gateway, which in turn returns it to the client.
I have successfully implemented a simple POC locally, where the API gateway is a FastAPI application and the model server is based on MLServer. Currently, the model server simply loads all the models it finds locally. However, the problem is that in production we are going to have a lot of such models, which cannot all be hosted and loaded into the same service. The ideal solution would be to have multiple replicas of the MLServer-based model server and to distribute the models across these replicas. Of course, this probably has to be done in Kubernetes or similar.
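For reference, the routing logic in my POC looks roughly like the sketch below (the lookup table, model names and the MLServer address are placeholders, and the way the image is encoded is just one option that my custom runtime happens to expect):

```python
import base64

import httpx
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()

# Hypothetical lookup table: (product, side) -> name of the model loaded in MLServer
MODEL_LOOKUP = {
    ("product_a", "front"): "product_a_front_anomaly",
    ("product_a", "back"): "product_a_back_anomaly",
}

MLSERVER_URL = "http://localhost:8080"  # placeholder address of the single MLServer instance


@app.post("/inspect")
async def inspect(product: str, side: str, image: UploadFile):
    model_id = MODEL_LOOKUP.get((product, side))
    if model_id is None:
        raise HTTPException(status_code=404, detail="No model configured for this product/side")

    # Build an Open Inference Protocol (V2) request: the image is sent as a
    # base64-encoded BYTES tensor and the custom runtime decodes it on the other side
    payload = {
        "inputs": [
            {
                "name": "image",
                "shape": [1],
                "datatype": "BYTES",
                "data": [base64.b64encode(await image.read()).decode("ascii")],
            }
        ]
    }
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{MLSERVER_URL}/v2/models/{model_id}/infer", json=payload)
    resp.raise_for_status()
    return resp.json()
```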
Perhaps I should post this question on KServe's GitHub instead, in which case I apologize; still, I would like to ask if anyone has ever faced a similar issue. Do you know if there's a way to distribute different models across different instances of an MLServer model server according to some kind of configuration?
Thanks a lot in advance!
Multiple MLServer replicas (with multiple models loaded on each replica) can be orchestrated in many ways, but Seldon proposes Seldon Core as the "canonical" option (with Core and MLServer being co-developed and maintained by Seldon). For production use-cases, Seldon Core is typically deployed inside a Kubernetes cluster but Core itself can also run outside Kubernetes. See the docs here for more info.
The overall approach taken in Core is this:
- You upload the models that you need to deploy to cloud buckets
- In the Kubernetes cluster where Core is installed, you define a "Model" CR for each model you would like to deploy, with a field of the CR pointing to the cloud bucket path where that model resides (example manifests are sketched after this list)
- Within the same cluster you create a number of inference servers (MLServer or NVIDIA Triton), each potentially with many replicas, with sufficient resources allocated to be able to serve all your models. Each inference server replica (a Kubernetes pod) can serve multiple models simultaneously, just as a local MLServer instance would.
- With the above defined, the Core scheduler will automatically take care of matching the specified Models to the available inference servers, downloading models from the cloud buckets into local inference server storage, and dynamically loading/unloading the models within those servers.
- Each model may be configured with multiple replicas (each replica gets loaded onto a different inference server replica) so that you can elastically scale each model based on load.
- You can deploy new versions of models as needed (by modifying the Model CR to point to a new bucket URL), and the Core scheduler will take care of rolling out the new model versions onto the inference servers, without downtime.
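To make that concrete, a Model CR and a Server CR in Core v2 look roughly like the sketch below (bucket path, names, requirements and replica counts are purely illustrative -- please check the CRD reference in the docs for the exact fields):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: product-a-front-anomaly
spec:
  # Cloud bucket path where the model artefacts (and its model-settings.json) live
  storageUri: "gs://my-models-bucket/anomaly/product_a/front"
  # Capabilities a server must advertise in order to host this model
  requirements:
  - sklearn
  # Number of inference server replicas this model gets loaded onto
  replicas: 2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-pool
spec:
  # Base server configuration shipped with Core (MLServer in this case)
  serverConfig: mlserver
  replicas: 3
```

The scheduler then places each Model onto Server replicas whose capabilities match the Model's requirements.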
As you briefly mention, KServe is also an option, and you'll get an unbiased opinion from other commenters here (and you can obviously also post the question directly to the KServe repository). In comparing the two, you should consider, amongst other things:
- licensing requirements depending on the type of use (Seldon Core is released under the Business Source License, BSL)
- resource needs of the Kubernetes cluster (KServe typically deploys one container per model, and you would need to use a ModelMesh deployment to get a higher density of models per unit of available resources -- this also limits the way elastic scaling works; in comparison, Core will by default deploy multiple models within the same MLServer container and allows those models to be scaled independently across multiple inference server replicas)
- whether you will ever want to build complex processing pipelines involving multiple models (i.e. for each inference request, you want to move data across multiple ML models deployed independently), with Core offering built-in support for that and more flexibility with respect to what your pipelines can do and how resilient they are to component failures (a minimal Pipeline manifest is sketched below) -- here, what works best will entirely depend on what you want to achieve
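For completeness, a two-step pipeline in Core v2 is declared roughly as below (the step names refer to Model CRs that are already deployed; again, treat this as a sketch rather than a copy-paste example):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: inspect-product-a-front
spec:
  steps:
    - name: preprocess                 # e.g. a model that crops/normalises the raw image
    - name: product-a-front-anomaly
      inputs:
      - preprocess                     # feed the preprocessing output into the anomaly model
  output:
    steps:
    - product-a-front-anomaly          # the pipeline returns this step's output
```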