
Lazy loading of models

Open adriangonz opened this issue 4 years ago • 5 comments

Currently, models in MLServer can only be fully loaded or unloaded. That is, they are either fully loaded in memory, or MLServer knows nothing about them. To save resource usage (mainly memory), it could be interesting to also allow models to be "known" to MLServer (i.e. to have their ModelSettings config registered) but not yet loaded in memory. In other words, only partly loaded. Upon the first incoming request, MLServer would then finalise loading the model by calling the model.load() method.

This would be equivalent to "lazy-loading" models, which would also be necessary so that we can partly unload them on demand (covered in #329).
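For illustration, here is a minimal sketch of what that could look like on top of MLServer's `MLModel` interface. This is not an existing MLServer feature; the `LazyModel` wrapper, its locking, and the wrapped `inner` model are hypothetical:

```python
import asyncio

from mlserver import MLModel
from mlserver.settings import ModelSettings
from mlserver.types import InferenceRequest, InferenceResponse


class LazyModel(MLModel):
    """Hypothetical wrapper: the model is "known" (its ModelSettings are
    registered) but the expensive load() is deferred until the first
    inference request arrives."""

    def __init__(self, settings: ModelSettings, inner: MLModel):
        super().__init__(settings)
        self._inner = inner
        self._loaded = False
        self._lock = asyncio.Lock()

    async def load(self) -> bool:
        # Intentionally a no-op at startup: MLServer registers the model
        # without pulling its weights into memory.
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        if not self._loaded:
            async with self._lock:
                # Double-checked so that concurrent first requests trigger
                # only a single underlying load.
                if not self._loaded:
                    await self._inner.load()
                    self._loaded = True
        return await self._inner.predict(payload)
```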

adriangonz avatar Oct 07 '21 08:10 adriangonz

Hi @adriangonz! Is it the role of MLServer to manage (lazy load or unload) models? Even though we can unload models as explained in the "Loading / unloading models from a model repository" section, it seems MLServer launches an instance for each model (or group of models, in the multi-model serving case). Could tools like KServe be a better choice for loading/unloading models? It is difficult to understand the scope of MLServer and when to use another tool instead.

indyMccarthy avatar Nov 26 '21 15:11 indyMccarthy

Hey @indyMccarthy ,

MLServer is one of the core inference servers used by KServe. That is, part of the load / unload functionality exposed by KServe is implemented at the MLServer level, with the other part being extra logic added by KServe at the "infra" / K8s level (e.g. load balancing across multiple nodes, unloading of unused models, etc.).

Having said that, MLServer is also used outside of KServe (e.g. in Seldon Core, locally or deployed and managed separately). Therefore, features like lazy loading can also be interesting for these other use cases.

Does that help to clarify the role of MLServer vs KServe?

adriangonz avatar Nov 26 '21 15:11 adriangonz

To put it another way, MLServer deals with issues at the "model serving" and "inference" level, whereas KServe provides the glue with K8s as well as other features at the "infra" level.

adriangonz avatar Nov 26 '21 15:11 adriangonz

@adriangonz Thank you for your answer 🙂! I thought KServe was more than just the "glue" and had a role in the serving part.

However, as I wondered previously, do we have to launch an MLServer instance for each model in order to serve it? (I'm not sure about that because of this doc --> ​https://github.com/SeldonIO/MLServer/blob/master/docs/examples/mms/README.md)

Or can a single MLServer instance be used for multiple use cases, model instances, or model versions? Besides, I did not find a way to upgrade a model from a V1 to a V2 without restarting the MLServer instance (maybe lazy loading is exactly what is needed for that? Or do you have to load V2 and then unload V1, as explained here https://mlserver.readthedocs.io/en/latest/examples/model-repository/README.html, with HTTP POST methods?)
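For reference, the load / unload flow in that linked example comes down to two HTTP POST calls against MLServer's V2 repository endpoints. A minimal sketch, assuming a local instance on the default HTTP port 8080 and hypothetical model names `my-model-v1` / `my-model-v2`:

```python
import requests

BASE = "http://localhost:8080"  # assumed local MLServer instance

# Load the new version first, so there is no gap in serving...
requests.post(f"{BASE}/v2/repository/models/my-model-v2/load").raise_for_status()

# ...then unload the old one.
requests.post(f"{BASE}/v2/repository/models/my-model-v1/unload").raise_for_status()
```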

This issue makes me suppose that we want a single instance dealing with many use cases and models. But at the same time, with a single instance we have a SPOF. So I suppose we want multiple MLServer instances dealing with multiple use cases.

So, for example, say you have multiple MLServer instances dealing with multiple models/use cases, running on Kubernetes with KServe. Each one will lazily load (or unload) independently, so how can we be sure that all instances load the new model? Moreover, I suppose we can't use things like blue/green deployment at the model level, only at the MLServer instance level (which deals with other models that we don't want to change).

So, 1 MLServer instance for 1 model (ultimately a simple worker that we can also run alone locally) makes more sense to me in a scalable environment, and it easily deals with problems like synchronization. Are lazy loading and multi-model serving specific to local/dev usage?

EDIT: I just read the following page https://kserve.github.io/website/modelserving/mms/multi-model-serving/ so I now understand why the 'one model, one server' paradigm causes issues.

indyMccarthy avatar Nov 27 '21 12:11 indyMccarthy

Hey @indyMccarthy ,

That's a great summary of the different design points that have to be considered when designing a serving architecture.

Regarding those, MLServer tries to be as unopinionated as possible and instead just be a tool which can be used in multiple ways. For example, KServe leverages MLServer as a multi-model server, handling an extra level of abstraction on top of Kubernetes. Whereas someone else deploying MLServer directly (or embedding it within a different serving solution) could prefer to follow the 1-model / 1-server paradigm. Both should be valid approaches.

As the page you link at the end of your comment shows, none of these design points seems to have a "silver bullet" solution; instead, each one poses its own set of trade-offs.

adriangonz avatar Nov 29 '21 11:11 adriangonz

For the medium term, we think lazy loading is better handled at the orchestrator layer (e.g. Seldon Core V2, KServe, etc.). Therefore, we won't be tackling this one anytime soon.

adriangonz avatar Mar 24 '23 16:03 adriangonz