infinity
Dynamic loading - different models at request time / multiple models
Instead of running one instance per model in the Dockerfile: can a list of models be provided at instantiation, with the model then chosen via the API request? The current API already has `model` as a parameter.
Interesting idea:
- What parameters would you launch the model with (always the same?)
- Would you prefer to launch multiple models at a time?
- How long would you keep a model "active" before "unloading it"?
- What revision?
- What to do if a user requests e.g. an ONNX repo, but the requested model has no ONNX files?
/models -> List all current models {"BAAI/bge":""}
/embedding -> Check if "BAAI/bge" is in the list of models. Do not deploy dynamically.
/rerank
/state/load -> "jinaai/embed-v2" -> add to models, add max dynamic ones to
/state/unload -> Chan
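The endpoint sketch above could be backed by something like the following registry. This is a hypothetical sketch, not the actual infinity implementation: the class name, the `max_dynamic` cap, and the LRU eviction policy are all assumptions about how load/unload might behave.

```python
from collections import OrderedDict


class ModelRegistry:
    """Hypothetical backing store for /models, /state/load, /state/unload.

    Models given at startup are static; dynamically loaded ones are
    capped and evicted LRU-style (an assumed policy)."""

    def __init__(self, static_models, max_dynamic=2):
        self.static = set(static_models)
        self.dynamic = OrderedDict()  # model_id -> engine (stubbed here)
        self.max_dynamic = max_dynamic

    def list_models(self):
        # /models -> all model ids currently served
        return sorted(self.static) + list(self.dynamic)

    def load(self, model_id):
        # /state/load -> add a model; once the cap is hit, evict the
        # least recently used dynamic model instead of growing VRAM use
        if model_id in self.static or model_id in self.dynamic:
            return
        if len(self.dynamic) >= self.max_dynamic:
            self.dynamic.popitem(last=False)  # unload the oldest
        self.dynamic[model_id] = model_id  # placeholder for a real engine

    def unload(self, model_id):
        # /state/unload -> only dynamically loaded models can be removed
        if model_id in self.static:
            raise ValueError(f"{model_id} was loaded at startup")
        self.dynamic.pop(model_id, None)

    def get(self, model_id):
        # /embedding or /rerank -> fail fast if the model is not loaded;
        # never deploy dynamically inside the request path
        if model_id in self.static:
            return model_id
        if model_id in self.dynamic:
            self.dynamic.move_to_end(model_id)  # mark as recently used
            return self.dynamic[model_id]
        raise KeyError(f"{model_id} is not loaded; call /state/load first")
```

For example, with `max_dynamic=1`, loading a second dynamic model would silently evict the first, which illustrates why in-request loading inside /embedding would get messy.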
Idea: Do not add loading inside /embedding -> that would be a huge mess. Drawbacks:
- What happens with unload if there are requests in process?
- It's hard to preserve the state -> this would be STATEFUL -> how to do that in k8s? What happens if you have a load balancer? What about multiple replicas?
Summary: If this comment gets 10 upvotes and no further concerns, I'll build it. It's a heavyweight feature that I would prefer to move into a separate service.
The simpler way would be not to deal with loading and unloading at all: require that all models fit in VRAM, and then select which one to use in the API call.
So basically add multiple models in the cli at startup?
Exactly!
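The stateless variant agreed on above could look roughly like this. A minimal sketch only: the `PRELOADED` dict, its placeholder engines, and the `embed` signature are illustrative assumptions, not infinity's actual API.

```python
# All models are loaded once at startup (e.g. from a repeated CLI flag,
# an assumption about the interface); requests just pick one by name.
# No load/unload endpoints, so no state to replicate across k8s replicas.
PRELOADED = {
    "BAAI/bge": "engine-for-bge",        # placeholders for real engines
    "jinaai/embed-v2": "engine-for-jina",
}


def embed(model: str, texts: list[str]) -> list[tuple[str, str]]:
    """Route an /embedding request to a preloaded model, or fail fast."""
    engine = PRELOADED.get(model)
    if engine is None:
        raise ValueError(
            f"unknown model {model!r}; loaded: {sorted(PRELOADED)}"
        )
    # a real implementation would return engine.embed(texts)
    return [(model, text) for text in texts]
```

Because every replica holds the same fixed set of models, any load balancer placement works, which sidesteps the statefulness concerns raised above.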
This is completed, see #13.