infinity
Dynamic loading - different models at request time / multiple models
Instead of running one instance per model in the Dockerfile: can a list of models be provided at instantiation, with the model then chosen via the API request? The current API already has `model` as a parameter.
Interesting idea:
- What parameters would you launch the model with (always the same?)
- Would you prefer to launch multiple models at a time?
- How long would you keep a model "active" before "unloading it"?
- What revision?
- What to do if a user requests e.g. an ONNX repo, but the requested model has no ONNX files?
/models -> List all current models {"BAAI/bge":""}
/embedding -> Check if "BAAI/bge" is in the list of models. Do not deploy dynamically.
/rerank
/state/load -> "jinaai/embed-v2" -> add to models, add max dynamic ones to
/state/unload -> Chan
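The endpoint sketch above could be backed by something like the following registry. This is a hypothetical sketch, not the actual infinity implementation: the class name, the `max_dynamic` cap, and the LRU eviction policy are all assumptions about how load/unload might behave.

```python
from collections import OrderedDict


class ModelRegistry:
    """Hypothetical backing store for /models, /state/load, /state/unload.

    Models given at startup are static; dynamically loaded ones are
    capped and evicted LRU-style (an assumed policy)."""

    def __init__(self, static_models, max_dynamic=2):
        self.static = set(static_models)
        self.dynamic = OrderedDict()  # model_id -> engine (stubbed here)
        self.max_dynamic = max_dynamic

    def list_models(self):
        # /models -> all model ids currently served
        return sorted(self.static) + list(self.dynamic)

    def load(self, model_id):
        # /state/load -> add a model; once the cap is hit, evict the
        # least recently used dynamic model instead of growing VRAM use
        if model_id in self.static or model_id in self.dynamic:
            return
        if len(self.dynamic) >= self.max_dynamic:
            self.dynamic.popitem(last=False)  # unload the oldest
        self.dynamic[model_id] = model_id  # placeholder for a real engine

    def unload(self, model_id):
        # /state/unload -> only dynamically loaded models can be removed
        if model_id in self.static:
            raise ValueError(f"{model_id} was loaded at startup")
        self.dynamic.pop(model_id, None)

    def get(self, model_id):
        # /embedding or /rerank -> fail fast if the model is not loaded;
        # never deploy dynamically inside the request path
        if model_id in self.static:
            return model_id
        if model_id in self.dynamic:
            self.dynamic.move_to_end(model_id)  # mark as recently used
            return self.dynamic[model_id]
        raise KeyError(f"{model_id} is not loaded; call /state/load first")
```

For example, with `max_dynamic=1`, loading a second dynamic model would silently evict the first, which illustrates why in-request loading inside /embedding would get messy.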
Idea: Do not add loading inside /embedding -> that would be a huge mess. Drawbacks:
- What happens with unload if there are requests in process?
- It's hard to preserve the state -> this would be STATEFUL -> how to do that in k8s? What happens if you have a load balancer? What about multiple replicas?
Summary: If this comment gets 10 upvotes and no further concerns, I'll build it. It's a heavyweight feature that I would prefer to move into a separate service.
The simpler way would be not to deal with loading and unloading at all: require that all models fit in VRAM, and then select which one to use in the API call.
So basically add multiple models in the cli at startup?
Exactly!
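The stateless variant agreed on above could look roughly like this. A minimal sketch only: the `PRELOADED` dict, its placeholder engines, and the `embed` signature are illustrative assumptions, not infinity's actual API.

```python
# All models are loaded once at startup (e.g. from a repeated CLI flag,
# an assumption about the interface); requests just pick one by name.
# No load/unload endpoints, so no state to replicate across k8s replicas.
PRELOADED = {
    "BAAI/bge": "engine-for-bge",        # placeholders for real engines
    "jinaai/embed-v2": "engine-for-jina",
}


def embed(model: str, texts: list[str]) -> list[tuple[str, str]]:
    """Route an /embedding request to a preloaded model, or fail fast."""
    engine = PRELOADED.get(model)
    if engine is None:
        raise ValueError(
            f"unknown model {model!r}; loaded: {sorted(PRELOADED)}"
        )
    # a real implementation would return engine.embed(texts)
    return [(model, text) for text in texts]
```

Because every replica holds the same fixed set of models, any load balancer placement works, which sidesteps the statefulness concerns raised above.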
This is completed, see #13.