Multi-model inference
It would be great if we could load multiple models in the same container and switch between them by model name.
Great suggestion. I have already thought about this, but it makes the server harder to monitor and to measure throughput for, especially via the /ready and /metrics routes.
For now, I would suggest starting two servers at the same time using /bin/sh -c ... logic.
Docker
For Docker, you can do e.g.:
# Dockerfile for multiple models via multiple ports
FROM michaelf34/infinity:latest
ENTRYPOINT ["/bin/sh", "-c", \
"(/opt/poetry/bin/poetry run infinity_emb --port 8080 --model-name-or-path sentence-transformers/all-MiniLM-L6-v2 &);\
(/opt/poetry/bin/poetry run infinity_emb --port 8081 --model-name-or-path intfloat/e5-large-v2 )"]
You can run it via
docker build -t custominfinity . && docker run -it -p 8080:8080 -p 8081:8081 custominfinity
Don't forget to add GPUs, a cache dir, etc.
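For example, extending the run command above (assuming the image keeps its model cache under /app/.cache; adjust the mount path if your image differs):
docker run -it --gpus all -v $PWD/infinity_cache:/app/.cache -p 8080:8080 -p 8081:8081 custominfinity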
FYI, there is now a high-level Python API. You can now build your own FastAPI server according to your needs.
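A minimal sketch of how that looks, assuming the AsyncEmbeddingEngine / EngineArgs interface from the current docs (exact names may differ across versions):
# Sketch: use the high-level Python API instead of the bundled server.
# Assumes AsyncEmbeddingEngine / EngineArgs as exposed in recent releases.
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="sentence-transformers/all-MiniLM-L6-v2", engine="torch")
)

async def main():
    async with engine:  # starts/stops the internal batching loop
        embeddings, usage = await engine.embed(sentences=["Hello world"])
    print(len(embeddings), usage)

asyncio.run(main())
The same engine object can then be called from your own FastAPI route handlers.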
@bacoco Taking suggestions for design patterns for this feature. If there is a good one, or the feature is in high demand, I might implement it. #151
I think launching this from the command line is basically not possible as things stand. It would mean launching multiple models via the typer CLI, where most args are auto-generated.
Do you think a CLI setup from a .yaml file would be a good option? @bacoco
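Purely as a sketch of what I mean (nothing here exists; the file layout and loading code are hypothetical), such a .yaml could map to one EngineArgs per model:
# Hypothetical sketch only: load per-model settings from a YAML file
# into EngineArgs objects. The "models:" layout is an assumption,
# not an existing infinity feature.
import yaml  # PyYAML
from infinity_emb import EngineArgs

# models.yaml (hypothetical):
# models:
#   - model_name_or_path: sentence-transformers/all-MiniLM-L6-v2
#     batch_size: 64
#   - model_name_or_path: intfloat/e5-large-v2
#     batch_size: 16
with open("models.yaml") as f:
    cfg = yaml.safe_load(f)

engine_args = [EngineArgs(**m) for m in cfg["models"]]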
Hey all, it's fully implemented. @bacoco
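For example, with a recent version (assuming the v2 CLI with a repeatable --model-id flag; check the current docs for the exact flags), both models can be served from one container and selected by model name:
infinity_emb v2 --model-id sentence-transformers/all-MiniLM-L6-v2 --model-id intfloat/e5-large-v2 --port 7997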