text-embeddings-inference
Feature Request: Multi-GPU inference or the ability to choose a GPU at startup
Feature request
Hello,
Thank you for releasing this inference server!
I have two requests, either of which would solve my specific problem:
- Ability to specify which GPU to use when starting the TEI server
- Alternatively, the ability to use all/N GPUs, with TEI load balancing traffic across them
Motivation
Currently, TEI only supports running inference on a single GPU. The advice I found in another issue here was to spin up multiple Docker containers and assign each one its own GPU.
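For concreteness, a minimal sketch of that workaround on a host with two GPUs, using Docker's `--gpus` device selection (the image tag and model id below are illustrative placeholders):

```bash
# Pin one TEI container to each GPU and expose them on separate host ports.
# Image tag and model id are placeholders; substitute your own.
docker run -d --gpus device=0 -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-large-en-v1.5

docker run -d --gpus device=1 -p 8081:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-large-en-v1.5
```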
In some environments, such as P2P GPU services (e.g. Vast.ai), the compute resource is itself a Docker container without access to the host, so I'm unable to spin up multiple containers to make use of multiple GPUs.
When I start multiple instances of TEI inside that container, they all use the first GPU. Adding a CLI argument to specify the GPU id/index would solve this issue.
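A possible interim workaround (untested here, and assuming TEI's CUDA backend honors the standard `CUDA_VISIBLE_DEVICES` device mask) would be to pin each instance to a different device at launch:

```bash
# Hypothetical workaround: mask the visible devices per process so each
# TEI instance only sees (and therefore uses) a single GPU.
# The model id is a placeholder; --port is TEI's existing flag.
CUDA_VISIBLE_DEVICES=0 text-embeddings-router \
  --model-id BAAI/bge-large-en-v1.5 --port 8080 &
CUDA_VISIBLE_DEVICES=1 text-embeddings-router \
  --model-id BAAI/bge-large-en-v1.5 --port 8081 &
```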
An alternative would be a CLI flag to use all/N GPUs, with TEI itself handling load balancing among them.
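Until something like that exists, the requested behavior can only be approximated from the client side. A crude sketch that alternates requests between the two per-GPU instances above (`/embed` is TEI's existing endpoint; the ports are the placeholders from the earlier sketches):

```bash
# Hand-rolled round-robin across the two per-GPU instances.
for i in 1 2 3 4; do
  port=$((8080 + i % 2))
  curl -s "http://127.0.0.1:${port}/embed" \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "hello world"}'
  echo
done
```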
Your contribution
Moral support