text-embeddings-inference
Why are two batching_task required?
Feature request
In a concurrent scenario, I tried reducing to a single batching_task: the batch size of each embed call becomes larger, so inference performance is better. In the single-concurrency scenario, performance does not decrease.
Motivation
Improves inference performance in concurrent scenarios.
Your contribution
Only one batching_task is required.
We use two batching tasks to prefetch. This could be removed by allowing the backend to move the tensors to the device asynchronously, but this is a simple workaround.
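To illustrate the trade-off being discussed, here is a minimal conceptual sketch (not the actual text-embeddings-inference code) of why two batching tasks act as a prefetch: while one task waits on inference for its batch, the other can already assemble the next batch and copy it toward the device. With a single task, batches are larger but the device idles during batch assembly and the copy. All names (`batching_task`, `fake_h2d_copy`, `fake_inference`) and the timings are hypothetical stand-ins.

```python
import asyncio

async def fake_h2d_copy(batch):
    # Stand-in for moving an assembled batch's tensors to the device.
    await asyncio.sleep(0.02)
    return batch

async def fake_inference(batch):
    # Stand-in for running the embedding model on a device-resident batch.
    await asyncio.sleep(0.05)
    return [f"embedding_for_{item}" for item in batch]

async def batching_task(queue, results):
    # One batching task: pull requests, greedily assemble a batch,
    # copy it, run inference. With two such tasks on the same queue,
    # one can assemble/copy the next batch while the other is still
    # awaiting inference (the prefetch effect). With one task, each
    # batch is bigger but assembly/copy and inference never overlap.
    while True:
        first = await queue.get()
        if first is None:           # shutdown sentinel
            queue.put_nowait(None)  # let any other task see it too
            return
        batch = [first]
        while not queue.empty() and len(batch) < 32:
            item = queue.get_nowait()
            if item is None:
                queue.put_nowait(None)
                break
            batch.append(item)
        device_batch = await fake_h2d_copy(batch)
        results.extend(await fake_inference(device_batch))

async def main(num_tasks):
    queue, results = asyncio.Queue(), []
    workers = [asyncio.create_task(batching_task(queue, results))
               for _ in range(num_tasks)]
    for i in range(200):
        await queue.put(f"request-{i}")
        await asyncio.sleep(0.001)  # simulate concurrent clients trickling in
    await queue.put(None)
    await asyncio.gather(*workers)
    print(f"{num_tasks} batching task(s): {len(results)} embeddings")

if __name__ == "__main__":
    asyncio.run(main(2))  # compare against asyncio.run(main(1))
```

If the backend could move tensors to the device asynchronously (as suggested above), a single batching task could achieve the same overlap while keeping the larger batch sizes the issue asks for.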