
Why are two batching_task required?

Open sulude opened this issue 1 year ago • 1 comment

Feature request

In a concurrent scenario, I tried reducing to a single batching_task, so that each embed call gets a larger batch size and inference performance improves. In the single-concurrency scenario, performance does not decrease.

Motivation

Improves inference performance in concurrent scenarios.

Your contribution

Use only one batching_task instead of two.

sulude avatar Apr 28 '24 01:04 sulude

[image attachment]

sulude avatar Apr 28 '24 01:04 sulude

We use two batching tasks to prefetch. This could be removed by allowing the backend to move the tensors to the device asynchronously, but this is a simple workaround.

OlivierDehaene avatar Jun 17 '24 14:06 OlivierDehaene
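
To illustrate the prefetching idea being discussed, here is a minimal, hypothetical Rust sketch (not the actual text-embeddings-inference code): two batching tasks pull requests from a shared queue, so that while one task's batch is running on the device, the other task can already be assembling and transferring the next batch. The queue, `MAX_BATCH_SIZE`, and the sleep calls standing in for "transfer" and "inference" are all illustrative assumptions.

```rust
// Hypothetical sketch: two batching tasks overlapping batch assembly/transfer
// with inference. With a single task, batch N+1 could not be prepared until
// inference on batch N had finished.
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

const MAX_BATCH_SIZE: usize = 4; // illustrative value

fn main() {
    // Shared request queue; requests are just ids here.
    let queue = Arc::new((Mutex::new(VecDeque::<u32>::new()), Condvar::new()));

    // Spawn two batching tasks (the "two batching_task" in the discussion).
    for task_id in 0..2 {
        let queue = Arc::clone(&queue);
        thread::spawn(move || loop {
            // Wait until at least one request is available.
            let (lock, cvar) = &*queue;
            let mut q = lock.lock().unwrap();
            while q.is_empty() {
                q = cvar.wait(q).unwrap();
            }
            // Drain up to MAX_BATCH_SIZE requests into one batch.
            let batch: Vec<u32> = (0..MAX_BATCH_SIZE)
                .map_while(|_| q.pop_front())
                .collect();
            drop(q); // release the lock before the slow work

            // Stand-ins for "move tensors to device" and "run inference".
            println!("task {task_id}: transferring batch {batch:?}");
            thread::sleep(Duration::from_millis(20));
            println!("task {task_id}: inferring batch {batch:?}");
            thread::sleep(Duration::from_millis(50));
        });
    }

    // Producer: enqueue some requests, then let the workers drain them.
    let (lock, cvar) = &*queue;
    for i in 0..16 {
        lock.lock().unwrap().push_back(i);
        cvar.notify_one();
        thread::sleep(Duration::from_millis(5));
    }
    thread::sleep(Duration::from_secs(2));
}
```

Under this reading, dropping to a single batching task (as the issue proposes) would serialize transfer and inference but let each batch grow larger, while keeping two tasks trades some batch size for overlap between the two phases.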