
Performance for BGE-M3 inference dropped between 1.2.x and 1.3.x

Open ivlcic opened this issue 1 year ago • 1 comment

Comparing the code from 1.2.x and 1.3.x, there is a performance regression of up to 100% during inference. The performance degrades in subsequent calls to model.encode; M3Embedder.encode_single_device is 2x slower than the original 1.2.x code.

One obvious suggestion is to remove the following from the encode_single_device function:

self.model.to(device)
self.model.eval()
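The effect of the suggestion above can be sketched without torch at all. The point is that device placement and eval-mode switching are one-time setup, so they belong in the constructor rather than in the per-call encode path. All names below (DummyModel, Embedder) are hypothetical stand-ins, not the actual FlagEmbedding classes:

```python
class DummyModel:
    """Stand-in for self.model; counts how often the expensive setup runs."""
    def __init__(self):
        self.to_calls = 0
        self.eval_calls = 0

    def to(self, device):   # costly on a real model (weight transfer)
        self.to_calls += 1
        return self

    def eval(self):
        self.eval_calls += 1
        return self


class Embedder:
    def __init__(self, device="cpu"):
        self.model = DummyModel()
        # One-time setup lives here, not in encode_single_device:
        self.model.to(device)
        self.model.eval()

    def encode_single_device(self, sentences):
        # No self.model.to(device) / self.model.eval() on the hot path,
        # so repeated encode calls stay cheap.
        return [len(s) for s in sentences]  # placeholder for real inference


emb = Embedder()
for _ in range(10):
    emb.encode_single_device(["hello", "world"])
```

After ten encode calls the setup has still run exactly once; with the 1.3.x placement it would run ten times.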

The second observation is that

self.model(...

is now invoked at least twice instead of once, apparently just to adjust the batch size when an error occurs?
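The pattern behind this, as I understand it, is an out-of-memory backoff: the forward pass is attempted, and on failure the batch size is halved and the whole pass re-run, so a single encode call can invoke the model several times. A minimal sketch of that pattern (hypothetical helper, not the actual FlagEmbedding code; MemoryError stands in for a CUDA OOM error):

```python
def encode_with_backoff(forward, sentences, batch_size):
    """Retry `forward` with a halved batch size whenever it raises MemoryError."""
    while batch_size >= 1:
        try:
            out = []
            for i in range(0, len(sentences), batch_size):
                out.extend(forward(sentences[i:i + batch_size]))
            return out, batch_size
        except MemoryError:
            batch_size //= 2  # halve and retry: the extra model invocations
    raise MemoryError("even batch_size=1 does not fit")


calls = []

def fake_forward(batch):
    calls.append(len(batch))
    if len(batch) > 2:  # pretend batches larger than 2 do not fit in memory
        raise MemoryError
    return [0.0] * len(batch)


result, final_bs = encode_with_backoff(fake_forward, list(range(8)), 8)
# fake_forward runs with batch sizes 8 (OOM), 4 (OOM), then 2, 2, 2, 2
```

This is reasonable as a safety net, but when no OOM occurs it should cost nothing; paying repeated forward passes on every call would explain part of the slowdown.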

best, Nikola

ivlcic avatar Dec 27 '24 16:12 ivlcic

I'm experiencing the same thing.

v1.2.11 Average response time: 71.96ms Throughput: 56.90 requests/sec

v1.3.4 Average response time: 92.14ms Throughput: 49.05 requests/sec

(RTX 3060, Code)
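For anyone wanting to reproduce numbers in this form, a minimal timing harness along these lines is enough (hypothetical sketch; `encode_fn` stands in for `model.encode`, and the lambda below is a trivial placeholder workload):

```python
import time

def benchmark(encode_fn, requests, n=50):
    """Time n sequential calls; return (average latency in ms, requests/sec)."""
    start = time.perf_counter()
    for _ in range(n):
        encode_fn(requests)
    elapsed = time.perf_counter() - start
    avg_ms = elapsed / n * 1000
    throughput = n / elapsed
    return avg_ms, throughput


avg_ms, rps = benchmark(lambda batch: [len(s) for s in batch],
                        ["a sample sentence"] * 4)
print(f"Average response time: {avg_ms:.2f}ms Throughput: {rps:.2f} requests/sec")
```

Running the same harness against v1.2.11 and v1.3.4 with identical inputs makes the regression directly comparable.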

neuwcodebox avatar Apr 02 '25 11:04 neuwcodebox