FlagEmbedding
Performance for BGE-M3 inference dropped between 1.2.x and 1.3.x
Comparing the code between 1.2.x and 1.3.x, there is up to a 100% performance regression during inference. The slowdown shows up in repeated calls to `model.encode`: `M3Embedder.encode_single_device` is 2x slower than the original 1.2.x code.
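One way to observe this is to time successive `encode` calls (a minimal sketch using the documented BGE-M3 API; the model name, input, and batch size are just examples):

```python
import time
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
sentences = ["What is BGE M3?"] * 32

# Time successive encode calls; per the report, in 1.3.x every call
# repays a per-call setup cost, so later calls stay slow as well.
for i in range(5):
    start = time.perf_counter()
    model.encode(sentences, batch_size=32)
    print(f"call {i}: {(time.perf_counter() - start) * 1000:.1f} ms")
```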
One obvious suggestion is to remove the following from the `encode_single_device` function:

```python
self.model.to(device)
self.model.eval()
```
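Moving those calls out of the per-call path means repeated encodes no longer pay the device-placement and mode-switch overhead. A minimal sketch of that shape (the `_Embedder` class here is an illustrative stand-in, not FlagEmbedding's actual code):

```python
import torch

class _Embedder:  # illustrative stand-in for M3Embedder
    def __init__(self, model: torch.nn.Module, device: str = "cuda"):
        self.device = device
        self.model = model.to(device)  # move to the device once
        self.model.eval()              # switch to eval mode once

    @torch.no_grad()
    def encode_single_device(self, batch: torch.Tensor) -> torch.Tensor:
        # Hot path: only the inputs move; no per-call .to()/.eval().
        return self.model(batch.to(self.device))
```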
The second observation is that `self.model(...)` is now invoked at least twice instead of once. Is that just to adjust the batch size when an error is raised?
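If that second invocation is a trial forward pass used to shrink the batch size after an out-of-memory error, one mitigation would be to cache the probed value so the extra pass runs only on the first call. A sketch under that assumption (the `pick_batch_size` helper is hypothetical, not part of FlagEmbedding):

```python
from typing import Optional
import torch

_SAFE_BATCH_SIZE: Optional[int] = None  # cached across encode calls

def pick_batch_size(model: torch.nn.Module, probe: torch.Tensor,
                    start: int = 256) -> int:
    # Hypothetical reconstruction of the retry-on-error logic: halve
    # the batch size until one forward pass succeeds, then cache the
    # result so the extra model invocation happens only once.
    global _SAFE_BATCH_SIZE
    if _SAFE_BATCH_SIZE is not None:
        return _SAFE_BATCH_SIZE
    size = start
    while size > 1:
        try:
            with torch.no_grad():
                model(probe[:size])
            break
        except RuntimeError:  # e.g. CUDA out of memory
            torch.cuda.empty_cache()
            size //= 2
    _SAFE_BATCH_SIZE = size
    return size
```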
best, Nikola
I'm experiencing the same thing.
| Version | Average response time | Throughput |
|---------|----------------------|------------|
| v1.2.11 | 71.96 ms | 56.90 requests/sec |
| v1.3.4 | 92.14 ms | 49.05 requests/sec |
(RTX 3060, Code)
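Numbers like these could be produced with a simple loop over single-query encodes (a sketch only; the commenter's actual benchmark is behind the "Code" link and may differ):

```python
import time
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
query = ["What is BGE M3?"]
model.encode(query)  # warm-up call, excluded from timing

n = 200
start = time.perf_counter()
for _ in range(n):
    model.encode(query)
elapsed = time.perf_counter() - start
print(f"Average response time: {elapsed / n * 1000:.2f}ms")
print(f"Throughput: {n / elapsed:.2f} requests/sec")
```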