FlagEmbedding
Performance for BGE-M3 inference dropped between 1.2.x and 1.3.x
Comparing the code between 1.2.x and 1.3.x, there is up to a 100% performance regression during inference. The slowdown shows up in repeated calls to `model.encode`: `M3Embedder.encode_single_device` is 2x slower than the original 1.2.x code.
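One way to observe this is to time successive `encode` calls (a minimal sketch using the documented BGE-M3 API; the model name, input, and batch size are just examples):

```python
import time
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
sentences = ["What is BGE M3?"] * 32

# Time successive encode calls; per the report, in 1.3.x every call
# repays a per-call setup cost, so later calls stay slow as well.
for i in range(5):
    start = time.perf_counter()
    model.encode(sentences, batch_size=32)
    print(f"call {i}: {(time.perf_counter() - start) * 1000:.1f} ms")
```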
One obvious suggestion is to remove the following from the `encode_single_device` function:

```python
self.model.to(device)
self.model.eval()
```
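Moving those calls out of the per-call path means repeated encodes no longer pay the device-placement and mode-switch overhead. A minimal sketch of that shape (the `_Embedder` class here is an illustrative stand-in, not FlagEmbedding's actual code):

```python
import torch

class _Embedder:  # illustrative stand-in for M3Embedder
    def __init__(self, model: torch.nn.Module, device: str = "cuda"):
        self.device = device
        self.model = model.to(device)  # move to the device once
        self.model.eval()              # switch to eval mode once

    @torch.no_grad()
    def encode_single_device(self, batch: torch.Tensor) -> torch.Tensor:
        # Hot path: only the inputs move; no per-call .to()/.eval().
        return self.model(batch.to(self.device))
```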
The second observation is that `self.model(...)` is now invoked at least twice instead of once. Is that just to adjust the batch size when an error is raised?
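If that second invocation is a trial forward pass used to shrink the batch size after an out-of-memory error, one mitigation would be to cache the probed value so the extra pass runs only on the first call. A sketch under that assumption (the `pick_batch_size` helper is hypothetical, not part of FlagEmbedding):

```python
from typing import Optional
import torch

_SAFE_BATCH_SIZE: Optional[int] = None  # cached across encode calls

def pick_batch_size(model: torch.nn.Module, probe: torch.Tensor,
                    start: int = 256) -> int:
    # Hypothetical reconstruction of the retry-on-error logic: halve
    # the batch size until one forward pass succeeds, then cache the
    # result so the extra model invocation happens only once.
    global _SAFE_BATCH_SIZE
    if _SAFE_BATCH_SIZE is not None:
        return _SAFE_BATCH_SIZE
    size = start
    while size > 1:
        try:
            with torch.no_grad():
                model(probe[:size])
            break
        except RuntimeError:  # e.g. CUDA out of memory
            torch.cuda.empty_cache()
            size //= 2
    _SAFE_BATCH_SIZE = size
    return size
```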
best, Nikola
I'm experiencing the same thing.
| Version | Average response time | Throughput |
|---------|----------------------|------------|
| v1.2.11 | 71.96 ms | 56.90 requests/sec |
| v1.3.4 | 92.14 ms | 49.05 requests/sec |
(RTX 3060, Code)
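Numbers like these could be produced with a simple loop over single-query encodes (a sketch only; the commenter's actual benchmark is behind the "Code" link and may differ):

```python
import time
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
query = ["What is BGE M3?"]
model.encode(query)  # warm-up call, excluded from timing

n = 200
start = time.perf_counter()
for _ in range(n):
    model.encode(query)
elapsed = time.perf_counter() - start
print(f"Average response time: {elapsed / n * 1000:.2f}ms")
print(f"Throughput: {n / elapsed:.2f} requests/sec")
```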