
Discrepancy in CPU Inference Latency: Cross-Encoder MiniLM Models vs. ColBERT

Open gaceladri opened this issue 10 months ago • 1 comment

Greetings :wave:

I've been benchmarking the CPU inference latency for various models and observed some significant differences. Specifically, I'm comparing the performance of sentence_transformers' 'cross-encoder/ms-marco-MiniLM-L-12-v2' with other models. The latency for the top-10 re-ranking varies quite a bit, and I'm trying to understand whether this is expected behavior or whether there might be an issue with my setup. For clarity, here's a quick summary of the latencies I've recorded:

  • ColBERT: 620 ms
  • Cross-encoder MiniLM-L-12-v2: 309 ms
  • Cross-encoder MiniLM-L-6-v2: 150 ms

I've attached a visual representation of these findings for reference:

[attached image: CPU reranking latency comparison chart]

Could someone please shed some light on this? Is there a particular reason why the ColBERT model has over double the latency of MiniLM-L-12-v2? Any insights or suggestions for improving the inference speed of ColBERT on CPU would be greatly appreciated.
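
For reference, here is a simplified sketch of how such a top-10 CPU reranking comparison can be timed (model names as above; the query and passages are placeholders, and the RAGatouille `rerank` call is assumed to follow its documented keyword arguments, which may differ slightly between versions):

```python
import time

from sentence_transformers import CrossEncoder
from ragatouille import RAGPretrainedModel

query = "what is late interaction in ColBERT?"
documents = [f"candidate passage number {i} about retrieval" for i in range(10)]

# Cross-encoder: scores every (query, passage) pair in one batched forward pass.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", device="cpu")
start = time.perf_counter()
scores = cross_encoder.predict([(query, doc) for doc in documents])
print(f"MiniLM-L-12 rerank: {(time.perf_counter() - start) * 1000:.0f} ms")

# ColBERT via RAGatouille: encodes the passages on the fly, then scores via late interaction.
colbert = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
start = time.perf_counter()
ranked = colbert.rerank(query=query, documents=documents, k=10)
print(f"ColBERTv2 rerank:   {(time.perf_counter() - start) * 1000:.0f} ms")
```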

gaceladri avatar Apr 10 '24 06:04 gaceladri

I think this isn't shocking given how reranking with ColBERT works, though I'd expect it to be a bit quicker. The main factor at play here is that MiniLM-L-12 is only ~35M parameters (L-6 is around ~22M for comparison), whereas ColBERTv2 is ~110M, a bit more than 3 times as big, which explains why it runs a lot slower comparatively.

The strength of ColBERT as a reranker, however, is that you can pre-compute document representations in advance, which you cannot do with cross-encoders. In such a setup it'd run noticeably quicker than cross-encoder alternatives!
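
As a rough sketch of that pre-computation workflow with RAGatouille (the index name and example passages are illustrative, and exact signatures may vary by version): you index the collection once, then only the short query needs to be encoded at search time.

```python
from ragatouille import RAGPretrainedModel

documents = [
    "passage one about retrieval pipelines",
    "passage two about ColBERT late interaction",
]

# One-off: encode and store the document representations ahead of time.
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
index_path = RAG.index(collection=documents, index_name="my_docs")

# At query time, load the pre-built index and only encode the query.
RAG = RAGPretrainedModel.from_index(index_path)
results = RAG.search(query="how does late interaction work?", k=10)
```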

bclavie avatar Apr 12 '24 13:04 bclavie