
🔮 Project: Multi-Vector Retrieval (ColBERT)

Open MotzWanted opened this issue 1 year ago • 0 comments

WHY Currently, VOD training only supports document-level embeddings: each document is represented by a single vector, which limits the granularity of the contextual information captured.

ColBERT introduced a more complex interaction by encoding each passage into a matrix of token-level embeddings. During search, it further embeds every query into another matrix, allowing efficient passage retrieval that contextually matches the query using scalable vector-similarity operators.

The rich interactions enabled by ColBERT have been shown to surpass the quality of single-vector representation models. However, scaling it efficiently to large corpora is not trivial.

HOW The project will address the aforementioned goals through the following means:

Utilizing Fine-Grained Contextual Late Interaction:

  • Leverage ColBERT's ability to encode queries and passages into sequences of token-level embeddings.
  • Improve vod's on-disk data structures to handle 3-dimensional tensors with variable shapes (e.g., shape N x ? x H)
  • Implement ColBERT's MaxSim operator in the loss layer
  • Implement ColBERT's two-stage retrieval
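
As a rough illustration of the first three bullets, here is a minimal pure-Python sketch of a ragged store for token-level embeddings (shape N x ? x H) together with the MaxSim score. `RaggedStore`, `dot`, and `maxsim` are hypothetical names chosen for this example and are not part of vod; a real implementation would presumably use memory-mapped arrays and batched tensor operations instead of Python lists.

```python
from typing import List, Sequence

Vector = Sequence[float]


class RaggedStore:
    """Hypothetical flat store for N documents with variable token counts.

    All token vectors are concatenated into one flat buffer, and an offsets
    table records where each document starts and ends (N x ? x H layout).
    """

    def __init__(self, hidden_size: int) -> None:
        self.hidden_size = hidden_size
        self.values: List[float] = []  # every token vector, back to back
        self.offsets: List[int] = [0]  # doc i owns rows offsets[i]..offsets[i+1]

    def append(self, doc: Sequence[Vector]) -> None:
        for token in doc:
            if len(token) != self.hidden_size:
                raise ValueError("token vector has the wrong hidden size")
            self.values.extend(token)
        self.offsets.append(self.offsets[-1] + len(doc))

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, i: int) -> List[List[float]]:
        h = self.hidden_size
        start, end = self.offsets[i], self.offsets[i + 1]
        return [self.values[r * h:(r + 1) * h] for r in range(start, end)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query: Sequence[Vector], doc: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token, take the best-matching
    document token, then sum those maxima over the query tokens."""
    if not doc:
        return 0.0  # assumption: an empty document scores zero
    return sum(max(dot(q, d) for d in doc) for q in query)


store = RaggedStore(hidden_size=2)
store.append([[1.0, 0.0], [0.0, 1.0]])  # document with 2 tokens
store.append([[0.5, 0.5]])              # document with 1 token
query = [[1.0, 0.0], [0.0, 2.0]]
scores = [maxsim(query, store[i]) for i in range(len(store))]
```

The offsets table is what lets documents of different lengths share one contiguous buffer, which is the property the on-disk format would need to preserve.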

Combine T5 Models with ColBERT:

  • Benchmark ColT5 against ColBERT
  • Benchmark the end-to-end search latency in a search engine such as Raffle.

Implement XTR: Contextualized Token Retriever:

  • Implement XTR loss
  • Implement XTR one-stage retrieval
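
To make the one-stage retrieval bullet more concrete, the sketch below scores documents using only the token hits returned by a token-level search, without loading full document matrices. It reflects one reading of the XTR paper and should be checked against it: when a query token retrieved no token from a candidate document, the missing similarity is imputed with the lowest similarity that query token did retrieve. The function name `xtr_style_scores` and the `(doc_id, similarity)` hit format are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# One retrieved document token: (id of the document it belongs to, similarity).
Hit = Tuple[str, float]


def xtr_style_scores(hits_per_query_token: List[List[Hit]]) -> Dict[str, float]:
    """Score candidate documents from retrieved token hits only.

    hits_per_query_token[i] holds the top-k hits for query token i.
    """
    n_query_tokens = len(hits_per_query_token)
    if n_query_tokens == 0:
        return {}

    # Candidates: every document with at least one retrieved token.
    candidates = {doc_id for hits in hits_per_query_token for doc_id, _ in hits}
    totals = {doc_id: 0.0 for doc_id in candidates}

    for hits in hits_per_query_token:
        if not hits:
            continue  # assumption: a query token with no hits contributes nothing

        # Best retrieved similarity per document for this query token.
        best: Dict[str, float] = {}
        for doc_id, sim in hits:
            best[doc_id] = max(sim, best.get(doc_id, sim))

        # Imputed value for documents this query token did not retrieve.
        floor = min(sim for _, sim in hits)
        for doc_id in candidates:
            totals[doc_id] += best.get(doc_id, floor)

    # Average over query tokens so scores are comparable across query lengths.
    return {doc_id: total / n_query_tokens for doc_id, total in totals.items()}


hits = [
    [("a", 0.9), ("b", 0.4)],              # hits for query token 0
    [("a", 0.2), ("a", 0.7), ("c", 0.5)],  # hits for query token 1
]
xtr_scores = xtr_style_scores(hits)
ranking = sorted(xtr_scores, key=xtr_scores.get, reverse=True)
```

Because the score is computed from the retrieved hits alone, no second pass over full token matrices is needed, which is what distinguishes this from the two-stage ColBERT pipeline listed earlier.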

Refinements:

  • Investigate Robust Multi-Hop Reasoning at Scale via Condensed Retrieval.
  • Explore effective and efficient retrieval via Lightweight Late Interaction (e.g., PLAID)

WHAT The anticipated outcomes of this project include:

  1. State-of-the-art retrieval for RAG models (T5 + XTR)
  2. A scalable solution capable of handling large corpora without compromising efficiency.

MotzWanted avatar Aug 14 '23 13:08 MotzWanted