vod 🔮 Project: Multi-Vector Retrieval (ColBERT)

Open MotzWanted opened this issue 1 year ago • 0 comments

WHY Currently, VOD training supports only document-level embeddings. Each document is represented by a single vector, which limits the granularity of the contextual information captured.

ColBERT introduced a finer-grained interaction by encoding each passage into a matrix of token-level embeddings. At search time, it embeds each query into another matrix and retrieves passages that contextually match the query using scalable vector-similarity operators.
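
As a minimal sketch of the late-interaction scoring described above: each query token vector is matched against its most similar passage token vector, and the per-token maxima are summed (the MaxSim operator from the ColBERT paper). The function names and the toy 2-dimensional vectors are illustrative assumptions and not part of vod; a real implementation would use batched tensor operations.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, passage_vecs):
    """Late-interaction score: for each query token, take the best-matching
    passage token, then sum those maxima over all query tokens."""
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)


# Toy example (hypothetical embeddings): 2 query tokens, 2 passage tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
passage = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query, passage)  # best matches are 1.0 and 0.5
```

Because the score decomposes over query tokens, passage matrices can be encoded offline and only the cheap max-and-sum step runs at query time.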

The rich interactions enabled by ColBERT have been shown to surpass the quality of single-vector representation models. However, scaling it efficiently to large corpora is not trivial.

HOW The project will address these limitations through the following means:

Utilizing Fine-Grained Contextual Late Interaction:

Leverage ColBERT's ability to encode queries and passages into sequences of token-level embeddings.
Improve vod's on-disk data structures to handle 3-dimensional tensors with a variable-length dimension (e.g., shape N x ? x H, where the number of token vectors varies per document)
Implement ColBERT's MaxSim operator in the loss layer
Implement ColBERT's two-stage retrieval
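
One possible layout for tensors of shape N x ? x H is a flat buffer of token vectors plus an offsets table marking where each document starts. The sketch below is an assumption for illustration only: `RaggedStore` is a hypothetical name, it keeps everything in memory, and it does not reflect vod's actual storage format.

```python
from array import array


class RaggedStore:
    """Stores N documents, each with a variable number of H-dim token vectors,
    as one flat float buffer plus per-document token offsets."""

    def __init__(self, hidden):
        self.hidden = hidden
        self.data = array("f")  # flat single-precision buffer, row-major
        self.offsets = [0]      # offsets[i] = index of the first token of doc i

    def append(self, token_vecs):
        """Add one document given as a list of H-dimensional vectors."""
        for vec in token_vecs:
            if len(vec) != self.hidden:
                raise ValueError("every token vector must have length H")
            self.data.extend(vec)
        self.offsets.append(self.offsets[-1] + len(token_vecs))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        """Return document i as a list of token vectors."""
        if not 0 <= i < len(self):
            raise IndexError(i)
        h = self.hidden
        start, end = self.offsets[i], self.offsets[i + 1]
        return [list(self.data[t * h:(t + 1) * h]) for t in range(start, end)]


# Hypothetical usage: two documents with 1 and 3 tokens, H = 2.
store = RaggedStore(hidden=2)
store.append([[1.0, 2.0]])
store.append([[0.5, 0.25], [3.0, 4.0], [5.0, 6.0]])
```

With this layout the flat buffer and the offsets can each be written to a single file and memory-mapped, so reading one document touches only its own contiguous slice.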

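The two-stage retrieval item can be sketched roughly as follows, assuming the scheme from the ColBERT paper: a first stage finds the document tokens nearest to each query token and collects their documents as candidates, and a second stage re-ranks those candidates with the full MaxSim score. The brute-force search below stands in for an approximate nearest-neighbor index, and all names (`two_stage_search`, `k_tokens`, `top_k`) are hypothetical.

```python
import heapq


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def two_stage_search(query_vecs, corpus, k_tokens=2, top_k=2):
    """corpus maps doc_id -> list of token vectors."""
    # Stage 1: token-level candidate generation (brute force here; an ANN
    # index over all document tokens would replace this in practice).
    token_index = [(doc_id, vec) for doc_id, vecs in corpus.items() for vec in vecs]
    candidates = set()
    for q in query_vecs:
        nearest = heapq.nlargest(k_tokens, token_index, key=lambda item: dot(q, item[1]))
        candidates.update(doc_id for doc_id, _ in nearest)

    # Stage 2: exact MaxSim re-ranking restricted to the candidate documents.
    scored = sorted(((maxsim(query_vecs, corpus[d]), d) for d in candidates), reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


# Toy corpus with hypothetical 2-dimensional token embeddings.
corpus = {
    "a": [[1, 0], [0, 1]],
    "b": [[1, 0], [1, 0]],
    "c": [[-1, 0], [0, -1]],
}
results = two_stage_search([[1, 0], [0, 1]], corpus)
```

The point of the split is that the expensive MaxSim computation only runs on the small candidate set rather than on the whole corpus.
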
Combine T5 Models with ColBERT:

Benchmark ColT5 against ColBERT
Benchmark the end-to-end search latency in a search engine such as Raffle.

Implement XTR: ConteXtualized Token Retriever:

Implement XTR loss
Implement XTR one-stage retrieval
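
A rough sketch of one-stage XTR-style scoring, under our reading of the XTR paper: documents are scored using only the tokens returned by the token-retrieval step, and when a document has no retrieved token for some query token, the missing similarity is imputed with the lowest retrieved similarity for that query token. The function name, the input format, and the toy numbers are assumptions for illustration; each query token is assumed to have at least one hit.

```python
from collections import defaultdict


def xtr_scores(retrieved):
    """retrieved[i] is the list of (doc_id, similarity) hits returned by
    token retrieval for query token i. Returns doc_id -> score."""
    n = len(retrieved)

    # Best retrieved similarity for every (document, query token) pair.
    best = defaultdict(dict)
    for i, hits in enumerate(retrieved):
        for doc_id, sim in hits:
            if sim > best[doc_id].get(i, float("-inf")):
                best[doc_id][i] = sim

    # Imputed value per query token: the lowest similarity among its hits,
    # an upper bound on the similarity of any token that was not retrieved.
    imputed = [min(sim for _, sim in hits) for hits in retrieved]

    # Average over query tokens, filling gaps with the imputed value.
    return {
        doc_id: sum(per_token.get(i, imputed[i]) for i in range(n)) / n
        for doc_id, per_token in best.items()
    }


# Toy example: 2 query tokens, top-2 retrieved document tokens each.
hits = [
    [("a", 0.9), ("b", 0.8)],
    [("a", 0.7), ("c", 0.6)],
]
scores = xtr_scores(hits)
```

Unlike the two-stage ColBERT pipeline, no document's full token matrix has to be loaded for re-ranking, which is where the latency savings would come from.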

Refinements:

Investigate Robust Multi-Hop Reasoning at Scale via Condensed Retrieval.
Explore effective and efficient retrieval via Lightweight Late Interaction (e.g., PLAID)

WHAT The anticipated outcomes of this project include:

State-of-the-art retrieval for RAG models (T5 + XTR)
A scalable solution capable of handling large corpora without compromising efficiency.

Aug 14 '23 13:08 MotzWanted