RAGatouille
Inconsistent search results length for high top-k values
Hi, I'm getting an issue similar to #130. The number of results returned isn't always the requested top-k (e.g. with k=500, len(res) comes back as 4xx or 3xx); this happens more often with the fine-tuned version of colbert-v2.
Though such a high top-k is uncommon, fixing this would help me when benchmarking and would make the function more predictable.
- The dataset has 800 docs.
- Model: fine-tuned colbert-ir/colbertv2.0.
Code:
from ragatouille import RAGPretrainedModel
# Indexing
RAG = RAGPretrainedModel.from_pretrained("/path/to/finetuned_model")
index_path = RAG.index(index_name="my_index", collection=docs, document_ids=doc_ids)
# Retrieving
RAG = RAGPretrainedModel.from_index('.ragatouille/colbert/indexes/my_index')
results = RAG.search(query, k=500)
print(len(results))
# -> 500, 491, 413, 3xx, ....
Thanks for the help!
Hey! This isn't a full solution to your problem (which is essentially down to how the optimised retrieval engine works, and the default/dynamic hyperparameters not being well tuned for small collections), but with just ~800 documents for benchmarking purposes, I think you could alleviate this issue by using in-memory encoding rather than indexing. (Until I build a proper HNSW-style index, I'm also planning on letting users create an "index" by persisting their in-memory encoding, which will work really well for relatively small numbers of documents!)
E.g. in your situation, replace
RAG = RAGPretrainedModel.from_pretrained("/path/to/finetuned_model")
index_path = RAG.index(index_name="my_index", collection=docs, document_ids=doc_ids)
# Retrieving
RAG = RAGPretrainedModel.from_index('.ragatouille/colbert/indexes/my_index')
results = RAG.search(query, k=500)
with
RAG = RAGPretrainedModel.from_pretrained("/path/to/finetuned_model")
RAG.encode(docs)
results = RAG.search_encoded_docs(query, k=500)
This will exhaustively search through every single document rather than using PLAID-style approximation, which for small datasets with high k values guarantees that you always get the number of results you asked for. The computational overhead is minimal at your data scale (on my machine, it takes ~45ms to query the index and ~55ms to query the in-memory encoded docs).
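To illustrate the difference, here is a minimal sketch of what exhaustive late-interaction (MaxSim) scoring over in-memory encodings looks like. This is not RAGatouille's actual implementation, just a NumPy illustration of why scoring every document always yields exactly min(k, num_docs) results, unlike a pruned PLAID-style search:

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    # Late-interaction (MaxSim) score: for each query token, take the
    # maximum similarity over all document tokens, then sum.
    # query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    # Both are assumed L2-normalised, so dot product = cosine similarity.
    sims = query_emb @ doc_emb.T  # (q_tokens, d_tokens)
    return sims.max(axis=1).sum()

def exhaustive_search(query_emb, doc_embs, k):
    # Score every single document -- no candidate pruning -- so the
    # result count is always exactly min(k, len(doc_embs)).
    scores = np.array([maxsim_score(query_emb, d) for d in doc_embs])
    k = min(k, len(doc_embs))
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]
```

At ~800 documents this brute-force loop is cheap; approximate indexes only start paying off at much larger collection sizes.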
Thanks, that works well! Small detail, but I think it'd be nice to add document_ids to RAG.encode, similar to how it's done with RAG.index, so that both can return the same result format.
Hey, this will come along with https://github.com/bclavie/RAGatouille/pull/137 (as well as making full-vectors indexing the default index for small collections)!
Hi, I'm hitting the same problem. Does RAG.encode support document_ids now?