RAGatouille
add_to_index() one/few documents throws error
Hello!
Great work on the tool so far, really loving it! I have a question; apologies if it has already been answered. I am trying to add a single document to an existing index using the add_to_index() function, but I get the following error:
```
RuntimeError: Error in void faiss::Clustering::train_encoded(faiss::idx_t, const uint8_t*, const faiss::Index*, faiss::Index&, const float*) at /project/faiss/faiss/Clustering.cpp:275: Error: 'nx >= k' failed: Number of training points (11) should be at least as large as number of clusters (32)
```
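For context, the underlying faiss constraint is easy to reproduce in isolation: k-means training throws this exact check whenever there are fewer training points than clusters. A minimal sketch (not RAGatouille code, just illustrating the faiss behaviour; the dimension and counts here are arbitrary):

```python
import numpy as np
import faiss

d, k, n_points = 128, 32, 11  # embedding dim, clusters, training points (arbitrary)
xs = np.random.rand(n_points, d).astype("float32")

# Training k-means with fewer points (11) than clusters (32) triggers
# the same 'nx >= k' failed check seen in the traceback above.
kmeans = faiss.Kmeans(d, k)
kmeans.train(xs)
```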
Looking into the source code, I found the following in colbert.py's add_to_index():
```python
current_len = len(searcher.collection)
new_doc_len = len(new_documents)

new_documents_with_ids = [
    {"content": doc, "document_id": new_pid_docid_map[pid]}
    for pid, doc in enumerate(new_documents)
    if new_pid_docid_map[pid] not in self.pid_docid_map
]

if new_docid_metadata_map is not None:
    self.docid_metadata_map = self.docid_metadata_map or {}
    self.docid_metadata_map.update(new_docid_metadata_map)

if current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05:
    self.index(
        [doc["content"] for doc in new_documents_with_ids],
        {
            pid: doc["document_id"]
            for pid, doc in enumerate(new_documents_with_ids)
        },
        docid_metadata_map=self.docid_metadata_map,
        index_name=self.index_name,
        max_document_length=self.config.doc_maxlen,
        overwrite="force_silent_overwrite",
    )
```
If the current length of the collection plus the length of the new documents is under 5000, it re-indexes rather than using IndexUpdater, which might be more efficient. But in that re-indexing branch, I suspect only the new documents are being passed to self.index(), in this case just 1, and so it throws the error above. I could be wrong; can you confirm this?
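For reference, here is a minimal sketch of how I'm triggering it, assuming the standard RAGPretrainedModel usage from the README (the index name and documents are placeholders, and the exact add_to_index() signature may differ slightly between versions):

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Build a small index first.
RAG.index(
    collection=["first document ...", "second document ...", "third document ..."],
    index_name="demo_index",
)

# Adding a single document takes the re-indexing branch shown above, but
# only the new document reaches self.index(), so faiss trains its clusters
# on a handful of passages and fails the 'nx >= k' check.
RAG.add_to_index(["one new document"])
```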
Thank you!
Good catch! It seems like that's indeed what's happening. This got introduced sneakily because the CRUD aspect hasn't gotten a lot of love yet. Will resolve in a PR soon, thank you!
cc @anirudhdharmarajan I'll try to attend to this, but if you'd like to, feel free! (It's about updating the CRUD support to make sure it actually loads up all the existing metadata/pid/etc... and merges the existing collection with the new one before regenerating)
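In other words, the re-indexing branch needs to concatenate the existing collection and pid/docid maps with the new ones before calling self.index(). A rough sketch of that merge, reusing the variable names from the excerpt above (an illustration of the idea, not the actual patch):

```python
# Merge the existing collection with the new documents so faiss trains on
# the full collection rather than just the few new passages.
combined_documents = list(searcher.collection) + [
    doc["content"] for doc in new_documents_with_ids
]

# Extend the existing pid -> docid map; new passages get pids starting
# right after the current collection.
combined_pid_docid_map = dict(self.pid_docid_map)
offset = len(searcher.collection)
for i, doc in enumerate(new_documents_with_ids):
    combined_pid_docid_map[offset + i] = doc["document_id"]

self.index(
    combined_documents,
    combined_pid_docid_map,
    docid_metadata_map=self.docid_metadata_map,
    index_name=self.index_name,
    max_document_length=self.config.doc_maxlen,
    overwrite="force_silent_overwrite",
)
```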
@bclavie I can take this on, I'll have a PR open either today or tomorrow!
Thank you for the excellent work. What's the progress on this issue?
Hey, I've been extremely busy with other items, but I'll have this fix ready by Friday this week. Apologies for the delay!
Thanks @anirudhdharmarajan! The PR was published as part of a post-release, and is now properly a main release item in 0.0.8.