RAGatouille
add_to_index() one/few documents throws error
Hello!
Great work on the tool so far, really loving it! I have a question; apologies if it has already been answered. I am trying to add a single document to an existing index using the add_to_index() function, but I get the following error:
```
RuntimeError: Error in void faiss::Clustering::train_encoded(faiss::idx_t, const uint8_t*, const faiss::Index*, faiss::Index&, const float*) at /project/faiss/faiss/Clustering.cpp:275: Error: 'nx >= k' failed: Number of training points (11) should be at least as large as number of clusters (32)
```
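For context, the underlying faiss constraint is easy to reproduce in isolation: k-means training throws this exact check whenever there are fewer training points than clusters. A minimal sketch (not RAGatouille code, just illustrating the faiss behaviour; the dimension and counts here are arbitrary):

```python
import numpy as np
import faiss

d, k, n_points = 128, 32, 11  # embedding dim, clusters, training points (arbitrary)
xs = np.random.rand(n_points, d).astype("float32")

# Training k-means with fewer points (11) than clusters (32) triggers
# the same 'nx >= k' failed check seen in the traceback above.
kmeans = faiss.Kmeans(d, k)
kmeans.train(xs)
```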
Looking into the source code, I found the following in colbert.py's add_to_index():
```python
current_len = len(searcher.collection)
new_doc_len = len(new_documents)

new_documents_with_ids = [
    {"content": doc, "document_id": new_pid_docid_map[pid]}
    for pid, doc in enumerate(new_documents)
    if new_pid_docid_map[pid] not in self.pid_docid_map
]

if new_docid_metadata_map is not None:
    self.docid_metadata_map = self.docid_metadata_map or {}
    self.docid_metadata_map.update(new_docid_metadata_map)

if current_len + new_doc_len < 5000 or new_doc_len > current_len * 0.05:
    self.index(
        [doc["content"] for doc in new_documents_with_ids],
        {
            pid: doc["document_id"]
            for pid, doc in enumerate(new_documents_with_ids)
        },
        docid_metadata_map=self.docid_metadata_map,
        index_name=self.index_name,
        max_document_length=self.config.doc_maxlen,
        overwrite="force_silent_overwrite",
    )
```
If the current length of the collection plus the length of the new documents is under 5000, it re-indexes rather than using IndexUpdater, which might be more efficient. But in that re-indexing branch, I suspect only the new documents are being passed to self.index(), in this case just 1, and so it throws the error above. I could be wrong; can you confirm this?
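For reference, here is a minimal sketch of how I'm triggering it, assuming the standard RAGPretrainedModel usage from the README (the index name and documents are placeholders, and the exact add_to_index() signature may differ slightly between versions):

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Build a small index first.
RAG.index(
    collection=["first document ...", "second document ...", "third document ..."],
    index_name="demo_index",
)

# Adding a single document takes the re-indexing branch shown above, but
# only the new document reaches self.index(), so faiss trains its clusters
# on a handful of passages and fails the 'nx >= k' check.
RAG.add_to_index(["one new document"])
```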
Thank you!
Good catch! It seems like that's indeed what's happening. This got introduced sneakily because the CRUD aspect hasn't gotten a lot of love yet. Will resolve in a PR soon, thank you!
cc @anirudhdharmarajan I'll try to attend to this, but if you'd like to, feel free! (It's about updating the CRUD support to make sure it actually loads up all the existing metadata/pid/etc... and merges the existing collection with the new one before regenerating)
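In other words, the re-indexing branch needs to concatenate the existing collection and pid/docid maps with the new ones before calling self.index(). A rough sketch of that merge, reusing the variable names from the excerpt above (an illustration of the idea, not the actual patch):

```python
# Merge the existing collection with the new documents so faiss trains on
# the full collection rather than just the few new passages.
combined_documents = list(searcher.collection) + [
    doc["content"] for doc in new_documents_with_ids
]

# Extend the existing pid -> docid map; new passages get pids starting
# right after the current collection.
combined_pid_docid_map = dict(self.pid_docid_map)
offset = len(searcher.collection)
for i, doc in enumerate(new_documents_with_ids):
    combined_pid_docid_map[offset + i] = doc["document_id"]

self.index(
    combined_documents,
    combined_pid_docid_map,
    docid_metadata_map=self.docid_metadata_map,
    index_name=self.index_name,
    max_document_length=self.config.doc_maxlen,
    overwrite="force_silent_overwrite",
)
```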
@bclavie I can take this on, I'll have a PR open either today or tomorrow!
Thank you for the excellent work. What's the progress on this issue?
Hey, I've been extremely busy with other items, but I'll have this fix ready by Friday this week. Apologies for the delay!
Thanks @anirudhdharmarajan! The PR was published as part of a post-release, and is now properly a main release item in 0.0.8.