Can't specify index during TF_IDF fit
Not sure if this is a missing feature, a bug, or a conceptual error on my part. I am using a FAISS document store that needs to support multiple indexes. After writing all of my documents under a specified index, I call the `update_embeddings` method with a DPR retriever and the same index, which works fine. Next, I call the `fit` method on the TF-IDF retriever to support sparse retrieval, but I get an error reporting 0 documents processed. This seems to be because `fit` takes no parameters, yet it internally calls `get_all_documents()`, which does take an `index` parameter. The simple fix seems to be adding an optional `index` parameter to the `TfidfRetriever.fit` method.
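To illustrate the proposed fix, here is a minimal sketch with stand-in classes (the real Haystack `FAISSDocumentStore` and `TfidfRetriever` are not imported; `FakeDocumentStore` and `TfidfLikeRetriever` are hypothetical stand-ins for illustration only):

```python
class FakeDocumentStore:
    """Tiny in-memory stand-in: documents are grouped by index name."""
    def __init__(self, default_index="document"):
        self.index = default_index
        self._docs = {}

    def write_documents(self, docs, index=None):
        self._docs.setdefault(index or self.index, []).extend(docs)

    def get_all_documents(self, index=None):
        # Falls back to the store's default index when none is given --
        # exactly the behaviour that makes fit() see 0 documents.
        return self._docs.get(index or self.index, [])


class TfidfLikeRetriever:
    def __init__(self, document_store):
        self.document_store = document_store
        self.corpus = []

    def fit(self, index=None):
        # Proposed change: forward an optional index to get_all_documents()
        self.corpus = self.document_store.get_all_documents(index=index)
        if not self.corpus:
            raise ValueError("Fit method called with an empty document store")


store = FakeDocumentStore()
store.write_documents(["doc a", "doc b"], index="my_index")

retriever = TfidfLikeRetriever(store)
retriever.fit(index="my_index")  # now sees the documents under "my_index"
```

Without the `index="my_index"` argument, `get_all_documents()` would look under the default `"document"` index, find nothing, and `fit` would fail, which matches the behaviour described above.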
👋 Please report the code, so we can have a reproducible example and help you.
Of course. Here is my document ingestion pipeline. Everything seems to work fine except the .fit() call
```python
def add(files, index):
    make_temp_dir(files)
    docs = convert_files_to_docs(temp_dir, clean_func=clean_wiki_text)
    processed = preprocessors['default'].process(docs)
    document_store.write_documents(processed, index=index)
    document_store.update_embeddings(
        retriever=retrievers['dpr'],
        update_existing_embeddings=False,
        index=index,
    )
    retrievers['tf_idf'].fit()  # this is broken because the documents are stored under an index
    destroy_temp_dir()
```
The `.fit()` call in turn calls the `document_store.get_all_documents()` method, which, if not given an index, defaults to the document store's `self.index`. That default is not the index my documents were written under, so `fit` sees no documents.
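Until `fit()` accepts an index, one possible workaround is to temporarily point the store's default `self.index` at the custom index for the duration of the call. This is a sketch, not a tested Haystack recipe; `use_index` is a hypothetical helper, and `_StubStore` stands in for the real document store:

```python
from contextlib import contextmanager

@contextmanager
def use_index(document_store, index):
    """Temporarily switch the store's default index, then restore it."""
    previous = document_store.index
    document_store.index = index
    try:
        yield document_store
    finally:
        document_store.index = previous


# Against the real objects, usage would look like:
#     with use_index(document_store, index):
#         retrievers['tf_idf'].fit()

# Demonstration with a minimal stub:
class _StubStore:
    def __init__(self):
        self.index = "document"

store = _StubStore()
with use_index(store, "my_index"):
    inside = store.index   # the custom index while the block runs
after = store.index        # the original default, restored afterwards
```

The `try`/`finally` ensures the default index is restored even if `fit()` raises, so other retrievers sharing the store are unaffected.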
Hi @rdubester thanks for the code snippet! Could you also provide us with the code you use to create `retrievers` and `preprocessors`?
Hey @rdubester let us know if you solved or worked around the issue somehow. If it's solved I would close the issue.
Probably related to #1634.
I'm closing this, @rdubester feel free to reopen if this is still an issue with the latest Haystack release