Overwriting duplicate documents in a FAISSDocumentStore does not remove overwritten embeddings
First I created a FAISS document store, then I used a DensePassageRetriever and updated the embeddings, and after this step I saved the FAISS index. In the next step I have to add more documents to the document store, so I used document_store.update_embeddings(retriever, update_existing_embeddings=False) and then saved the FAISS index again. However, if I now try to reload the document store with document_store = FAISSDocumentStore.load(index_path=path), I get the following error:
ValueError: The number of documents present in the SQL database (4) does not match the number of embeddings in FAISS (7). Make sure your FAISS configuration file correctly points to the same database that was used when creating the original index.
I expected that only the documents without embeddings would be updated when setting update_existing_embeddings=False.
@Pavithree do you use the same sql_url in both cases? Do you use a persistent SQL database?
FAISSDocumentStore uses two separate databases (a FAISS index for the vectors and a SQL database for document storage). If 7 is the true total number of your documents after the update and 4 is the number of updated documents, my guess would be that your SQL database hasn't been persisted.
If that's not the case, please provide a small script so I can reproduce the issue.
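For reference, a persistent SQL database can be specified explicitly via the sql_url argument when the store is first created. A minimal sketch (the SQLite path below is only a placeholder):
from haystack.document_stores import FAISSDocumentStore
# Sketch: point the store at an explicit, persistent SQLite file so the SQL side
# survives across runs alongside the saved FAISS index.
document_store = FAISSDocumentStore(
    sql_url="sqlite:///my_faiss_document_store.db",  # placeholder path
    embedding_dim=128,
    faiss_index_factory_str="Flat",
)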
I am saving the index and reloading it from the same location. Here are the steps I followed:
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import DensePassageRetriever
from haystack.utils import convert_files_to_docs, clean_wiki_text

# First run: create a new document store
document_store = FAISSDocumentStore(embedding_dim=128, faiss_index_factory_str="Flat")
doc_dir = "path with text files"
dicts = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
document_store.write_documents(dicts)
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="vblagoje/dpr-question_encoder-single-lfqa-wiki",
    passage_embedding_model="vblagoje/dpr-ctx_encoder-single-lfqa-wiki",
)
document_store.update_embeddings(retriever)
document_store.save(index_path='document_store/FAISS_database/FAISS_database')
After this, I have some additional files to write to the document store, but I would like embeddings to be computed only for the new files, so I use update_existing_embeddings=False when updating the embeddings. Below is the code:
# Load the same database
document_store = FAISSDocumentStore.load(index_path='document_store/FAISS_database/FAISS_database')
doc_dir = "path with text files"
dicts = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
document_store.write_documents(dicts)
document_store.update_embeddings(retriever, update_existing_embeddings=False)
document_store.save(index_path='document_store/FAISS_database/FAISS_database')
Now when I reload the document store,
document_store = FAISSDocumentStore.load(index_path='document_store/FAISS_database/FAISS_database')
I get the following error: ValueError: The number of documents present in the SQL database (4) does not match the number of embeddings in FAISS (7). Make sure your FAISS configuration file correctly points to the same database that was used when creating the original index.
@tstadel please let me know if I am missing anything
Hi @Pavithree, could you try and run your code with update_existing_embeddings=True instead of False and see if you get the same error?
Hi @sjrl , I do not get the error with update_existing_embeddings=True.
Okay, I think I've identified an issue that could be causing your problem. To confirm, could you pass the option duplicate_documents='fail' to the second write_documents call? This would look something like this:
doc_dir = "path with text files"
dicts = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
document_store.write_documents(
    dicts,
    duplicate_documents='fail'
)
Then check whether you get an error message stating that you are writing duplicate documents to the database.
If duplicate documents are written to the loaded FAISS database, the default behavior is to overwrite the documents that are already in the database (duplicate_documents='overwrite'). It appears that this causes the embedding vectors of the overwritten documents to be lost. However, document_store.get_embedding_count() does not reflect this change.
For example, let's suppose you previously had 50 documents with embeddings in the loaded FAISS database.
- Next you write 10 new documents to this database. Let's say 3 of the documents are duplicates.
- After the write_documents function is finished you'll have a total of 57 documents in the database, but only 47 embeddings. You can check for these embeddings explicitly by running:
counter = 0
for doc in document_store.get_all_documents(return_embedding=True):
    if doc.embedding is not None:
        counter += 1
print(counter)
- However, document_store.get_embedding_count() will still return 50 in this example. I believe this is because document_store.faiss_indexes[index] is not updated to reflect the deleted embeddings.
This mismatch causes further problems when calling document_store.update_embeddings() with update_existing_embeddings=False. However, the problem is avoided when using the option update_existing_embeddings=True, because document_store.faiss_indexes[index] is reset.
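One way to surface the mismatch before saving (a minimal sketch, assuming the document_store from the steps above) is to compare the FAISS-side count with the number of documents that actually carry an embedding:
# Documents in SQL that actually have an embedding attached
embedded_docs = sum(
    1 for doc in document_store.get_all_documents(return_embedding=True)
    if doc.embedding is not None
)
# What FAISS believes it holds vs. how many documents are in SQL
faiss_vectors = document_store.get_embedding_count()
sql_docs = document_store.get_document_count()
print(f"SQL docs: {sql_docs}, embedded docs: {embedded_docs}, FAISS vectors: {faiss_vectors}")
# If sql_docs and faiss_vectors disagree, reloading a saved index can raise the ValueError above.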
@tstadel I'd appreciate your insight into this problem and I'd be happy to share the code and text files I used to reproduce this error.
@Pavithree while we figure out a fix for this issue, I believe that if you use the option duplicate_documents='skip' when calling document_store.write_documents(), you should be able to avoid the error you were originally getting without needing to recalculate the embeddings of all documents in your database.
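Applied to the second run from earlier in this thread, that suggestion would look roughly like this (a sketch that reuses the index path, doc_dir, and retriever defined above):
# Second run: skip documents that already exist instead of overwriting them,
# so their existing FAISS vectors are left untouched.
document_store = FAISSDocumentStore.load(index_path='document_store/FAISS_database/FAISS_database')
dicts = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
document_store.write_documents(dicts, duplicate_documents='skip')
document_store.update_embeddings(retriever, update_existing_embeddings=False)
document_store.save(index_path='document_store/FAISS_database/FAISS_database')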
Hi @sjrl, thank you for the detailed explanation. I tried
document_store.write_documents(
    dicts,
    duplicate_documents='fail'
)
This returns an error, as you pointed out.
Additionally, the suggested fix of providing duplicate_documents='skip' when calling document_store.write_documents() fixes the issue.
Hi @Pavithree, I'm glad this resolved your issue! I'll be leaving this issue open for the time being because write_documents should also work with duplicate_documents='overwrite'.
To reproduce:
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack import Document
import os
DOCUMENTS = [
    Document(content=content) for content in [
        "My name is Paul and I live in New York",
        "My name is Matteo and I live in Rome",
        "My name is Christelle and I live in Paris",
        "My name is Carla and I live in Berlin",
        "My name is Camila and I live in Madrid",
    ]
]

# Clean up artifacts from previous runs
try:
    os.remove('faiss_document_store.db')
    os.remove('FAISS_database.json')
    os.remove('FAISS_database')
except OSError:
    pass
document_store = FAISSDocumentStore(embedding_dim=768, faiss_index_factory_str="Flat")
document_store.write_documents(DOCUMENTS[:3])
retriever = EmbeddingRetriever(document_store=document_store, embedding_model="sentence-transformers/paraphrase-albert-small-v2")
document_store.update_embeddings(retriever)
document_store.save(index_path='FAISS_database')

document_store = FAISSDocumentStore.load(index_path='FAISS_database')
document_store.write_documents(DOCUMENTS)
document_store.update_embeddings(retriever, update_existing_embeddings=False)
document_store.save(index_path='FAISS_database')

document_store = FAISSDocumentStore.load(index_path='FAISS_database')
I agree with @sjrl's analysis.
IMHO, the problem is that the duplicate_documents parameter of write_documents and the update_existing_embeddings parameter of update_embeddings act in an uncoordinated way, because they are unaware of each other.
https://github.com/deepset-ai/haystack/blob/c6890c3e867cc5fdcf64431a5c4422764845d995/haystack/document_stores/faiss.py#L284-L290
When you overwrite some existing documents, the field vector_id is set only if the user provides the vectors along with the documents. Otherwise, vector_id is not set, even if the embedding already exists.
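Until the two parameters are coordinated, one possible workaround (a sketch only, not built-in behaviour; new_docs and retriever are assumed to come from the earlier snippets, with new_docs being Document objects whose IDs are content-based) is to delete the duplicates explicitly before rewriting them, the idea being that the delete also drops their stale vectors from the FAISS index:
# Workaround sketch (hypothetical): remove documents that would be overwritten,
# then write the fresh copies and embed only the documents that lack embeddings.
existing_ids = {doc.id for doc in document_store.get_all_documents()}
duplicate_ids = [doc.id for doc in new_docs if doc.id in existing_ids]
if duplicate_ids:
    document_store.delete_documents(ids=duplicate_ids)
document_store.write_documents(new_docs)
document_store.update_embeddings(retriever, update_existing_embeddings=False)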