Overwriting duplicate documents in a FAISSDocumentStore does not remove overwritten embeddings

Open Pavithree opened this issue 2 years ago • 8 comments

First I created a FAISS document store, then used a DensePassageRetriever to update the embeddings, and saved the FAISS index. In the next step I have to add more documents to the document store, so I called document_store.update_embeddings(retriever, update_existing_embeddings=False) and saved the FAISS index again. However, if I now try to reload the document store with document_store = FAISSDocumentStore.load(index_path=path), I get the following error:

ValueError: The number of documents present in the SQL database (4) does not match the number of embeddings in FAISS (7). Make sure your FAISS configuration file correctly points to the same database that was used when creating the original index.

I expected that, with update_existing_embeddings=False, only the documents without embeddings would be updated.

Pavithree avatar Jul 18 '22 06:07 Pavithree

@Pavithree do you use the same sql_url in both cases? Do you use a persistent SQL database? FAISSDocumentStore uses two separate databases: a FAISS index for the vectors and a SQL database for document storage. If 7 is the true total number of your documents after the update and 4 is the updated number, my guess would be that your SQL database hasn't been persisted. If that's not the case, please provide a small script so I can reproduce the issue.
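
For reference, here is a minimal sketch (my addition; "sqlite:///my_docs.db" and "my_faiss_index" are placeholder names, not from the report) of pinning both stores to explicit, persistent locations so they survive across runs:

# Sketch: give the document store an explicit, persistent SQL database,
# then persist the FAISS index alongside it. Paths are placeholders.
from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(
    sql_url="sqlite:///my_docs.db",  # SQL side: document text and metadata
    embedding_dim=128,
    faiss_index_factory_str="Flat",
)
# ... write documents and update embeddings here ...
document_store.save(index_path="my_faiss_index")  # FAISS side: vectors plus a JSON config referencing sql_url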

tstadel avatar Jul 18 '22 15:07 tstadel

I am saving the index and reloading it from the same location. Here are the steps I followed:

# First run: create a new document store
document_store = FAISSDocumentStore(embedding_dim=128, faiss_index_factory_str="Flat")

doc_dir = "path with text files"
dicts = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
document_store.write_documents(dicts)

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="vblagoje/dpr-question_encoder-single-lfqa-wiki",
    passage_embedding_model="vblagoje/dpr-ctx_encoder-single-lfqa-wiki",
)

document_store.update_embeddings(retriever)

document_store.save(index_path='document_store/FAISS_database/FAISS_database')

After this I have some additional files to write to the document store, but I would like only the new files to be embedded, so I use update_existing_embeddings=False when updating the embeddings. Below is the code:

# Load the same database
document_store = FAISSDocumentStore.load(index_path='document_store/FAISS_database/FAISS_database')

doc_dir = "path with text files"
dicts = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
document_store.write_documents(dicts)

document_store.update_embeddings(retriever, update_existing_embeddings=False)
document_store.save(index_path='document_store/FAISS_database/FAISS_database')

Now when I reload the document store with document_store = FAISSDocumentStore.load(index_path='document_store/FAISS_database/FAISS_database'), I get the following error:

ValueError: The number of documents present in the SQL database (4) does not match the number of embeddings in FAISS (7). Make sure your FAISS configuration file correctly points to the same database that was used when creating the original index.

@tstadel please let me know if I am missing anything

Pavithree avatar Jul 20 '22 10:07 Pavithree

Hi @Pavithree, could you try running your code with update_existing_embeddings=True instead of False and see if you get the same error?

sjrl avatar Jul 22 '22 10:07 sjrl

Hi @sjrl, I do not get the error with update_existing_embeddings=True.

Pavithree avatar Jul 23 '22 10:07 Pavithree

Okay, I think I've identified an issue that could be causing your problem.

To confirm, could you pass the option duplicate_documents='fail' to the second write_documents call? It would look something like this:

doc_dir = "path with text files"
dicts = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
document_store.write_documents(
    dicts,
    duplicate_documents='fail'
)

Then check whether you get an error message stating that you are writing duplicate documents to the database.

If duplicate documents are written to the loaded FAISS database, the default behavior is to overwrite the documents already in the database (duplicate_documents='overwrite'). It appears that this causes the embedding vectors of the overwritten documents to be lost; however, document_store.get_embedding_count() does not reflect this change.

For example, suppose you previously had 50 documents with embeddings in the loaded FAISS database.

  • Next you write 10 new documents to this database; say 3 of them are duplicates.
  • After write_documents finishes, you'll have a total of 57 documents in the database, but only 47 embeddings. You can check for these embeddings explicitly by running:

counter = 0
for doc in document_store.get_all_documents(return_embedding=True):
    if doc.embedding is not None:
        counter += 1
print(counter)  # counts the documents that actually carry an embedding

  • However, document_store.get_embedding_count() will still return 50 in this example. I believe this is because document_store.faiss_indexes[index] is not updated to reflect the deleted embeddings.

This mismatch causes further problems when calling document_store.update_embeddings() with update_existing_embeddings=False. The problem is avoided with update_existing_embeddings=True because document_store.faiss_indexes[index] is reset in that case.
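
As a quick way to surface the mismatch, here is a small check (a sketch, assuming document_store is the loaded FAISSDocumentStore from above) that compares the two counts:

# Sketch: compare the FAISS-side embedding count with the number of
# documents in SQL that actually carry an embedding.
faiss_count = document_store.get_embedding_count()
sql_count = sum(
    doc.embedding is not None
    for doc in document_store.get_all_documents(return_embedding=True)
)
print(faiss_count, sql_count)  # the two diverge after duplicates are overwritten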

@tstadel I'd appreciate your insight into this problem and I'd be happy to share the code and text files I used to reproduce this error.

sjrl avatar Jul 25 '22 12:07 sjrl

@Pavithree while we figure out a fix for this issue, I believe that if you use the option duplicate_documents='skip' when calling document_store.write_documents(), you should be able to avoid the error you were originally getting without needing to recalculate the embeddings of all documents in your database. For example:
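
# Sketch of the workaround, reusing the paths and objects from your snippets above:
document_store = FAISSDocumentStore.load(index_path='document_store/FAISS_database/FAISS_database')
document_store.write_documents(dicts, duplicate_documents='skip')  # duplicates are skipped, existing embeddings kept
document_store.update_embeddings(retriever, update_existing_embeddings=False)  # embed only the new documents
document_store.save(index_path='document_store/FAISS_database/FAISS_database')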

sjrl avatar Jul 25 '22 13:07 sjrl

Hi @sjrl, thank you for the detailed explanation. I tried

document_store.write_documents(
    dicts,
    duplicate_documents='fail'
)

This returns an error, as you pointed out. Additionally, the suggested fix of passing duplicate_documents='skip' to document_store.write_documents() resolves the issue.

Pavithree avatar Jul 25 '22 20:07 Pavithree

Hi @Pavithree, I'm glad this resolved your issue! I'll leave this issue open for the time being because write_documents should also work with duplicate_documents='overwrite'.

sjrl avatar Jul 26 '22 07:07 sjrl

To reproduce:

from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack import Document
import os

DOCUMENTS = [
    Document(content=content)
    for content in [
        "My name is Paul and I live in New York",
        "My name is Matteo and I live in Rome",
        "My name is Christelle and I live in Paris",
        "My name is Carla and I live in Berlin",
        "My name is Camila and I live in Madrid",
    ]
]

# Start from a clean slate: remove any files left over from a previous run
for path in ("faiss_document_store.db", "FAISS_database.json", "FAISS_database"):
    try:
        os.remove(path)
    except FileNotFoundError:
        pass

# First run: write three documents, embed them, and persist the index
document_store = FAISSDocumentStore(embedding_dim=768, faiss_index_factory_str="Flat")
document_store.write_documents(DOCUMENTS[:3])
retriever = EmbeddingRetriever(document_store=document_store, embedding_model="sentence-transformers/paraphrase-albert-small-v2")
document_store.update_embeddings(retriever)
document_store.save(index_path="FAISS_database")

# Second run: write all five documents; the first three are duplicates and get overwritten
document_store = FAISSDocumentStore.load(index_path="FAISS_database")
document_store.write_documents(DOCUMENTS)
document_store.update_embeddings(retriever, update_existing_embeddings=False)
document_store.save(index_path="FAISS_database")

# Reloading now raises the ValueError reported above
document_store = FAISSDocumentStore.load(index_path="FAISS_database")

I agree with @sjrl's analysis.

IMHO, the problem is that the duplicate_documents parameter of write_documents and the update_existing_embeddings parameter of update_embeddings act in an uncoordinated way, because neither is aware of the other.

https://github.com/deepset-ai/haystack/blob/c6890c3e867cc5fdcf64431a5c4422764845d995/haystack/document_stores/faiss.py#L284-L290

When you overwrite existing documents, the vector_id field is set only if the user provides the vectors along with the documents. Otherwise, vector_id is not set, even if an embedding already exists for that document.
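
Under this reading, one possible mitigation (a sketch, not a verified fix; embed_documents is the regular EmbeddingRetriever API, but using it as a workaround here is my assumption) would be to supply the vectors together with the documents, so that vector_id is set even on the overwrite path:

# Sketch: attach embeddings before writing, so overwritten duplicates
# keep a valid vector_id. DOCUMENTS and retriever are as in the repro above.
docs = DOCUMENTS
embeddings = retriever.embed_documents(docs)
for doc, emb in zip(docs, embeddings):
    doc.embedding = emb
document_store.write_documents(docs)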

anakin87 avatar Nov 25 '22 17:11 anakin87