FAISSDocumentStore: delete_documents() does not work as expected when using Flat index
Describe the bug
When delete_documents(filters={...}) is used on a FAISSDocumentStore created with a Flat index, the vector_ids in the SQL database end up out of sync with the FAISS index, because the underlying FAISS remove_ids operation does not preserve the ids of the vectors but shifts them. This is described in the FAISS documentation:
Note that there is a semantic difference when removing ids from sequential indexes vs. when removing them from an IndexIVF:
- for sequential indexes (IndexFlat, IndexPQ, IndexLSH), the removal operation shifts the ids of vectors above the removed vector id.
- the IndexIVF and IndexIDMap2 store the ids of vectors explicitly, so the ids of other vectors are not changed.
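The difference between the two removal semantics can be illustrated with a small pure-Python model. This is not FAISS code: `SequentialModel` and `IdMapModel` are hypothetical stand-ins that only mimic how ids behave after a removal, as described in the quoted documentation.

```python
# Toy model of the two removal semantics quoted above.
# Neither class is part of FAISS; both are hypothetical stand-ins.


class SequentialModel:
    """Ids are implicit: a vector's id is its position in the list."""

    def __init__(self):
        self.vectors = []

    def add(self, vector):
        self.vectors.append(vector)

    def remove_ids(self, ids):
        ids = set(ids)
        # Dropping entries compacts the list, so later ids shift down.
        self.vectors = [v for i, v in enumerate(self.vectors) if i not in ids]

    def reconstruct(self, vector_id):
        return self.vectors[vector_id]


class IdMapModel:
    """Ids are stored explicitly, so a removal leaves other ids untouched."""

    def __init__(self):
        self.vectors = {}

    def add_with_ids(self, vector, vector_id):
        self.vectors[vector_id] = vector

    def remove_ids(self, ids):
        for vector_id in ids:
            self.vectors.pop(vector_id, None)

    def reconstruct(self, vector_id):
        return self.vectors[vector_id]


seq, mapped = SequentialModel(), IdMapModel()
for i in range(5):
    seq.add(f"vec{i}")
    mapped.add_with_ids(f"vec{i}", i)

seq.remove_ids([2])
mapped.remove_ids([2])

print(seq.reconstruct(3))     # vec4: id 3 now points at a different vector
print(mapped.reconstruct(3))  # vec3: id 3 is unchanged
```

Any external store that remembers "id 3" for a vector is still correct in the second model, but silently wrong in the first.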
The FAISSDocumentStore does not take this shift into account. The only "remedy" is to re-index the whole document store, which is not practical when deleting only a single entry. I suspect the delete operation is OK if a non-sequential index is used, but I have not tried it.
Error message N/A
Expected behavior Ideally the vector ids in the SQL database are adjusted to match the new sequence.
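One way such an adjustment could work is sketched below. `shifted_vector_ids` is a hypothetical helper, not part of Haystack; it assumes the vector ids are consecutive integers and that the index compacts ids on removal, as the quoted FAISS documentation describes for sequential indexes.

```python
from bisect import bisect_left


def shifted_vector_ids(vector_ids, removed_ids):
    """Return the vector id of each surviving document after a compacting removal.

    vector_ids:  dict of document id -> vector id (int) before the removal
    removed_ids: iterable of vector ids that were removed from the index
    """
    removed_sorted = sorted(set(removed_ids))
    removed_set = set(removed_sorted)
    remapped = {}
    for doc_id, vid in vector_ids.items():
        if vid in removed_set:
            continue  # this document's vector was deleted
        # Every removed id below vid moves this vector down by one slot.
        remapped[doc_id] = vid - bisect_left(removed_sorted, vid)
    return remapped


before = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
after = shifted_vector_ids(before, removed_ids=[1, 3])
print(after)  # {'a': 0, 'c': 1, 'e': 2}
```

In a real fix the remapped ids would have to be written back to the SQL database in the same step as the removal from the index; that part is not shown here.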
Additional context None
To Reproduce The following code demonstrates the problem
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever
docs = [
    {"content": "This is document 1", "meta": {"item_id": 1}},
    {"content": "This is document 2", "meta": {"item_id": 2}},
    {"content": "This is document 3", "meta": {"item_id": 3}},
    {"content": "This is document 4", "meta": {"item_id": 4}},
    {"content": "This is document 5", "meta": {"item_id": 5}},
    {"content": "This is document 6", "meta": {"item_id": 6}},
    {"content": "This is document 7", "meta": {"item_id": 7}},
]
index_path="my_faiss_index.faiss"
document_store = FAISSDocumentStore(embedding_dim=384)
document_store.save(index_path=index_path)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    model_format="sentence_transformers",
)
document_store.write_documents(documents=docs)
document_store.update_embeddings(retriever)
document_store.save(index_path=index_path)
assert document_store.get_embedding_count() == 7
# Get the data back with embeddings for later comparison
data1 = document_store.get_all_documents(return_embedding=True)
del document_store
import numpy as np
from pprint import pprint
document_store = FAISSDocumentStore.load(index_path=index_path)
# Store the loaded documents for later comparison with the original
data3 = document_store.get_all_documents(return_embedding=True)
# Find and delete the item_id with vector_id equal to 3 - we want to delete something in the middle
for doc in data1:
    if doc.meta['vector_id'] == "3":
        id_to_delete = doc.meta['item_id']
        break
document_store.delete_documents(filters={"item_id": [id_to_delete]})
document_store.save(index_path=index_path)
# Get the documents now that we have deleted a document
data2 = document_store.get_all_documents(return_embedding=True)
# Now compare the document embeddings in data1 (the original) and data2 (after a document was deleted)
def get_emb(documents, item_id):
    for doc in documents:
        if doc.meta['item_id'] == item_id:
            return doc.embedding, doc.meta['vector_id']
    return None, None
assert document_store.get_document_count() == 6
assert document_store.get_embedding_count() == 6
# Check to see that all embeddings are still the same
# Depending on actual circumstances one or more of the checks will be false
ids = [1, 2, 3, 4, 5, 6, 7]
ids.pop(ids.index(id_to_delete))
print(ids)
for item_id in ids:
    emb_orig, vid_orig = get_emb(data1, item_id)
    emb_new, vid_new = get_emb(data2, item_id)
    print(item_id, vid_orig, vid_new, np.array_equal(emb_orig, emb_new))
In my case the output of the above is:
[1, 2, 3, 4, 5, 7]
1 5 5 False
2 1 1 True
3 2 2 True
4 4 4 False
5 0 0 True
7 6 6 True
It is a bit surprising that the embeddings match for item_id = 7: its vector id 6 should no longer be defined, as there are only 6 items left in the store, so the match is most likely just residual data from the original FAISS index. Items 1 and 4, however, both have vector ids above 3 (the vector id of the deleted item), and their embeddings no longer match.
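The mismatches for items 1 and 4 are consistent with the shift described in the FAISS documentation. The sketch below replays the reported vector ids against a plain Python list standing in for the Flat index. This is an assumption made for illustration; it calls neither FAISS nor Haystack, and each "embedding" is just a label naming the item it belongs to.

```python
# item_id -> vector_id, taken from the output reported in this issue.
vector_ids = {1: 5, 2: 1, 3: 2, 4: 4, 5: 0, 6: 3, 7: 6}

# Stand-in for the Flat index: position i holds the embedding stored at id i.
index = [None] * len(vector_ids)
for item_id, vid in vector_ids.items():
    index[vid] = f"emb(item {item_id})"

# Deleting item 6 removes position 3 and shifts everything above it down,
# while the item_id -> vector_id mapping (the SQL side) stays as it was.
deleted_item = 6
del index[vector_ids[deleted_item]]

resolved = {}
for item_id, vid in sorted(vector_ids.items()):
    if item_id == deleted_item:
        continue
    resolved[item_id] = index[vid] if vid < len(index) else "out of range"
    status = "ok" if resolved[item_id] == f"emb(item {item_id})" else "MISMATCH"
    print(item_id, vid, resolved[item_id], status)
```

Under this model, the stale ids of items 1 and 4 resolve to the embeddings of items 7 and 1 respectively, and the id of item 7 falls outside the index, which fits the guess that its match comes from residual data.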
To check that the embeddings did indeed match before deleting anything, the original documents can be compared to the re-loaded documents in data3:
ids = [1, 2, 3, 4, 5, 6, 7]
for item_id in ids:
    emb_orig, vid_orig = get_emb(data1, item_id)
    emb_new, vid_new = get_emb(data3, item_id)
    print(item_id, vid_orig, vid_new, np.array_equal(emb_orig, emb_new))
In my case the output was:
1 5 5 True
2 1 1 True
3 2 2 True
4 4 4 True
5 0 0 True
6 3 3 True
7 6 6 True
FAQ Check
- [X] Have you had a look at our new FAQ page?
System:
- OS: Ubuntu 22.04
- GPU/CPU: CPU (Core i7-10700F)
- Haystack version (commit or version number): 1.21.2
- DocumentStore: FAISSDocumentStore
- Reader: N/A
- Retriever: EmbeddingRetriever
Same problem for me. I tried switching to HNSW index as it is also a recommended index option. However, the application crashes as HNSW does not implement the remove_ids function, which is called in line 569 of faiss.py when delete_documents is called.
I did try to use IDMap with Flat, creating the document store with faiss_index_factory_str="Flat,IDMap", to see if the document store supports deletes when backed by an IDMap. However, in that case the add_with_ids() function is needed instead of add(), and there is no logic to support that in the FAISSDocumentStore. I might try to rework the document store to see if it will work. For now, I've just decided not to delete documents.
@freikim Not sure if this issue I raised will help, but for some reason I had a different version of sqlalchemy that was causing me the same pain. https://github.com/deepset-ai/haystack/issues/6457
@jlonge4 I'm not @freikim, but since I have the same problem as he/she/* has: I am very sure that the issue you raised won't help. The issue reported here is an implementation issue, as Haystack does not update its SQL store properly when documents are removed from the FAISSDocumentStore. The reason is that when documents are deleted, the FAISS index changes the ids of its stored vectors, and these are not updated in the SQL database, which is also part of the FAISSDocumentStore.
@hansblafoo Ohhh yeah that's a bit worse 😅 thanks for the clarification