How to delete or update a document within a FAISS index?
Hi,
I have a use case where I have to fetch edited posts from a community weekly and update the corresponding docs within a FAISS index. Is that possible, or do I have to delete the index and create a new one every time?
I also use RecursiveCharacterTextSplitter to split the docs.
loader = DirectoryLoader('./recent_data')
raw_documents = loader.load()
# Splitting documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
documents = text_splitter.split_documents(raw_documents)
print(len(documents))
# Changing source to point to the original document
for x in documents:
    print(x.metadata["source"])
# Creating index and saving it to disk
print("Creating index")
db_new = FAISS.from_documents(documents, embeddings)
This is the output of print(db_new.docstore._dict):
{'2d9b6fbf-a44d-46b5-bcdf-b45cd9438a4c': Document(page_content='<p dir="auto">This is a test topic.</p>', metadata={'source': 'recent/https://community.tpsonline.com/topic/587/ignore-test-topic'}), '706dcaf8-f9d9-45b9-bdf4-8a8ac7618229': Document(page_content='What is an SDD?\n\n<p dir="auto">A software design description (a.k.a. software design document or SDD; just design document; also Software Design Specification) is a representation of a software design that is to be used for recording design information, addressing various design concerns, and communicating that information to the different stakeholders.</p>\n\n<p dir="auto">This SDD template represent design w.r.t various software viewpoints, where each viewpoint will handle specific concerns of Design. This is based on <strong>ISO 42010 standard</strong>.</p>\n\nIntroduction\n\n<p dir="auto">[Name/brief description of feature for which SDD is being Produced]</p>\n\n1. Context Viewpoint\n\n<p dir="auto">[Describes the relationships, dependencies, and interactions between the system and its environment ]</p>\n\n1.1 Use Cases\n\n1.1.1 AS IS (Pre Condition)\n\n1.1.2 TO - BE (Post Condition)\n\n1.2 System Context View\n\n1.2.1 - AS IS (Pre Condition)\n\n1.2.2 TO - BE (Post Condition)\n\n2. Logical Viewpoint', metadata={'source': 'recent/https://community.tpsonline.com/topic/586/software-design-description-sdd-template'}), '4d6d4e6b-01ee-46bb-ae06-84514a51baf2': Document(page_content='1.1 Use Cases\n\n1.1.1 AS IS (Pre Condition)\n\n1.1.2 TO - BE (Post Condition)\n\n1.2 System Context View\n\n1.2.1 - AS IS (Pre Condition)\n\n1.2.2 TO - BE (Post Condition)\n\n2. 
Logical Viewpoint\n\n<p dir="auto">[The purpose of the Logical viewpoint is to elaborate existing and designed types and their implementations as classes and interfaces with their structural static relationships]</p>\n\n2.1 Class Diagram\n\n2.1.1 AS - IS (Pre Condition)\n\n2.1.2 TO - BE (Post Condition)\n\n2.1.2.1 Class Interfaces and description\n\n<p dir="auto">[Below is being presented as an example]<br />\n\n[This section should tell about the responsibility of each class method and their parameters too if required]</p>\n\n2.1.2.1.1 IRenewProcess\n\nMethod\n\nDescription\n\nprocessRenewal\n\nMethod to process renewal of a given cardEntity. Each concrete class that will implement the interface will implement its own version of renewal steps\n\n2.1.2.1.1 RenewStrategyContext (static class)\n\nMethod\n\nDescription\n\n(private)getRenewalMethod', metadata={'source': 'recent/https://community.tpsonline.com/topic/586/software-design-description-sdd-template'})}
So, will I be able to update docs within the index, or is that just not possible?
I'm also interested in a way to update docs that are already in a vectorstore
I went and asked Kapa.ai, and here's the answer:
To update an existing FAISS vector store with a new version of your document, you can follow these steps:
- Remove the old version of the document from the vector store (if it's stored in the docstore).
- Create embeddings for the new version of the document.
- Add the new embeddings and the updated document to the vector store.
Here's an example using the add_texts method from the Python documentation(https://python.langchain.com/en/latest/reference/modules/vectorstore.html):
# Assuming you have an existing FAISS vector store `db` and an updated document text
updated_text = "your_updated_document_text_here"
# Remove the old version of the document from the vector store (if necessary)
# You may need to implement this step depending on your specific use case
# Create embeddings for the new version of the document
# This step may vary depending on the embeddings model you're using
# Add the new embeddings and the updated document to the vector store
db.add_texts([updated_text])
Keep in mind that this example assumes you have already set up a FAISS vector store (db) and have an embeddings model. If you need more information on setting up a FAISS vector store, you can refer to the Python documentation (https://python.langchain.com/en/latest/reference/modules/vectorstore.html).
Hello, the link you provided is a 404 error page. Is there anything else I can refer to?
You can probably use this: https://python.langchain.com/en/latest/reference/modules/vectorstores.html?highlight=add_texts#langchain.vectorstores.Annoy.add_texts
Remove the old version of the document from the vector store (if it's stored in the docstore).
How do I do this?
How to delete a document? Did it work for you?
Do I have to keep track of all doc ids myself in order to delete a document? How can I update or delete a doc, @trancethehuman?
You can also look at FAISS's docs for insert/modify/delete operations. I haven't seen LangChain's abstraction for this yet. https://github.com/facebookresearch/faiss/wiki/Special-operations-on-indexes
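import faiss
import numpy as np
# create an index
d = 64
index = faiss.IndexFlatL2(d)
# add some vectors
xb = faiss.rand((100, d))
index.add(xb)
# "update" vector 0: IndexFlatL2 has no replace() method,
# so remove the old vector and add the new one
new_vector = faiss.rand((1, d))
index.remove_ids(np.array([0], dtype=np.int64))
index.add(new_vector)
# removal shifts the remaining ids down by one,
# so the new vector now sits at the last position
print(index.reconstruct(index.ntotal - 1))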
faiss-cpu might be a bit different (maybe)
The langchain logic is not there yet. To make it right, I think this would require quite some effort.
Yet, as trancethehuman said, you can work this out directly with FAISS APIs.
You can also look at FAISS's docs for insert/modify/delete operations.
For example, metadata could be used to filter docs/embedding vectors to remove. See the example below.
import numpy as np
from langchain.vectorstores import FAISS


def remove(
    vectorstore: FAISS,
    target_metadata: dict
):
    # collect the docstore ids of every document whose metadata matches
    id_to_remove = []
    for _id, doc in vectorstore.docstore._dict.items():
        to_remove = True
        for k, v in target_metadata.items():
            if doc.metadata[k] != v:
                to_remove = False
                break
        if to_remove:
            id_to_remove.append(_id)
    docstore_id_to_index = {
        v: k for k, v in vectorstore.index_to_docstore_id.items()
    }
    n_removed = len(id_to_remove)
    n_total = vectorstore.index.ntotal
    for _id in id_to_remove:
        # remove the document from the docstore
        del vectorstore.docstore._dict[
            _id
        ]
        # remove the embedding from the index
        ind = docstore_id_to_index[_id]
        vectorstore.index.remove_ids(
            np.array([ind], dtype=np.int64)
        )
        # remove the index to docstore id mapping
        del vectorstore.index_to_docstore_id[
            ind
        ]
    # reorder the mapping
    vectorstore.index_to_docstore_id = {
        i: _id
        for i, _id in enumerate(vectorstore.index_to_docstore_id.values())
    }
    return n_removed, n_total
This could be a way around #3896 and #5065.
It would work but it loops over every vector in the store to compare. That would be horrible with large vectorstores.
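One way to avoid the full scan on every update, and to avoid tracking ids by hand, might be to build a reverse lookup from source to docstore ids once and reuse it afterwards. The sketch below assumes every chunk carries a metadata["source"] value, as in the original post. `Doc` is a hypothetical stand-in for LangChain's Document so the example runs on its own, and `build_source_lookup` is an illustrative helper, not a LangChain API.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_source_lookup(docstore_dict: Dict[str, Doc]) -> Dict[str, List[str]]:
    """Map each metadata['source'] value to the docstore ids of its chunks."""
    lookup = defaultdict(list)
    for doc_id, doc in docstore_dict.items():
        lookup[doc.metadata["source"]].append(doc_id)
    return dict(lookup)


# Toy data shaped like vectorstore.docstore._dict
docstore_dict = {
    "id-1": Doc("chunk 1", {"source": "topic/586"}),
    "id-2": Doc("chunk 2", {"source": "topic/586"}),
    "id-3": Doc("test", {"source": "topic/587"}),
}
lookup = build_source_lookup(docstore_dict)
print(lookup["topic/586"])
```

Building the lookup still costs one pass over the docstore, but after that finding all chunks of an edited post is a single dictionary access. The lookup has to be kept in sync whenever documents are added or removed.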
Hi @Xmaster6y. Thanks for your code snippet above. I tried to use it, but vectorstore.index.remove_ids seems to break the correspondence between the vectorstore and the actual Faiss index: after calling the function, I don't get the same value for vectorstore.index.ntotal and len(vectorstore.docstore._dict).
Shouldn't we do something like the following?
Your version:
for _id in id_to_remove:
    # remove the document from the docstore
    del vectorstore.docstore._dict[
        _id
    ]
    # remove the embedding from the index
    ind = docstore_id_to_index[_id]
    vectorstore.index.remove_ids(
        np.array([ind], dtype=np.int64)
    )
    # remove the index to docstore id mapping
    del vectorstore.index_to_docstore_id[
        ind
    ]
New version:
vectors_to_remove = []  ### Modification here ########################
for _id in id_to_remove:
    # remove the document from the docstore
    del vectorstore.docstore._dict[
        _id
    ]
    # remove the embedding from the index
    ind = docstore_id_to_index[_id]
    vectors_to_remove.append(ind)  ### Modification here ########################
    # remove the index to docstore id mapping
    del vectorstore.index_to_docstore_id[
        ind
    ]
vectorstore.index.remove_ids(
    np.array(vectors_to_remove, dtype=np.int64)
)  ### Modification here ########################
You are right my code fails to update docstore ids. It shouldn't be that hard to do btw.
But I think yours won't work since Faiss translates the indices when removing vectors.
@Xmaster6y The steps in the code below work fine for me for a list of ids (id_to_remove), and vectorstore.index.ntotal and len(vectorstore.docstore._dict) give the same value. Is there anything I am missing, especially regarding the step mentioned in the May 24 comment: "You are right my code fails to update docstore ids. It shouldn't be that hard to do btw."? Is it possible to add the missing step or complete the code below? Thank you so much!
for _id in id_to_remove:
    # remove the document from the docstore
    del vectorstore.docstore._dict[
        _id
    ]
    # remove the embedding from the index
    ind = docstore_id_to_index[_id]
    vectorstore.index.remove_ids(
        np.array([ind], dtype=np.int64)
    )
    # remove the index to docstore id mapping
    del vectorstore.index_to_docstore_id[
        ind
    ]
# reorder the mapping
vectorstore.index_to_docstore_id = {
    i: _id
    for i, _id in enumerate(vectorstore.index_to_docstore_id.values())
}
@kaushikusc
Before the for loop, I did id_to_remove.sort(key=lambda x: docstore_id_to_index[x], reverse=True) so that vectorstore.index.remove_ids(... ind ...) works well.
Because indices in vectorstore.index seem to be consecutive integers starting from 0, you have to remove the biggest index first.
For me, that deletion code seems to work. It removes from docstore, from the embeddings and from mapping.
Then you can (re-)add the documents of ids you just removed. That's an update.
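The effect of removal order can be illustrated with a small pure-Python model. It assumes a flat FAISS index behaves like a plain list in which removing an entry shifts every later entry down by one. `FlatIndexModel` is a hypothetical stand-in written for this sketch and does not call FAISS.

```python
class FlatIndexModel:
    """Toy stand-in for a sequential index: list position == vector id."""

    def __init__(self, vectors):
        self.vectors = list(vectors)

    def remove_ids(self, ids):
        # Drop all requested positions in one pass; survivors are renumbered 0..n-1.
        drop = set(ids)
        self.vectors = [v for i, v in enumerate(self.vectors) if i not in drop]


def remove_one_by_one(vectors, ids):
    """Remove positions one call at a time, in the order given."""
    index = FlatIndexModel(vectors)
    for i in ids:
        # Positions were computed before any removal, so later ones may be stale.
        index.remove_ids([i])
    return index.vectors


def remove_in_batch(vectors, ids):
    """Remove all positions in a single call."""
    index = FlatIndexModel(vectors)
    index.remove_ids(ids)
    return index.vectors


vectors = ["a", "b", "c", "d", "e"]
print(remove_one_by_one(vectors, [1, 2]))  # ascending: position 2 is stale after the first removal
print(remove_one_by_one(vectors, [2, 1]))  # descending: earlier positions are unaffected
print(remove_in_batch(vectors, [1, 2]))    # one call: no stale positions at all
```

In this model, ascending one-by-one removal drops "b" and "d" instead of "b" and "c", while descending order and a single batched call both leave "a", "d", "e".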
@Pixcoder Beware of Faiss overwriting indices. Removing the biggest indices first seems superfluous since you can remove multiple indices simultaneously. Re-indexing index_to_docstore_id is always necessary.
For the record, here is the current function I am using:
from typing import List, Optional

import numpy as np
from langchain.vectorstores import FAISS


def remove(vectorstore: FAISS, docstore_ids: Optional[List[str]]):
    """
    Function to remove documents from the vectorstore.

    Parameters
    ----------
    vectorstore : FAISS
        The vectorstore to remove documents from.
    docstore_ids : Optional[List[str]]
        The list of docstore ids to remove. If None, all documents are removed.

    Returns
    -------
    n_removed : int
        The number of documents removed.
    n_total : int
        The total number of documents in the vectorstore.

    Raises
    ------
    ValueError
        If there are duplicate ids in the list of ids to remove.
    """
    if docstore_ids is None:
        # empty the underlying dict instead of replacing the docstore object
        vectorstore.docstore._dict = {}
        vectorstore.index_to_docstore_id = {}
        n_removed = vectorstore.index.ntotal
        n_total = vectorstore.index.ntotal
        vectorstore.index.reset()
        return n_removed, n_total
    set_ids = set(docstore_ids)
    if len(set_ids) != len(docstore_ids):
        raise ValueError("Duplicate ids in list of ids to remove.")
    index_ids = [
        i_id
        for i_id, d_id in vectorstore.index_to_docstore_id.items()
        if d_id in set_ids
    ]
    n_removed = len(index_ids)
    n_total = vectorstore.index.ntotal
    vectorstore.index.remove_ids(np.array(index_ids, dtype=np.int64))
    for i_id, d_id in zip(index_ids, docstore_ids):
        del vectorstore.docstore._dict[
            d_id
        ]  # remove the document from the docstore
        del vectorstore.index_to_docstore_id[
            i_id
        ]  # remove the index to docstore id mapping
    vectorstore.index_to_docstore_id = {
        i: d_id
        for i, d_id in enumerate(vectorstore.index_to_docstore_id.values())
    }
    return n_removed, n_total
@Xmaster6y Thank you. Not sure, but does "for i_id, d_id in zip(index_ids, docstore_ids):" behave exactly like you intend it to behave when index_ids and docstore_ids are not of same length? Can happen if docstore_ids has ids not in vectorstore...
@Pixcoder This should never happen, as the only purpose of the docstore is to store documents indexed by the vectorstore index. I think you could raise an error if you start manipulating the docstore. I think you should avoid this manipulation if possible, but thanks for pointing this out.
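A minimal sketch of the "raise an error" idea, operating on plain dictionaries so it runs without LangChain or FAISS. `plan_removal` is a hypothetical helper, not part of either library: it rejects unknown ids up front, then returns the index positions to remove together with the renumbered mapping.

```python
from typing import Dict, List, Tuple


def plan_removal(
    index_to_docstore_id: Dict[int, str],
    docstore_ids: List[str],
) -> Tuple[List[int], Dict[int, str]]:
    """Return (index positions to remove, renumbered index -> docstore id mapping)."""
    wanted = set(docstore_ids)
    missing = wanted - set(index_to_docstore_id.values())
    if missing:
        # Fail before touching anything, so index and docstore stay in sync.
        raise KeyError(f"ids not in the vectorstore: {sorted(missing)}")
    positions = sorted(
        i for i, d_id in index_to_docstore_id.items() if d_id in wanted
    )
    survivors = [
        d_id for _, d_id in sorted(index_to_docstore_id.items()) if d_id not in wanted
    ]
    # Survivors are renumbered consecutively from 0, matching a sequential index.
    return positions, dict(enumerate(survivors))


mapping = {0: "a", 1: "b", 2: "c", 3: "d"}
positions, new_mapping = plan_removal(mapping, ["c", "a"])
print(positions, new_mapping)
```

Deriving both the positions and the surviving ids from the mapping itself avoids pairing two lists with zip, so a length mismatch cannot go unnoticed.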
There is an explanation in this video https://youtu.be/hSuCT6Z2QLk
@Xmaster6y Thanks for the code.
Should deleting an index also delete the embedding related to that index?
After deleting 3 of 5 indices and running index.reconstruct(id) for each id, I see that 4 embedding vectors are equal. I expected the total length to be 2 (as index.ntotal reports), not still 5.
Any ideas how to clean that up?