How to delete or update a document within a FAISS index?

Open sanasz91mdev opened this issue 3 years ago • 10 comments

Hi,

I have a use case where I need to fetch edited posts from a community forum every week and update the corresponding docs within a FAISS index. Is that possible, or do I have to delete the index and create a new one every time?

I also use RecursiveCharacterTextSplitter to split the docs.

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# `embeddings` is assumed to be an already-initialized embeddings model
loader = DirectoryLoader('./recent_data')
raw_documents = loader.load()
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
documents = text_splitter.split_documents(raw_documents)
print(len(documents))
# Print the source of each chunk
for x in documents:
    print(x.metadata["source"])
# Create the index
print("Creating index")
db_new = FAISS.from_documents(documents, embeddings)

This is the output of print(db_new.docstore._dict):

{'2d9b6fbf-a44d-46b5-bcdf-b45cd9438a4c': Document(page_content='<p dir="auto">This is a test topic.</p>', metadata={'source': 'recent/https://community.tpsonline.com/topic/587/ignore-test-topic'}), '706dcaf8-f9d9-45b9-bdf4-8a8ac7618229': Document(page_content='What is an SDD?\n\n<p dir="auto">A software design description (a.k.a. software design document or SDD; just design document; also Software Design Specification) is a representation of a software design that is to be used for recording design information, addressing various design concerns, and communicating that information to the different stakeholders.</p>\n\n<p dir="auto">This SDD template represent design w.r.t various software viewpoints, where each viewpoint will handle specific concerns of Design. This is based on <strong>ISO 42010 standard</strong>.</p>\n\nIntroduction\n\n<p dir="auto">[Name/brief description of feature for which SDD is being Produced]</p>\n\n1. Context Viewpoint\n\n<p dir="auto">[Describes the relationships, dependencies, and interactions between the system and its environment ]</p>\n\n1.1 Use Cases\n\n1.1.1 AS IS (Pre Condition)\n\n1.1.2 TO - BE (Post Condition)\n\n1.2 System Context View\n\n1.2.1 - AS IS (Pre Condition)\n\n1.2.2 TO - BE (Post Condition)\n\n2. Logical Viewpoint', metadata={'source': 'recent/https://community.tpsonline.com/topic/586/software-design-description-sdd-template'}), '4d6d4e6b-01ee-46bb-ae06-84514a51baf2': Document(page_content='1.1 Use Cases\n\n1.1.1 AS IS (Pre Condition)\n\n1.1.2 TO - BE (Post Condition)\n\n1.2 System Context View\n\n1.2.1 - AS IS (Pre Condition)\n\n1.2.2 TO - BE (Post Condition)\n\n2. 
Logical Viewpoint\n\n<p dir="auto">[The purpose of the Logical viewpoint is to elaborate existing and designed types and their implementations as classes and interfaces with their structural static relationships]</p>\n\n2.1 Class Diagram\n\n2.1.1 AS - IS (Pre Condition)\n\n2.1.2 TO - BE (Post Condition)\n\n2.1.2.1 Class Interfaces and description\n\n<p dir="auto">[Below is being presented as an example]<br />\n\n[This section should tell about the responsibility of each class method and their parameters too if required]</p>\n\n2.1.2.1.1 IRenewProcess\n\nMethod\n\nDescription\n\nprocessRenewal\n\nMethod to process renewal of a given cardEntity. Each concrete class that will implement the interface will implement its own version of renewal steps\n\n2.1.2.1.1 RenewStrategyContext (static class)\n\nMethod\n\nDescription\n\n(private)getRenewalMethod', metadata={'source': 'recent/https://community.tpsonline.com/topic/586/software-design-description-sdd-template'})}

So will I be able to update docs within the index, or is it just not possible?

sanasz91mdev avatar Apr 11 '23 06:04 sanasz91mdev

I'm also interested in a way to update docs that are already in a vectorstore

trancethehuman avatar Apr 16 '23 02:04 trancethehuman

I went and asked Kapa.ai, and here's the answer:

To update an existing FAISS vector store with a new version of your document, you can follow these steps:

  1. Remove the old version of the document from the vector store (if it's stored in the docstore).
  2. Create embeddings for the new version of the document.
  3. Add the new embeddings and the updated document to the vector store.

Here's an example using the add_texts method from the Python documentation (https://python.langchain.com/en/latest/reference/modules/vectorstore.html):

# Assuming you have an existing FAISS vector store `db` and an updated document text
updated_text = "your_updated_document_text_here"

# Remove the old version of the document from the vector store (if necessary)
# You may need to implement this step depending on your specific use case

# Create embeddings for the new version of the document
# This step may vary depending on the embeddings model you're using

# Add the new embeddings and the updated document to the vector store
db.add_texts([updated_text])

Keep in mind that this example assumes you have already set up a FAISS vector store (db) and have an embeddings model. If you need more information on setting up a FAISS vector store, you can refer to the Python documentation (https://python.langchain.com/en/latest/reference/modules/vectorstore.html).

trancethehuman avatar Apr 16 '23 02:04 trancethehuman

Hello, the link you provided is a 404 error page. Is there anything else I can refer to?

wulaoshi avatar Apr 28 '23 07:04 wulaoshi

You can probably use this: https://python.langchain.com/en/latest/reference/modules/vectorstores.html?highlight=add_texts#langchain.vectorstores.Annoy.add_texts

trancethehuman avatar Apr 28 '23 13:04 trancethehuman

"Remove the old version of the document from the vector store (if it's stored in the docstore)."

How do I do this?

computaco-inc avatar May 10 '23 14:05 computaco-inc

How to delete a document? Did it work for you?

virdi16 avatar May 11 '23 11:05 virdi16

Do I have to keep track of all the doc ids myself in order to delete a document? How can I update or delete a doc, @trancethehuman?

sanasz91mdev avatar May 11 '23 11:05 sanasz91mdev

You can also look at FAISS's docs for insert/modify/delete operations. I haven't seen LangChain's abstraction for this yet. https://github.com/facebookresearch/faiss/wiki/Special-operations-on-indexes

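import faiss
import numpy as np

# create an index
d = 64
index = faiss.IndexFlatL2(d)

# add some vectors
xb = np.random.rand(100, d).astype("float32")
index.add(xb)

# "update" vector 0: IndexFlatL2 has no replace method, so remove it and add the new one
new_vector = np.random.rand(1, d).astype("float32")
index.remove_ids(np.array([0], dtype=np.int64))
index.add(new_vector)

# the remaining vectors shift down after the removal, so the new vector is stored last
print(index.reconstruct(index.ntotal - 1))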

faiss-cpu might be a bit different (maybe)

trancethehuman avatar May 14 '23 14:05 trancethehuman

The langchain logic is not there yet. To make it right, I think this would require quite some effort.

Yet, as trancethehuman said, you can work this out directly with FAISS APIs.

You can also look at FAISS's docs for insert/modify/delete operations.

For example, metadata could be used to filter docs/embedding vectors to remove. See the example below.

import numpy as np
from langchain.vectorstores import FAISS

def remove(
    vectorstore: FAISS,
    target_metadata: dict
):
    id_to_remove = []
    for _id, doc in vectorstore.docstore._dict.items():
        to_remove = True
        for k, v in target_metadata.items():
            if doc.metadata[k] != v:
                to_remove = False
                break
        if to_remove:
            id_to_remove.append(_id)
    docstore_id_to_index = {
        v: k for k, v in vectorstore.index_to_docstore_id.items()
    }
    n_removed = len(id_to_remove)
    n_total = vectorstore.index.ntotal
    for _id in id_to_remove:
        # remove the document from the docstore
        del vectorstore.docstore._dict[
            _id
        ]
        # remove the embedding from the index
        ind = docstore_id_to_index[_id]
        vectorstore.index.remove_ids(
            np.array([ind], dtype=np.int64)
        ) 
        # remove the index to docstore id mapping
        del vectorstore.index_to_docstore_id[
            ind
        ] 
    # reorder the mapping
    vectorstore.index_to_docstore_id = {
        i: _id
        for i, _id in enumerate(vectorstore.index_to_docstore_id.values())
    }
    return n_removed, n_total

This could be a way around #3896 and #5065.

Xmaster6y avatar May 23 '23 14:05 Xmaster6y

It would work but it loops over every vector in the store to compare. That would be horrible with large vectorstores.
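One way to avoid rescanning the whole docstore on every deletion might be to build a reverse lookup once and reuse it. The following is a minimal sketch, assuming each document exposes a `metadata` dict with a `source` key as in the output earlier in this thread. `build_source_index` is a hypothetical helper, not part of LangChain, and the stand-in documents only exist so the snippet runs on its own.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_source_index(docstore_dict):
    """Map each metadata["source"] value to the docstore ids that carry it.

    `docstore_dict` is assumed to look like `vectorstore.docstore._dict`:
    a dict of docstore id -> object with a `.metadata` dict.
    """
    source_to_ids = defaultdict(list)
    for doc_id, doc in docstore_dict.items():
        source = doc.metadata.get("source")
        if source is not None:
            source_to_ids[source].append(doc_id)
    return dict(source_to_ids)


# Stand-in documents (hypothetical ids and sources) instead of real LangChain Documents.
fake_docstore = {
    "id-a": SimpleNamespace(metadata={"source": "topic/587"}),
    "id-b": SimpleNamespace(metadata={"source": "topic/586"}),
    "id-c": SimpleNamespace(metadata={"source": "topic/586"}),
}
source_index = build_source_index(fake_docstore)
ids_for_586 = source_index["topic/586"]
```

The scan still happens once, but each later lookup by source is a dict access. The lookup has to be rebuilt or updated whenever documents are added or removed.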

atisharma avatar May 23 '23 15:05 atisharma

Hi @Xmaster6y. Thanks for your code snippet above. I tried to use it, but vectorstore.index.remove_ids seems to break the correspondence between the vectorstore and the actual Faiss index: after calling the function, vectorstore.index.ntotal and len(vectorstore.docstore._dict) no longer match.

Shouldn't we do something like the following?

Your version:

    for _id in id_to_remove:
        # remove the document from the docstore
        del vectorstore.docstore._dict[
            _id
        ]
        # remove the embedding from the index
        ind = docstore_id_to_index[_id]
        vectorstore.index.remove_ids(
            np.array([ind], dtype=np.int64)
        ) 
        # remove the index to docstore id mapping
        del vectorstore.index_to_docstore_id[
            ind
        ]

New version:

    vectors_to_remove = [] ### Modification here ########################
    for _id in id_to_remove:
        # remove the document from the docstore
        del vectorstore.docstore._dict[
            _id
        ]
        # remove the embedding from the index
        ind = docstore_id_to_index[_id]
        vectors_to_remove.append(ind) ### Modification here ########################
        # remove the index to docstore id mapping
        del vectorstore.index_to_docstore_id[
            ind
        ]
    vectorstore.index.remove_ids(
        np.array(vectors_to_remove, dtype=np.int64)
    )  ### Modification here ########################
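To check whether the index, the id mapping and the docstore still agree after a removal, a small helper along these lines might be useful. It is a sketch that relies only on attributes already used in this thread (`index.ntotal`, `index_to_docstore_id`, `docstore._dict`). `is_consistent` is a hypothetical name, and the stand-in objects replace a real vector store so the snippet runs on its own.

```python
from types import SimpleNamespace


def is_consistent(vectorstore):
    """Return True if index, mapping and docstore describe the same entries.

    Checks that all three hold `ntotal` entries, that the mapping keys are the
    consecutive positions 0..ntotal-1, and that every mapped id is in the docstore.
    """
    n = vectorstore.index.ntotal
    mapping = vectorstore.index_to_docstore_id
    docs = vectorstore.docstore._dict
    return (
        len(mapping) == n
        and len(docs) == n
        and sorted(mapping) == list(range(n))
        and all(doc_id in docs for doc_id in mapping.values())
    )


# Stand-in for a store where everything lines up.
good = SimpleNamespace(
    index=SimpleNamespace(ntotal=2),
    index_to_docstore_id={0: "a", 1: "b"},
    docstore=SimpleNamespace(_dict={"a": "doc a", "b": "doc b"}),
)
# Stand-in for a store where a vector was removed but the bookkeeping was not updated.
stale = SimpleNamespace(
    index=SimpleNamespace(ntotal=1),
    index_to_docstore_id={0: "a", 1: "b"},
    docstore=SimpleNamespace(_dict={"a": "doc a", "b": "doc b"}),
)
good_ok = is_consistent(good)
stale_ok = is_consistent(stale)
```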

vivien000 avatar May 24 '23 13:05 vivien000

You are right my code fails to update docstore ids. It shouldn't be that hard to do btw.

But I think yours won't work since Faiss translates the indices when removing vectors.
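The effect of that index translation can be illustrated without Faiss. In the sketch below a plain Python list stands in for a flat index, on the assumption (as described in this thread) that the remaining vectors are renumbered to stay contiguous after each removal. The function name and the sample values are made up for illustration.

```python
def remove_positions_one_by_one(items, positions):
    """Delete positions sequentially from a list that compacts after each removal.

    `positions` are computed before any removal, the way a stale
    docstore-id-to-index lookup table would provide them.
    """
    remaining = list(items)
    for pos in positions:
        del remaining[pos]
    return remaining


vectors = ["v0", "v1", "v2", "v3", "v4"]
targets = [1, 3]  # intent: drop v1 and v3

# Smallest position first: after v1 is gone, position 3 now points at v4.
ascending = remove_positions_one_by_one(vectors, sorted(targets))
# Largest position first: earlier positions are unaffected by later removals.
descending = remove_positions_one_by_one(vectors, sorted(targets, reverse=True))
```

With stale positions, removing in ascending order drops the wrong element, while removing in descending order (or passing all positions in a single call) gives the intended result.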

Xmaster6y avatar May 24 '23 15:05 Xmaster6y

@Xmaster6y The steps below, taken from the code you provided, work fine for me with a list of ids (id_to_remove), and vectorstore.index.ntotal and len(vectorstore.docstore._dict) give the same value. Is there anything I am missing, especially regarding your comment from May 24: "You are right my code fails to update docstore ids. It shouldn't be that hard to do btw."? Would it be possible to add the missing step and complete the code below? Thank you so much!

for _id in id_to_remove:
    # remove the document from the docstore
    del vectorstore.docstore._dict[_id]
    # remove the embedding from the index
    ind = docstore_id_to_index[_id]
    vectorstore.index.remove_ids(
        np.array([ind], dtype=np.int64)
    )
    # remove the index to docstore id mapping
    del vectorstore.index_to_docstore_id[ind]
# reorder the mapping
vectorstore.index_to_docstore_id = {
    i: _id
    for i, _id in enumerate(vectorstore.index_to_docstore_id.values())
}

kaushikusc avatar Jun 30 '23 13:06 kaushikusc

@kaushikusc

Before the for loop, I did id_to_remove.sort(key=lambda x: docstore_id_to_index[x], reverse=True) so that vectorstore.index.remove_ids(... ind ...) works well.

Because indices in vectorstore.index seem to be consecutive integers starting from 0, you have to remove the biggest index first.

For me, that deletion code seems to work. It removes from docstore, from the embeddings and from mapping.

Then you can (re-)add the documents of ids you just removed. That's an update.
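The remove-then-re-add flow could be wrapped roughly as follows. This is only a sketch: `update_source` is a hypothetical helper, `remove_fn` stands for any removal function taking `(vectorstore, docstore_ids)` such as the ones posted in this thread, and the stand-in store at the bottom replaces a real FAISS vector store so the snippet runs without LangChain. It assumes the store offers `add_documents`, as LangChain vector stores do.

```python
from types import SimpleNamespace


def update_source(vectorstore, remove_fn, source, new_documents):
    """Replace every chunk whose metadata["source"] equals `source`.

    `remove_fn(vectorstore, docstore_ids)` is assumed to delete the given ids
    from the index, the mapping and the docstore.
    Returns the docstore ids that were removed.
    """
    old_ids = [
        doc_id
        for doc_id, doc in vectorstore.docstore._dict.items()
        if doc.metadata.get("source") == source
    ]
    if old_ids:
        remove_fn(vectorstore, old_ids)
    # Embed the new chunks and append them to the store.
    vectorstore.add_documents(new_documents)
    return old_ids


# Stand-in store and removal function, for illustration only.
added = []
fake_store = SimpleNamespace(
    docstore=SimpleNamespace(_dict={
        "a": SimpleNamespace(metadata={"source": "s1"}),
        "b": SimpleNamespace(metadata={"source": "s2"}),
    }),
    add_documents=added.extend,
)


def fake_remove(vs, ids):
    for doc_id in ids:
        del vs.docstore._dict[doc_id]


removed = update_source(fake_store, fake_remove, "s1", ["new chunk"])
```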

Pixcoder avatar Jul 03 '23 08:07 Pixcoder

@Pixcoder Beware of Faiss overwriting indices. Removing the biggest indices first seems superfluous since you can remove multiple indices simultaneously. Re-indexing index_to_docstore_id is always necessary.

For the record, here is the current function I am using:

from typing import List, Optional

import numpy as np
from langchain.vectorstores import FAISS

def remove(vectorstore: FAISS, docstore_ids: Optional[List[str]]):
    """
    Function to remove documents from the vectorstore.
    
    Parameters
    ----------
    vectorstore : FAISS
        The vectorstore to remove documents from.
    docstore_ids : Optional[List[str]]
        The list of docstore ids to remove. If None, all documents are removed.
    
    Returns
    -------
    n_removed : int
        The number of documents removed.
    n_total : int
        The total number of documents in the vectorstore.
    
    Raises
    ------
    ValueError
        If there are duplicate ids in the list of ids to remove.
    """
    if docstore_ids is None:
        vectorstore.docstore._dict = {}  # empty the docstore but keep the docstore object
        vectorstore.index_to_docstore_id = {}
        n_removed = vectorstore.index.ntotal
        n_total = vectorstore.index.ntotal
        vectorstore.index.reset()
        return n_removed, n_total
    set_ids = set(docstore_ids)
    if len(set_ids) != len(docstore_ids):
        raise ValueError("Duplicate ids in list of ids to remove.")
    index_ids = [
        i_id
        for i_id, d_id in vectorstore.index_to_docstore_id.items()
        if d_id in set_ids
    ]
    n_removed = len(index_ids)
    n_total = vectorstore.index.ntotal
    vectorstore.index.remove_ids(np.array(index_ids, dtype=np.int64))
    for i_id, d_id in zip(index_ids, docstore_ids):
        del vectorstore.docstore._dict[
            d_id
        ]  # remove the document from the docstore

        del vectorstore.index_to_docstore_id[
            i_id
        ]  # remove the index to docstore id mapping
    vectorstore.index_to_docstore_id = {
        i: d_id
        for i, d_id in enumerate(vectorstore.index_to_docstore_id.values())
    }
    return n_removed, n_total

Xmaster6y avatar Jul 03 '23 12:07 Xmaster6y

@Xmaster6y Thank you. I'm not sure, but does "for i_id, d_id in zip(index_ids, docstore_ids):" behave as you intend when index_ids and docstore_ids are not the same length? That can happen if docstore_ids contains ids that are not in the vectorstore...

Pixcoder avatar Jul 03 '23 20:07 Pixcoder

@Pixcoder This should never happen, as the only purpose of the docstore is to store documents indexed by the vectorstore index. I think you could raise an error if you start manipulating the docstore. I think you should avoid this manipulation if possible, but thanks for pointing this out.
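If raising an error (or at least reporting unknown ids) is wanted, the pairing could be derived from the mapping itself instead of from zip. A minimal sketch, assuming `index_to_docstore_id` is a plain dict of index position to docstore id as used in this thread; `resolve_ids` is a hypothetical helper and the sample ids are made up.

```python
def resolve_ids(index_to_docstore_id, docstore_ids):
    """Split requested docstore ids into matched pairs and unknown ids.

    Returns (pairs, unknown), where pairs is a list of
    (index position, docstore id) tuples in index order and unknown lists
    the requested ids that the mapping does not contain.
    """
    requested = set(docstore_ids)
    pairs = [
        (i_id, d_id)
        for i_id, d_id in index_to_docstore_id.items()
        if d_id in requested
    ]
    found = {d_id for _, d_id in pairs}
    unknown = [d_id for d_id in docstore_ids if d_id not in found]
    return pairs, unknown


mapping = {0: "a", 1: "b", 2: "c"}
pairs, unknown = resolve_ids(mapping, ["c", "zzz", "a"])
```

A caller could raise a ValueError when `unknown` is non-empty, and otherwise delete using `pairs`, so each index position is always matched with its own docstore id.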

Xmaster6y avatar Jul 03 '23 21:07 Xmaster6y

There is an explanation in this video https://youtu.be/hSuCT6Z2QLk

GihanMora avatar Aug 14 '23 12:08 GihanMora

@Xmaster6y Thanks for the code. Should deleting an index entry also delete the embedding associated with it? After deleting 3 of 5 entries and running index.reconstruct(id) for each id, I see that 4 embedding vectors are equal. I expected the total length to be 2 (which is what index.ntotal reports), not still 5. Any ideas how to clean that up?

chirico85 avatar Aug 18 '23 07:08 chirico85