How to use faiss local saving/loading

Open bi1yeu opened this issue 2 years ago • 10 comments

Hi, I see that functionality for saving/loading FAISS index data was recently added in https://github.com/hwchase17/langchain/pull/676

I just tried using local FAISS save/load, but I'm having some trouble. My use case is that I want to save some embedding vectors to disk and then rebuild the search index later from the saved file. I'm not sure how to do this; when I build a new index and then attempt to load data from disk, subsequent searches appear not to use the data loaded from disk.

In the example below (using langchain==0.0.73), I...

  • build an index from texts ["a"]
  • save that index to disk
  • build a placeholder index from texts ["b"]
  • attempt to read the original ["a"] index from disk
  • the new index still returns text "b", though
    • "b" was just a placeholder text I used to construct the index object before loading the data I wanted from disk. I expected the index data to be overwritten by "a", but that doesn't seem to be the case

I think I might be missing something, so any advice for working with this API would be appreciated. Great library btw!

import tempfile
from typing import List

from langchain.embeddings.base import Embeddings
from langchain.vectorstores.faiss import FAISS


class FakeEmbeddings(Embeddings):
    """Fake embeddings functionality for testing."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Return simple embeddings."""
        return [[i] * 10 for i in range(len(texts))]

    def embed_query(self, text: str) -> List[float]:
        """Return simple embeddings."""
        return [0] * 10


index = FAISS.from_texts(["a"], FakeEmbeddings())
print(index.similarity_search("a", 1))
# [Document(page_content='a', lookup_str='', metadata={}, lookup_index=0)]

file = tempfile.NamedTemporaryFile()
index.save_local(file.name)

new_index = FAISS.from_texts(["b"], FakeEmbeddings())
new_index.load_local(file.name)
print(new_index.similarity_search("a", 1))
# [Document(page_content='b', lookup_str='', metadata={}, lookup_index=0)]

bi1yeu avatar Jan 28 '23 22:01 bi1yeu

I too am looking for a canonical example of saving and (later) loading indexes. Please share if you have gotten past this issue.

aurotripathy avatar Feb 04 '23 00:02 aurotripathy

Dealing with the same issue here. I searched on GitHub and could only find this test: https://github.com/hwchase17/langchain/blob/0b9f086d3632992e7e15b2e8b62177338bd7c7b3/tests/integration_tests/vectorstores/test_faiss.py#L87

where they seem to use the function the same way OP posted.

I did some poking around and found that the function used to search a query relies on self.index_to_docstore_id and self.docstore, neither of which is updated when load_local is called.

def similarity_search_with_score(
    self, query: str, k: int = 4
) -> List[Tuple[Document, float]]:
    """Return docs most similar to query.

    Args:
        query: Text to look up documents similar to.
        k: Number of Documents to return. Defaults to 4.

    Returns:
        List of Documents most similar to the query and score for each
    """
    embedding = self.embedding_function(query)
    scores, indices = self.index.search(np.array([embedding], dtype=np.float32), k)
    docs = []
    for j, i in enumerate(indices[0]):
        if i == -1:
            # This happens when not enough docs are returned.
            continue
        _id = self.index_to_docstore_id[i]
        doc = self.docstore.search(_id)
        if not isinstance(doc, Document):
            raise ValueError(f"Could not find document for id {_id}, got {doc}")
        docs.append((doc, scores[0][j]))
    return docs


def load_local(self, path: str) -> None:
    """Load FAISS index from disk.

    Args:
        path: Path to load FAISS index from.
    """
    faiss = dependable_faiss_import()
    self.index = faiss.read_index(path)
ShreyJ1729 avatar Feb 04 '23 06:02 ShreyJ1729

Below is how FAISS builds its index from texts. It looks like both the embeddings and the texts are needed to build up the docstore and the other relevant class attributes.

@classmethod
def from_texts(
    cls,
    texts: List[str],
    embedding: Embeddings,
    metadatas: Optional[List[dict]] = None,
    **kwargs: Any,
) -> FAISS:
    """Construct FAISS wrapper from raw documents.

    This is a user friendly interface that:
        1. Embeds documents.
        2. Creates an in memory docstore
        3. Initializes the FAISS database

    This is intended to be a quick way to get started.

    Example:
        .. code-block:: python

            from langchain import FAISS
            from langchain.embeddings import OpenAIEmbeddings
            embeddings = OpenAIEmbeddings()
            faiss = FAISS.from_texts(texts, embeddings)
    """
    faiss = dependable_faiss_import()
    embeddings = embedding.embed_documents(texts)
    index = faiss.IndexFlatL2(len(embeddings[0]))
    index.add(np.array(embeddings, dtype=np.float32))
    documents = []
    for i, text in enumerate(texts):
        metadata = metadatas[i] if metadatas else {}
        documents.append(Document(page_content=text, metadata=metadata))
    index_to_id = {i: str(uuid.uuid4()) for i in range(len(documents))}
    docstore = InMemoryDocstore(
        {index_to_id[i]: doc for i, doc in enumerate(documents)}
    )
    return cls(embedding.embed_query, index, docstore, index_to_id)

ShreyJ1729 avatar Feb 04 '23 06:02 ShreyJ1729

Check out https://github.com/hwchase17/langchain/pull/880, I believe it solves this issue.

ShreyJ1729 avatar Feb 04 '23 06:02 ShreyJ1729

OK, will try it out. I don't know enough to give immediate feedback, except to recommend changing the docstring to Load (it currently says Save):

def load_local(self, path: str) -> None:
    """Save FAISS index, docstore, and index_to_docstore_id to disk.

    Args:
        path: .pkl path to load index, docstore, and index_to_docstore_id from.
    """
    self.index, self.docstore, self.index_to_docstore_id = pickle.load(
        open(path, "rb")
    )

aurotripathy avatar Feb 04 '23 15:02 aurotripathy

Ah good point, I just updated the branch and also added some new code to save the data in a path instead of a single pkl file. Apparently the current approach runs the risk of saving the index pointer instead of the actual data.
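
For intuition, the directory-based layout can be sketched roughly like this (save_faiss_store is a hypothetical standalone helper written for illustration, not the actual method from the PR; it assumes the index, docstore, and index_to_docstore_id attributes shown earlier in the thread):

import os
import pickle

import faiss


def save_faiss_store(store, folder_path: str) -> None:
    """Write a FAISS vectorstore to folder_path as index.faiss + index.pkl."""
    os.makedirs(folder_path, exist_ok=True)
    # write_index serializes the actual vectors to disk, rather than
    # pickling a pointer to the in-memory index object
    faiss.write_index(store.index, os.path.join(folder_path, "index.faiss"))
    # the docstore and the index -> docstore-id map are plain Python objects,
    # so pickling them alongside the raw index is enough to restore searches
    with open(os.path.join(folder_path, "index.pkl"), "wb") as f:
        pickle.dump((store.docstore, store.index_to_docstore_id), f)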

ShreyJ1729 avatar Feb 04 '23 19:02 ShreyJ1729

Hey @ShreyJ1729, can you provide a script showing how to use the save feature with your changes?

gd1m3y avatar Feb 06 '23 10:02 gd1m3y

Sure, it's pretty much the same as before. Let's say you generate some embeddings and you want to save them. You just run this:

index = FAISS.from_texts(["a"], FakeEmbeddings())
index.save_local("filename")

where "filename" is the name of the directory that the save_local function will create. Inside, you'll find an index.faiss and index.pkl file, containing the index, and the docstore + id map.

Loading is the same, you pass in the directory name.
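
A full round trip would then look roughly like this (a sketch assuming the instance-method load_local from PR #880 and the FakeEmbeddings class from the original post; at this point a placeholder instance is still needed before loading):

# build and save
index = FAISS.from_texts(["a"], FakeEmbeddings())
index.save_local("filename")  # creates filename/index.faiss and filename/index.pkl

# load into a placeholder instance, then search against the loaded data
new_index = FAISS.from_texts(["placeholder"], FakeEmbeddings())
new_index.load_local("filename")
print(new_index.similarity_search("a", 1))
# expected: [Document(page_content='a', ...)]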

ShreyJ1729 avatar Feb 06 '23 23:02 ShreyJ1729

@ShreyJ1729 @xloem what do you think of this? I updated the loading a bit so that you can load via a classmethod, and added an example of doing so in the notebook: #916

hwchase17 avatar Feb 07 '23 02:02 hwchase17

@hwchase17 +1 for making load a classmethod -- having to build a placeholder FAISS instance was awkward. I haven't tested this yet, but it looks like it would work. Thank you, and thanks @ShreyJ1729!

Edit: I tried out the latest tagged version and saving/loading is working as expected, thanks again!
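
For anyone landing here later, the round trip with the classmethod version looks roughly like this (a sketch assuming the FakeEmbeddings class from the original post and the two-argument load_local signature mentioned further down in this thread):

index = FAISS.from_texts(["a"], FakeEmbeddings())
index.save_local("faiss_index")

# no placeholder instance needed anymore; load_local is now a classmethod
loaded_index = FAISS.load_local("faiss_index", FakeEmbeddings())
print(loaded_index.similarity_search("a", 1))
# expected: [Document(page_content='a', ...)]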

bi1yeu avatar Feb 07 '23 02:02 bi1yeu

Question: why do I need to pass Embeddings again as a second argument to the load function? Isn't the index already embedded? When I load from disk, does it need to embed everything again? For example: loaded_index = FAISS.load_local('my_index.faiss', OpenAIEmbeddings()). I have to pass the second argument, otherwise it doesn't work. Why?

jboverio avatar Apr 11 '23 23:04 jboverio

@jboverio The saved index only stores the document vectors; without the embedding function, the FAISS wrapper has no way to embed your query text in order to search the index.
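
To make that concrete, here is a minimal sketch (CountingEmbeddings is a hypothetical test helper, and "faiss_index" is assumed to be a directory written earlier by save_local): loading itself makes no embedding calls, and only the query text gets embedded at search time.

from typing import List

from langchain.embeddings.base import Embeddings
from langchain.vectorstores.faiss import FAISS


class CountingEmbeddings(Embeddings):
    """Fake embeddings that count how often each method is called."""

    def __init__(self):
        self.document_calls = 0
        self.query_calls = 0

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        self.document_calls += 1
        return [[float(i)] * 10 for i in range(len(texts))]

    def embed_query(self, text: str) -> List[float]:
        self.query_calls += 1
        return [0.0] * 10


emb = CountingEmbeddings()
index = FAISS.load_local("faiss_index", emb)  # no embedding calls happen here
print(emb.document_calls, emb.query_calls)    # 0 0

index.similarity_search("a", 1)               # only the query string is embedded
print(emb.document_calls, emb.query_calls)    # 0 1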

ShreyJ1729 avatar Apr 11 '23 23:04 ShreyJ1729

But this doesn't generate another API call to the embedding model from OpenAI, correct?

jboverio avatar Apr 11 '23 23:04 jboverio