How to use faiss local saving/loading
Hi, I see that functionality for saving/loading FAISS index data was recently added in https://github.com/hwchase17/langchain/pull/676
I just tried using local FAISS save/load, but I'm having some trouble. My use case is that I want to save some embedding vectors to disk and then rebuild the search index later from the saved file. I'm not sure how to do this: when I build a new index and then attempt to load data from disk, subsequent searches appear not to use the data loaded from disk.
In the example below (using langchain==0.0.73), I:
- build an index from texts ["a"]
- save that index to disk
- build a placeholder index from texts ["b"]
- attempt to read the original ["a"] index from disk
- the new index returns text "b" though - this was just a placeholder text I used to construct the index object before loading the data I wanted from disk. I expected that the index data would be overwritten by "a", but that doesn't seem to be the case.
I think I might be missing something, so any advice for working with this API would be appreciated. Great library btw!
import tempfile
from typing import List

from langchain.embeddings.base import Embeddings
from langchain.vectorstores.faiss import FAISS


class FakeEmbeddings(Embeddings):
    """Fake embeddings functionality for testing."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Return simple embeddings."""
        return [[i] * 10 for i in range(len(texts))]

    def embed_query(self, text: str) -> List[float]:
        """Return simple embeddings."""
        return [0] * 10
index = FAISS.from_texts(["a"], FakeEmbeddings())
print(index.similarity_search("a", 1))
# [Document(page_content='a', lookup_str='', metadata={}, lookup_index=0)]
file = tempfile.NamedTemporaryFile()
index.save_local(file.name)
new_index = FAISS.from_texts(["b"], FakeEmbeddings())
new_index.load_local(file.name)
print(new_index.similarity_search("a", 1))
# [Document(page_content='b', lookup_str='', metadata={}, lookup_index=0)]
I too am looking for a canonical example for saving and (later) loading indexes. Please share if you have gotten past this issue.
Dealing with the same issue here. Searched on GitHub and could only find this test file: https://github.com/hwchase17/langchain/blob/0b9f086d3632992e7e15b2e8b62177338bd7c7b3/tests/integration_tests/vectorstores/test_faiss.py#L87
where they seem to be using the function as OP posted.
I did some poking around and found that the function used to search a query uses self.index_to_docstore_id and self.docstore, both of which are not updated when load_local is called.
def similarity_search_with_score(
    self, query: str, k: int = 4
) -> List[Tuple[Document, float]]:
    """Return docs most similar to query.

    Args:
        query: Text to look up documents similar to.
        k: Number of Documents to return. Defaults to 4.

    Returns:
        List of Documents most similar to the query and score for each
    """
    embedding = self.embedding_function(query)
    scores, indices = self.index.search(np.array([embedding], dtype=np.float32), k)
    docs = []
    for j, i in enumerate(indices[0]):
        if i == -1:
            # This happens when not enough docs are returned.
            continue
        _id = self.index_to_docstore_id[i]
        doc = self.docstore.search(_id)
        if not isinstance(doc, Document):
            raise ValueError(f"Could not find document for id {_id}, got {doc}")
        docs.append((doc, scores[0][j]))
    return docs

def load_local(self, path: str) -> None:
    """Load FAISS index from disk.

    Args:
        path: Path to load FAISS index from.
    """
    faiss = dependable_faiss_import()
    self.index = faiss.read_index(path)
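To see why restoring only self.index is not enough, the lookup path can be sketched with plain dicts standing in for the real attributes (names here are stand-ins, not langchain code): the raw FAISS search returns integer positions, and resolving those to documents needs both maps.

```python
# Minimal sketch of the wrapper's search bookkeeping. A FAISS search
# returns integer positions; turning them into documents requires
# index_to_docstore_id and the docstore -- neither of which the
# original load_local restores.
index_to_docstore_id = {0: "uuid-a"}   # vector position -> document id
docstore = {"uuid-a": "a"}             # document id -> stored text

def lookup(positions):
    """Resolve raw index positions to stored documents, skipping -1."""
    return [docstore[index_to_docstore_id[i]] for i in positions if i != -1]

print(lookup([0, -1]))  # -> ['a']  (-1 means "not enough results")
```

If load_local only swaps out self.index, these two maps still point at whatever placeholder texts the instance was built with, which is exactly the "b" result OP observed.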
Below is how FAISS builds its index from texts. It looks like storing both the embeddings and the text is needed to build up the docstore and the other relevant class variables.
@classmethod
def from_texts(
    cls,
    texts: List[str],
    embedding: Embeddings,
    metadatas: Optional[List[dict]] = None,
    **kwargs: Any,
) -> FAISS:
    """Construct FAISS wrapper from raw documents.

    This is a user friendly interface that:
        1. Embeds documents.
        2. Creates an in memory docstore
        3. Initializes the FAISS database

    This is intended to be a quick way to get started.

    Example:
        .. code-block:: python

            from langchain import FAISS
            from langchain.embeddings import OpenAIEmbeddings

            embeddings = OpenAIEmbeddings()
            faiss = FAISS.from_texts(texts, embeddings)
    """
    faiss = dependable_faiss_import()
    embeddings = embedding.embed_documents(texts)
    index = faiss.IndexFlatL2(len(embeddings[0]))
    index.add(np.array(embeddings, dtype=np.float32))
    documents = []
    for i, text in enumerate(texts):
        metadata = metadatas[i] if metadatas else {}
        documents.append(Document(page_content=text, metadata=metadata))
    index_to_id = {i: str(uuid.uuid4()) for i in range(len(documents))}
    docstore = InMemoryDocstore(
        {index_to_id[i]: doc for i, doc in enumerate(documents)}
    )
    return cls(embedding.embed_query, index, docstore, index_to_id)
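The bookkeeping at the end of from_texts can be mimicked with just the standard library (a sketch only; the real code wraps the dict in InMemoryDocstore and stores Document objects, not raw strings):

```python
import uuid

# Sketch of from_texts bookkeeping: each text gets a random UUID,
# positions map to UUIDs (index_to_id), and UUIDs map to the stored
# texts (docstore).
texts = ["a", "b"]
index_to_id = {i: str(uuid.uuid4()) for i in range(len(texts))}
docstore = {index_to_id[i]: text for i, text in enumerate(texts)}

# A search hit at position 1 resolves back to "b" through both maps.
assert docstore[index_to_id[1]] == "b"
```

This is why a complete save/load has to persist the docstore and the id map alongside the raw FAISS index.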
Check out https://github.com/hwchase17/langchain/pull/880, I believe it solves this issue.
OK, will try it out
Don't know enough to give immediate feedback, except to recommend changing the docstring to "Load" (from "Save"):
def load_local(self, path: str) -> None:
    """Save FAISS index, docstore, and index_to_docstore_id to disk.

    Args:
        path: .pkl path to load index, docstore, and index_to_docstore_id from.
    """
    self.index, self.docstore, self.index_to_docstore_id = pickle.load(
        open(path, "rb")
    )
Ah good point, I just updated the branch and also added some new code to save the data in a path instead of a single pkl file. Apparently the current approach runs the risk of saving the index pointer instead of the actual data.
Hey @ShreyJ1729, can you provide a script showing how to use the save feature with your changes?
Sure, it's pretty much the same as before. Let's say you generate some embeddings and you want to save them; you just run this:
index = FAISS.from_texts(["a"], FakeEmbeddings())
index.save_local("filename")
where "filename" is the name of the directory that the save_local function will create. Inside, you'll find an index.faiss and index.pkl file, containing the index, and the docstore + id map.
Loading is the same, you pass in the directory name.
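To illustrate the directory layout being described (a sketch only; the real save_local writes the index with faiss.write_index rather than raw bytes):

```python
import os
import pickle
import tempfile

# Sketch of the described layout: index.faiss holds the vector index
# (faked here as raw bytes), index.pkl holds the docstore + id map.
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "index.faiss"), "wb") as f:
    f.write(b"stand-in for faiss.write_index output")
with open(os.path.join(folder, "index.pkl"), "wb") as f:
    pickle.dump(({"uuid-a": "a"}, {0: "uuid-a"}), f)

print(sorted(os.listdir(folder)))  # -> ['index.faiss', 'index.pkl']
```

Splitting the files this way avoids pickling the FAISS index object itself, which is the "index pointer" risk mentioned above.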
@ShreyJ1729 @xloem what do you think of this? updated the loading a bit so that you could load as a class method, and added an example in the notebook of doing so #916
@hwchase17 +1 for making load a classmethod -- having to build a placeholder faiss instance was awkward. i haven't tested this yet but it looks like it would work, thank you and @ShreyJ1729
edit: i tried out the latest tagged version and saving/loading is working as expected, thanks again!
Question: why do I need to pass Embeddings again as a second argument to the load function? Isn't the index already embedded? When I load from hard disk, does it need to embed everything again? For example: loaded_index = FAISS.load_local('my_index.faiss', OpenAIEmbeddings()). I need to pass the second argument or otherwise it doesn't work. Why?
@jboverio Without information about the embedding scheme, the FAISS wrapper can't correctly embed new queries to search your index.
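In other words, loading doesn't re-embed the stored documents; the embeddings object is only invoked on new queries. A toy sketch (fake_embed_query is a hypothetical stand-in, not the OpenAI class):

```python
calls = []

def fake_embed_query(text):
    """Stand-in query embedder; records each invocation."""
    calls.append(text)
    return [0.0] * 10

# Vectors loaded from disk come back as-is -- no embedding calls here.
stored_vectors = [[0.0] * 10]

# Only a new search embeds anything, and only the query string itself.
query_vec = fake_embed_query("what is a?")

assert calls == ["what is a?"]  # exactly one embedding call, for the query
```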
But this doesn't generate another API call to the OpenAI embedding model, correct?