Saving and loading embeddings from Chroma
Issue with current documentation:
# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
# load the document and split it into chunks
loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()
# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)
# query it
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
# print results
print(docs[0].page_content)
# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()
docs = db.similarity_search(query)
# load from disk
db3 = Chroma(persist_directory="./chroma_db")
docs = db.similarity_search(query)
print(docs[0].page_content)
Idea or request for content:
In the code above, I find this passage difficult to understand:
# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()
docs = db.similarity_search(query)
# load from disk
db3 = Chroma(persist_directory="./chroma_db")
docs = db.similarity_search(query)
print(docs[0].page_content)
Although db2 and db3 do demonstrate saving and loading with Chroma, the two lines docs = db.similarity_search(query) have nothing to do with saving or loading: they still search for answers in the original, in-memory db.
Is this an error?
I feel the question makes a lot of sense. Would you expect something like this?
# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
# load the document and split it into chunks
loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()
# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)
# query it
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
# print results
print(docs[0].page_content)
# save to disk
# Note: The following code is demonstrating how to save the Chroma database to disk.
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()
# load from disk
# Note: The following code is demonstrating how to load the Chroma database from disk.
db3 = Chroma(persist_directory="./chroma_db")
# perform a similarity search on the loaded database
# Note: This is to demonstrate that the loaded database is functioning correctly.
docs = db3.similarity_search(query)
print(docs[0].page_content)
I tested it; you need to pass an embedding_function parameter to Chroma,
like this: Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
Then it runs.
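For completeness, here is a sketch of the corrected tail of the example with that parameter in place (same names, paths, and calls as in the snippets above):
# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()
# load from disk -- the embedding function is not serialized with the index,
# so it has to be supplied again here
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
# search the loaded store to confirm it works
docs = db3.similarity_search(query)
print(docs[0].page_content)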
Yes, I have a similar question: when I load vectors from the db, why do I still need to pass an embedding parameter? docSearch = Chroma(persist_directory="D:/vector_store", embedding_function=embeddings)
I thought the embedding_function parameter was unnecessary, but when I run the code it fails without it. Can anyone explain why?
I had the same issue here. Thanks @Lufffya !
But it is very strange that you have to attach the embedding model to the Chroma database itself, rather than to the search query...
Yes, I have a similar question: I just want to search the existing indexed docs, so why do I need to pass the embedding_function?
Because the search input itself needs to be run through the embedding_function to produce the query embedding, I guess.
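To spell that out: the persisted index only stores document vectors; at query time the raw query string still has to be embedded with the same model before a nearest-neighbour lookup can happen. A small sketch of what similarity_search does conceptually, using the public embed_query and similarity_search_by_vector methods (db3 and embedding_function are the names from the snippets above):
# The query text must be converted to a vector with the SAME embedding
# model that produced the stored document vectors; otherwise the
# distance comparison against the index would be meaningless.
query_vector = embedding_function.embed_query(query)
# Roughly what db3.similarity_search(query) does internally:
docs = db3.similarity_search_by_vector(query_vector, k=4)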
The following line of code
db2.persist()
is missing from the current langchain documentation (https://python.langchain.com/docs/integrations/vectorstores/chroma)
Hi, @Lufffya! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue you raised was about confusion regarding the code snippet provided for saving and loading embeddings from Chroma. You were finding it difficult to understand why the code still searches for answers from the db after saving and loading, and you questioned if this was an error. However, it seems that the issue has been resolved by passing a parameter embedding_function to Chroma. This resolves the confusion regarding the code snippet searching for answers from the db after saving and loading. You tested the code and confirmed that passing embedding_function resolves the issue. Other users also had similar questions and confirmed that passing embedding_function is necessary. Additionally, it was pointed out that the line db2.persist() is missing from the documentation.
Now, we would like to know if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or the issue will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository! Let us know if you have any further questions or concerns.
When using
vectorstore = Chroma(persist_directory=sys.argv[1]+"-db", embedding_function=emb)
with emb = embeddings.ollama.OllamaEmbeddings(model='nomic-embed-text'), retriever = vectorstore.as_retriever(), and
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | local_llm
    | StrOutputParser()
)
the model responds that the context is empty.
If, on the other hand, I create the vectorstore using
vectorstore = Chroma.from_documents(
    documents=documents,
    collection_name=collection_name,
    embedding=emb,
    persist_directory=sys.argv[1]+"-db",
)
the model gets a context.
How come?
I am also facing the same issue! Any idea why it is so?
Hey @Bardo-Konrad, I'm facing the same issue. Can you please let me know if it's working now?
Hey @nidhin-krishnakumar, you have to include the collection name when loading from disk. Here's the working version:
# Saving the data
vector_db_dir = "chroma_vector_db"
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag",
    persist_directory=vector_db_dir,
)
# Loading the data -- collection_name must match the one used when saving
load_vector_db = Chroma(
    persist_directory=vector_db_dir,
    embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="local-rag",
)
retriever = MultiQueryRetriever.from_llm(
    # vector_db.as_retriever(),
    load_vector_db.as_retriever(),
    llm,
    prompt=QUERY_PROMPT,
)
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
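If you are ever unsure which collection name a persist_directory actually contains, you can inspect it with the underlying chromadb client. A small sketch, assuming a chromadb version where PersistentClient is available; the path is whatever you passed as persist_directory:
import chromadb
# Open the persisted store directly and list its collections.
# Chroma(...) silently opens a *different*, empty collection (the
# default "langchain") if the name does not match the one used at save time.
client = chromadb.PersistentClient(path="chroma_vector_db")
print(client.list_collections())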
Wonderful demonstration! Thank you!