
Saving and loading embeddings from Chroma

Open Lufffya opened this issue 2 years ago • 4 comments

Issue with current documentation:

# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

# load the document and split it into chunks
loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)
# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()
docs = db.similarity_search(query)

# load from disk
db3 = Chroma(persist_directory="./chroma_db")
docs = db.similarity_search(query)
print(docs[0].page_content)

Idea or request for content:

In the above code, I find this part difficult to understand:

# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()
docs = db.similarity_search(query)

# load from disk
db3 = Chroma(persist_directory="./chroma_db")
docs = db.similarity_search(query)
print(docs[0].page_content)

Although db2 and db3 do demonstrate saving and loading with Chroma, the two occurrences of docs = db.similarity_search(query) have nothing to do with saving and loading; they still search the original db. Is this an error?

Lufffya avatar Jul 05 '23 06:07 Lufffya

I feel the question makes a lot of sense. Would you expect something like this?


# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

# load the document and split it into chunks
loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

# save to disk
# Note: The following code is demonstrating how to save the Chroma database to disk.
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()

# load from disk
# Note: The following code is demonstrating how to load the Chroma database from disk.
db3 = Chroma(persist_directory="./chroma_db")

# perform a similarity search on the loaded database
# Note: This is to demonstrate that the loaded database is functioning correctly.
docs = db3.similarity_search(query)
print(docs[0].page_content)

rjarun8 avatar Jul 05 '23 08:07 rjarun8

I tested it; you need to pass the embedding_function parameter to Chroma, like this: Chroma(persist_directory="./chroma_db", embedding_function=embedding_function). Then it runs.
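
A minimal sketch of the corrected load-and-query step (assuming the same embedding_function and query defined earlier in the example):

# load from disk, passing the embedding function so Chroma can embed queries
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)

# query the loaded store directly
docs = db3.similarity_search(query)
print(docs[0].page_content)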

Lufffya avatar Jul 05 '23 09:07 Lufffya

Yes, I have a similar question: when I load vectors from the db, why do I still need to pass an embedding parameter? docSearch = Chroma(persist_directory="D:/vector_store", embedding_function=embeddings)

I think the embedding_function parameter should be unnecessary, but when I run the code it fails without it. Can anyone explain why?

chenzhiang669 avatar Jul 07 '23 03:07 chenzhiang669

I had the same issue here. Thanks @Lufffya !

But it is very strange that you have to pass the embedding model when loading the Chroma database, rather than with the search query...

jenswilms avatar Jul 07 '23 19:07 jenswilms

Yes, I have a similar question: I just want to search the existing indexed docs, so why do I need to pass the embedding_function?

ajasingh avatar Jul 15 '23 06:07 ajasingh

Yes, I have a similar question: I just want to search the existing indexed docs, so why do I need to pass the embedding_function?

Because the search input needs to be run through embedding_function to get its embedding before it can be compared, I guess.
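
Roughly, a sketch of what happens under the hood (embed_query is the standard LangChain Embeddings method; the comparison step is simplified here):

# the query string must itself be embedded before any vector comparison
query_vector = embedding_function.embed_query(query)
# Chroma then compares query_vector against the stored document vectors,
# which is why the loaded store must know which embedding function to use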

Lufffya avatar Aug 15 '23 10:08 Lufffya

The line of code db2.persist() is missing from the current LangChain documentation (https://python.langchain.com/docs/integrations/vectorstores/chroma).
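
For reference, a sketch of the save step with the explicit persist call, mirroring the example earlier in the thread:

# build the store and flush it to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()  # this is the line missing from the docs page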


baidurja avatar Aug 23 '23 14:08 baidurja

Hi, @Lufffya! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you raised was about confusion in the documentation's code snippet for saving and loading embeddings from Chroma: after saving and loading, the example still searched the original db, and you asked whether this was an error. The issue appears to be resolved by passing the embedding_function parameter to Chroma when loading, which you tested and confirmed. Other users had similar questions and confirmed that passing embedding_function is necessary. It was also pointed out that the line db2.persist() is missing from the documentation.

Now, we would like to know if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or the issue will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository! Let us know if you have any further questions or concerns.

dosubot[bot] avatar Nov 22 '23 16:11 dosubot[bot]

When using vectorstore = Chroma(persist_directory=sys.argv[1]+"-db", embedding_function=emb) with emb = embeddings.ollama.OllamaEmbeddings(model='nomic-embed-text'), retriever = vectorstore.as_retriever(), and

chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | local_llm
        | StrOutputParser()
    )

the model responds that the context is empty.

If on the other hand I create the vectorstore using

vectorstore = Chroma.from_documents(
                            documents=documents,
                            collection_name=collection_name,
                            embedding=emb,
                            persist_directory=sys.argv[1]+"-db",
                        )

the model gets a context.

How come?

Bardo-Konrad avatar Mar 10 '24 14:03 Bardo-Konrad

Quoting @Bardo-Konrad above: when using vectorstore = Chroma(persist_directory=sys.argv[1]+"-db", embedding_function=emb), the model responds that the context is empty, but creating the vectorstore with Chroma.from_documents(...) gives the model a context. How come?

I am also facing the same issue! Any idea why it is so?

nidhin-krishnakumar avatar Apr 09 '24 07:04 nidhin-krishnakumar

Quoting @Bardo-Konrad above: when using vectorstore = Chroma(persist_directory=sys.argv[1]+"-db", embedding_function=emb), the model responds that the context is empty, but creating the vectorstore with Chroma.from_documents(...) gives the model a context. How come?

Hey @Bardo-Konrad, I'm facing the same issue. Can you please let me know if it is working for you now?

puliviswanath avatar May 22 '24 13:05 puliviswanath

Quoting the exchange above: @Bardo-Konrad found that loading with Chroma(persist_directory=..., embedding_function=emb) left the model with an empty context while Chroma.from_documents(...) worked, and @nidhin-krishnakumar hit the same issue.

Hey @nidhin-krishnakumar, you have to include the collection name when loading from disk. Here's a working version:

# Saving the data
vector_db_dir = "chroma_vector_db"
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag",
    persist_directory=vector_db_dir,
)

# Loading the data: same persist directory, same embedding model,
# and crucially the same collection name used when saving
load_vector_db = Chroma(
    persist_directory="chroma_vector_db",
    embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="local-rag",
)

retriever = MultiQueryRetriever.from_llm(
    # vector_db.as_retriever(),
    load_vector_db.as_retriever(),
    llm,
    prompt=QUERY_PROMPT,
)
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
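
A quick way to sanity-check what actually got loaded (a sketch; _collection is an internal attribute of the LangChain Chroma wrapper, so treat it as an implementation detail):

# count the embeddings in the loaded collection; 0 usually means a
# collection-name mismatch like the one described above
print(load_vector_db._collection.count())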


puliviswanath avatar May 23 '24 07:05 puliviswanath

(Quoting @rjarun8's revised example above.)

Wonderful demonstration! Thank you!

SeanWu089 avatar Jul 01 '24 19:07 SeanWu089