"NoIndexException: Index not found when initializing Chroma from a persisted directory"
I am facing a problem when trying to use the Chroma vector store with a persisted index. I have already loaded a document, created embeddings for it, and saved those embeddings in Chroma. The script ran perfectly with the LLM and created the necessary files in the persistence directory (.chroma\index). The files include:
chroma-collections.parquet
chroma-embeddings.parquet
id_to_uuid_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl
index_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.bin
index_metadata_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl
uuid_to_id_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl
However, when I try to initialize the Chroma instance using the persist_directory to utilize the previously saved embeddings, I encounter a NoIndexException error, stating "Index not found, please create an instance before querying".
Here is a snippet of the code I am using in a Jupyter notebook:
# Section 1
import os
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain
# Load environment variables
%reload_ext dotenv
%dotenv info.env
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# Section 2 - Initialize Chroma without an embedding function
persist_directory = '.chroma\\index'
db = Chroma(persist_directory=persist_directory)
# Section 3
# Load chat model and question answering chain
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=.5, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")
# Section 4
# Run the chain on a sample query
query = "The Question - Can you also cite the information you give after your answer?"
docs = db.similarity_search(query)
response = chain.run(input_documents=docs, question=query)
print(response)
Please help me understand what might be causing this problem and suggest possible solutions. Additionally, I am curious if these pre-existing embeddings could be reused without incurring the same cost for generating Ada embeddings again, as the documents I am working with have lots of pages. Thanks in advance!
You may have to use db.persist() after db = Chroma(...) in Section 2.
Yes, once the embeddings are stored you can query against them on subsequent runs, although the query itself still has to be embedded each time.
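A minimal sketch of that flow (assuming OpenAI embeddings; texts is the chunk list from a text splitter, and the path is a placeholder):
# Rough sketch: build the store once, persist it, then reload it later.
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()

# First run: embed the documents and write the index to disk.
# `texts` is assumed to be the list of chunks from a text splitter.
db = Chroma.from_documents(texts, embeddings, persist_directory=".chroma/index")
db.persist()  # flush the index files to disk explicitly

# Later runs: reload from disk; only the query is embedded, not the documents.
db = Chroma(persist_directory=".chroma/index", embedding_function=embeddings)
docs = db.similarity_search("your question here")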
Still the same error: NoIndexException: Index not found, please create an instance before querying
In Section 2, after loading the embeddings:
Using embedded DuckDB with persistence: data will be stored in: .chroma\index
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
I'm facing the same issue, even after using db.persist() and creating the index files. Is there some issue with the LangChain wrapper? A few things that might help others get to the problem quickly:
db._persist_directory # points to the correct directory
db.similarity_search(query) # throws the index error
Additionally, if there is no db and I just pass some random string to persist_directory, db._persist_directory still points to that string.
It's probably not loading the directory at all.
That's quite odd. I think the problem is directly linked to LangChain's Chroma wrapper; maybe something changed in the source code. @hwchase17 or @jeffchuber, what do you think?
I encountered a similar problem.
db = Chroma(persist_directory="./vdb", embedding_function=embeddings)
retriever = db.as_retriever()
qa = RetrievalQAWithSourcesChain.from_chain_type(llm=OpenAI(), retriever=retriever)
result = qa({"question": question}, return_only_outputs=True)
'./vdb' has a previously persisted db. The error is:
File "/Users/bwu/env/openai/lib/python3.10/site-packages/chromadb/db/index/hnswlib.py", line 223, in get_nearest_neighbors
raise NoIndexException("Index not found, please create an instance before querying")
chromadb.errors.NoIndexException: Index not found, please create an instance before querying
Adding the embedding_function to the Chroma call worked for me:
db = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
If you are using your own collection, however, you might need to manually assign it to the db, as the wrapper seems to use the default "langchain" collection or create a duplicate:
db._collection = db._client.get_collection("custom-collection")
I'm new to Python, so please verify.
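For completeness, passing the collection name back through the wrapper should achieve the same thing (a sketch; "custom-collection" is just the placeholder name from above, and embeddings/persist_directory come from the earlier snippets):
# Sketch: reload a store that was created under a custom collection name.
db = Chroma(
    collection_name="custom-collection",  # must match the name used at creation time
    persist_directory=persist_directory,
    embedding_function=embeddings,
)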
Hi everyone, Jeff from Chroma here. I will be looking into this tomorrow morning and will report back.
Hi Jeff, I noticed your message on Chroma's repository. If you need anything beyond the code I shared, please let me know. I can also send you the generated .bin and .pkl files via email.
Just to update you: the first script, which converts the PDFs to text, splits them into chunks, and creates embeddings, worked perfectly, and I was able to get answers without any problems. However, when I tried to reuse the embedded data, I got the error "NoIndexException: Index not found when initializing Chroma from a persisted directory."
My main concern is that I don't want to embed the data each time before answering a question. The reason is that processing large PDFs could end up costing a lot in the long term and would not be cost-effective for my purposes. Therefore, I'm open to any solutions you may have.
Let me know if you need any additional information or if you have any suggestions.
Here is a trivial example based on the LangChain example vectorstore.ipynb:
test.py
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
state_of_union_store = Chroma.from_documents(texts, embeddings, collection_name="state-of-union", persist_directory=".chromadb/")
val = state_of_union_store.similarity_search("the", top_n=2)
print(val)
test2.py
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embeddings = OpenAIEmbeddings()
state_of_union_store = Chroma(collection_name="state-of-union", persist_directory=".chromadb/", embedding_function=embeddings)
val = state_of_union_store.similarity_search("the", top_n=2)
print(val)
This works on my end. Can others try this?
Dear @jeffchuber I am still facing the same issue even after making some changes to the code as per your previous suggestion. Therefore, I would like to send you the .ipynb files via email along with a brief description of the process. This way, you can review the files and possibly find a solution that will work reliably. Once we have a solution, we can share it with the community. Thank you for your help.
ok sounds great! please send that along via email or discord DM
OK, email sent ([email protected]). Thank you so much @jeffchuber!
Hi everyone, if you are using a notebook, you need to call client.persist() manually, because garbage collection in a notebook does not call the __del__ lifecycle method on the object.
I just added a PR here to improve our docs on this. https://github.com/chroma-core/docs/pull/44
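Concretely, the end of an ingestion cell in a notebook should look something like this (a minimal sketch; the wrapper's persist() calls through to the underlying client, as far as I can tell):
# Sketch: in a notebook, flush the index to disk explicitly,
# since the notebook may never garbage-collect `db` and call __del__.
db = Chroma.from_documents(texts, embeddings, persist_directory=".chromadb/")
db.persist()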
@jeffchuber Thanks for this. It solved the issue for me.
Thank you @jeffchuber for the proposed solution. Unfortunately, it gave me the same error. I've emailed you regarding this issue.
Let me share the Jupyter notebook codebase here again:
# Section 1
import os
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings.openai import OpenAIEmbeddings
# Load environment variables
%reload_ext dotenv
%dotenv info.env
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# Section 2 - Initialize Chroma with the persisted directory and the embedding function
persist_directory = '.chromadb/'
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
db = Chroma(collection_name="letstry", persist_directory=persist_directory, embedding_function=embeddings)
db.persist() # I added this line
# Section 3
# Load chat model and question answering chain
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=.6, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")
#Section 4
# Run the chain on a sample query
query = "What is Harga transport?"
docs = db.similarity_search(query)
response = chain.run(input_documents=docs, question=query)
print(response)
@murasz I started off with the same problem of no index found, but I cannot remember how I got past it. Can you make sure your persist directory is correct? Now I encounter a different problem: when initializing from the index, if I do not pass in an embedding function, it warns
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
and then it initializes the database, but when trying to retrieve it says there is a dimensionality mismatch in the embeddings:
Dimensionality of (384) does not match with index dimensionality (1024)
So it seems hnswlib.py, which I assume is responsible for the dimensionality check, uses the default embedding_function to get an embedding dimension, which it checks against the embedding database as a constraint before retrieval. Without having had a thorough look at the code, it seems that including the embedding_function when restoring from an index does not recompute embeddings for the collection (which I understand is your major concern) but rather just satisfies the dimensionality constraint. It certainly seems that way from how fast the database gets restored when I initialize with the embedding_function.
@jeffchuber Any insights?
EDIT: Now that I think about it, the embedding_function is used to compute the embedding of your query string so that the similarity search (cosine or Euclidean distance, or whatever algorithm it uses) can run, and that is likely the cause of the dimensionality violation in my case: the query is embedded with the default SentenceTransformerEmbeddingFunction, which I assume produces embeddings in a 384-dimensional space. This raises the question: if you restore a database without the embedding_function, how would it compute the embedding of a query string to perform a similarity search? Perhaps you can manually set the embedding function after index initialization?
@KurtFeynmanGodel yes - I can help with this.
Chroma relies on the user to tell it how to embed things. This is not currently stored within chroma, which is why you have to pass it to Chroma whenever you do create_collection, get_collection, or get_or_create_collection.
What is happening here is that Chroma is using the default embedding function (sentence-transformers) to embed your document/query at 384 dimensions. Since this does not match the dimensionality of the embedding model you used (1024), it throws an error.
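In other words, pass the same embedding function at load time that was used at creation time. A minimal sketch (OpenAIEmbeddings stands in for whatever model produced the 1024-dimensional vectors; texts is the chunked document list):
# Sketch: use the same embedding function on both sides.
embeddings = OpenAIEmbeddings()  # whatever model originally built the index

# Creation:
db = Chroma.from_documents(texts, embeddings, persist_directory=".chromadb/")
db.persist()

# Reload: omitting embedding_function here falls back to the 384-dimension
# sentence-transformers default and triggers the dimensionality error.
db = Chroma(persist_directory=".chromadb/", embedding_function=embeddings)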
Hi @jeffchuber and team. Not sure if this issue is still being worked on or if it's considered resolved, but I'm running into the same issues as the rest of the thread.
I'm using an AWS-hosted Chroma instance with persistence, and I can store & retrieve successfully in the same notebook.
But when I switch to a new notebook (eventually to be replaced by a web app), I can instantiate the AWS Chroma instance but receive what appears to be a very similar error:
Exception: {"error":"NoIndexException('Index not found, please create an instance before querying')"}
I use db.persist() as advised, but it still doesn't seem to work in a separate notebook.
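For reference, here is roughly how the connection is set up (a sketch; host, port, and collection name are placeholders, and I'm assuming the REST client settings of this chromadb version):
# Sketch: point the LangChain wrapper at the remote server rather than a
# local persist_directory; the server manages persistence on its side.
from chromadb.config import Settings
from langchain.vectorstores import Chroma

client_settings = Settings(
    chroma_api_impl="rest",
    chroma_server_host="my-chroma-host.example.com",  # placeholder
    chroma_server_http_port="8000",
)
db = Chroma(
    collection_name="my-collection",  # placeholder
    embedding_function=embeddings,
    client_settings=client_settings,
)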
The same issue was resolved on my Ubuntu server by restarting my Python app :smile:
Hi @jeffchuber @murasz, I found the issue: it is with the UUID creation and fetching. The "index not found" error was only happening with a custom collection_name. The saved index files get their UUIDs from uuid.uuid4(), which generates a new UUID every time it is called.
Reference files: chromadb/db/clickhouse.py, chromadb/db/duckdb.py

I instead changed the UUID mechanism for creating and fetching collections to something like the following, using the collection_name as a seed, which generates a consistent UUID for a fixed collection_name:
rd = random.Random()
rd.seed(collection_name)  # seed with the collection name for a stable stream
collection_uuid = uuid.UUID(int=rd.getrandbits(128))
This generates a consistent UUID for a collection based on the name passed in the params. Do the same thing in the get_collection_uuid_from_name() function as well.
I tried this in the various scenarios where it was failing for me, and this solution works in all of them.
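To see that this is deterministic, the same name always yields the same UUID (a quick runnable sketch; stable_collection_uuid is just an illustrative helper name):
# Quick check that seeding Random with the collection name is deterministic.
import random
import uuid

def stable_collection_uuid(collection_name: str) -> uuid.UUID:
    rd = random.Random()
    rd.seed(collection_name)  # same name -> same random stream
    return uuid.UUID(int=rd.getrandbits(128))

assert stable_collection_uuid("state-of-union") == stable_collection_uuid("state-of-union")
print(stable_collection_uuid("state-of-union"))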
Same issue:
raise NoIndexException("Index not found, please create an instance before querying")
chromadb.errors.NoIndexException: Index not found, please create an instance before querying
I am facing the same issue. I am using a notebook to create a persisted Chroma DB instance locally and then copying the data over to the server. The file with the following code is on the same level as the db folder:
persist_directory = 'db'
embedding = OpenAIEmbeddings()
chroma_client = Chroma(persist_directory=persist_directory,
embedding_function=embedding)
What could be the bug here? My notebook also includes the following lines of code that load the data into the db, but I don't include them in the server file, for obvious reasons. I am just trying to perform inference against the DB:
vectordb = Chroma.from_documents(documents=docs, embedding=embedding, persist_directory=persist_directory)
vectordb.persist()
@dankolesnikov did you create the Chroma DB outside LangChain, and now you want to pass it to LangChain?
@jeffchuber Sorry I forgot to update my thread. I was able to resolve this error simply by upgrading chroma! I am not facing this error anymore but I uncovered another one and created a ticket on chroma's repo: https://github.com/chroma-core/chroma/issues/640
Please look into it if possible, I owe you and Anton a beer if it is something that I overlooked.
@murasz can you try upgrading chromadb/langchain packages? if it works for you we could close this issue
@dankolesnikov great! glad to hear it. will take a look at 640
Hey. Thanks for the information. Let me check it again today.
Not sure if it worked for @murasz, but I can report that upgrading langchain (pip install --upgrade langchain) worked for me.
I haven't checked it yet due to my intensive schedule this week. If you'd rather not wait until Monday for me to verify, feel free to close the topic.