
"NoIndexException: Index not found when initializing Chroma from a persisted directory"

Open murasz opened this issue 2 years ago • 33 comments

I am facing a problem when trying to use the Chroma vector store with a persisted index. I have already loaded a document, created embeddings for it, and saved those embeddings in Chroma. The script ran perfectly with the LLM and created the necessary files in the persistence directory (.chroma\index). The files include:

chroma-collections.parquet
chroma-embeddings.parquet
id_to_uuid_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl
index_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.bin
index_metadata_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl
uuid_to_id_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl

However, when I try to initialize the Chroma instance using the persist_directory to utilize the previously saved embeddings, I encounter a NoIndexException error, stating "Index not found, please create an instance before querying".

Here is a snippet of the code I am using in a Jupyter notebook:

# Section 1
import os
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

# Load environment variables
%reload_ext dotenv
%dotenv info.env
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Section 2 - Initialize Chroma without an embedding function
persist_directory = '.chroma\\index'
db = Chroma(persist_directory=persist_directory)

# Section 3
# Load chat model and question answering chain
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=.5, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

# Section 4
# Run the chain on a sample query
query = "The Question - Can you also cite the information you give after your answer?"
docs = db.similarity_search(query)
response = chain.run(input_documents=docs, question=query)
print(response)

Please help me understand what might be causing this problem and suggest possible solutions. Additionally, I am curious if these pre-existing embeddings could be reused without incurring the same cost for generating Ada embeddings again, as the documents I am working with have lots of pages. Thanks in advance!

murasz avatar Apr 17 '23 10:04 murasz

You may have to call db.persist() after db = Chroma(...) in Section 2.

Yes, once the embeddings are stored you can query against them from then on; only the query itself has to be embedded each time.
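A minimal sketch of that round trip, assuming OpenAI embeddings and illustrative paths (texts stands for the split documents from the original script):

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()

# build once: embed the documents and flush the index files to disk
db = Chroma.from_documents(texts, embeddings, persist_directory=".chroma/index")
db.persist()

# later: reopen the same directory with the same embedding function;
# the stored vectors are reused as-is, only new queries get embedded
db = Chroma(persist_directory=".chroma/index", embedding_function=embeddings)
docs = db.similarity_search("some query")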

bkamapantula avatar Apr 17 '23 12:04 bkamapantula

Still the same error: NoIndexException: Index not found, please create an instance before querying

In Section 2, after loading the embeddings, the log reads:

Using embedded DuckDB with persistence: data will be stored in: .chroma\index
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction

murasz avatar Apr 17 '23 14:04 murasz

I'm facing the same issue, even after using db.persist() and creating the index files. Is there some issue with the LangChain wrapper? A few things that might help others get to the problem quickly:

db._persist_directory  # points to the correct directory
db.similarity_search(query) # throws the index error

Additionally, if there is no db and I just pass some random string to persist_directory, db._persist_directory still points to that string. It's probably not loading the directory at all.
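A quick diagnostic to rule out a wrong path before initializing (plain Python, not part of the LangChain API):

import os

persist_directory = ".chroma/index"
# confirm the directory Chroma will read actually exists and holds index files
assert os.path.isdir(persist_directory), f"missing directory: {persist_directory}"
print(os.listdir(persist_directory))  # expect chroma-*.parquet plus index_*.bin/.pkl files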

KeshavSingh29 avatar Apr 18 '23 08:04 KeshavSingh29

That's quite awkward. I think the problem is directly linked to LangChain's Chroma wrapper. Perhaps something changed in the source code. @hwchase17 or @jeffchuber, what do you think?

murasz avatar Apr 18 '23 15:04 murasz

I encountered a similar problem.

from langchain.chains import RetrievalQAWithSourcesChain
from langchain.llms import OpenAI

db = Chroma(persist_directory="./vdb", embedding_function=embeddings)
retriever = db.as_retriever()
qa = RetrievalQAWithSourcesChain.from_chain_type(llm=OpenAI(), retriever=retriever)
result = qa({"question": question}, return_only_outputs=True)

'./vdb' has a previously persisted db. The error is:

File "/Users/bwu/env/openai/lib/python3.10/site-packages/chromadb/db/index/hnswlib.py", line 223, in get_nearest_neighbors
    raise NoIndexException("Index not found, please create an instance before querying")
chromadb.errors.NoIndexException: Index not found, please create an instance before querying

bowu avatar Apr 19 '23 05:04 bowu

Adding the embedding_function to the Chroma call worked for me: db = Chroma(persist_directory=persist_directory, embedding_function=embeddings). If you are using your own collection, however, you might need to manually assign the collection to the db, since it otherwise seems to use the default "langchain" collection or create a duplicate one: db._collection = db._client.get_collection("custom-collection")

I'm new to Python, so please verify.
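Put together as a runnable sketch (the imports, persist directory, and collection name are my assumptions, and note that _collection and _client are private attributes of the wrapper):

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()
db = Chroma(persist_directory=".chroma/index", embedding_function=embeddings)

# if the index was built under a non-default collection name, point the
# wrapper at it instead of the default "langchain" collection
db._collection = db._client.get_collection("custom-collection")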

KurtFeynmanGodel avatar Apr 19 '23 06:04 KurtFeynmanGodel

Hi everyone, Jeff from Chroma here. I will be looking into this tomorrow morning and will report back.

jeffchuber avatar Apr 19 '23 06:04 jeffchuber


Hi Jeff, I noticed your message on the ChromaDB repository. If you need anything beyond the code I shared, please let me know. I can also send you the generated .bin and .pkl files via email.

Just to update you, the first script, which converted PDFs to text files, divided them into chunks, and created embeddings, worked perfectly; I was able to get answers without any problems. However, when I tried to reuse the already-embedded data, I encountered the error "NoIndexException: Index not found when initializing Chroma from a persisted directory."

My main concern is that I don't want to embed the data each time before answering a question. The reason is that processing large PDFs could end up costing a lot in the long term and would not be cost-effective for my purposes. Therefore, I'm open to any solutions you may have.

Let me know if you need any additional information or if you have any suggestions.

murasz avatar Apr 19 '23 11:04 murasz

@KurtFeynmanGodel As above, my main concern is that I don't want to re-embed the data each time before answering a question, since processing large PDFs would end up costing a lot in the long term. I'm open to any solutions you may have.

murasz avatar Apr 19 '23 11:04 murasz

Here is a trivial example based on the langchain example vectorstore.ipynb:

test.py

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter

from langchain.document_loaders import TextLoader
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
state_of_union_store = Chroma.from_documents(texts, embeddings, collection_name="state-of-union", persist_directory=".chromadb/")

val = state_of_union_store.similarity_search("the", k=2)
print(val)

test2.py

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings()
state_of_union_store = Chroma(collection_name="state-of-union", persist_directory=".chromadb/", embedding_function=embeddings)

val = state_of_union_store.similarity_search("the", k=2)
print(val)

This works on my end. Can others try this?

jeffchuber avatar Apr 19 '23 18:04 jeffchuber

Dear @jeffchuber, I am still facing the same issue even after making the changes you suggested. I would like to send you the .ipynb files via email along with a brief description of the process, so you can review them and hopefully find a solution that works reliably. Once we have one, we can share it with the community. Thank you for your help.

murasz avatar Apr 19 '23 22:04 murasz

ok sounds great! please send that along via email or discord DM

jeffchuber avatar Apr 19 '23 22:04 jeffchuber

OK, email sent ([email protected]). Thank you so much, @jeffchuber!

murasz avatar Apr 19 '23 22:04 murasz

Hi everyone, if you are using a notebook you need to call client.persist() manually, because garbage collection in a notebook does not call the __del__ lifecycle method on the object.

I just added a PR here to improve our docs on this. https://github.com/chroma-core/docs/pull/44
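In LangChain terms the same advice looks roughly like this (a minimal sketch; texts and embeddings are as in the earlier examples):

from langchain.vectorstores import Chroma

db = Chroma.from_documents(texts, embeddings, persist_directory=".chromadb/")
db.persist()  # explicit: a notebook kernel keeps the object alive, so __del__ never runs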

jeffchuber avatar Apr 20 '23 04:04 jeffchuber

@jeffchuber Thanks for this. It solved the issue for me.

KeshavSingh29 avatar Apr 20 '23 04:04 KeshavSingh29

Thank you @jeffchuber for the proposed solution. Unfortunately, it gave me the same error. I've emailed you regarding this issue.

Let me share the Jupyter notebook codebase here again:

# Section 1
import os
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings.openai import OpenAIEmbeddings

# Load environment variables
%reload_ext dotenv
%dotenv info.env
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Section 2 - Initialize Chroma with the persisted directory and the embedding function
persist_directory = '.chromadb/'
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
db = Chroma(collection_name="letstry", persist_directory=persist_directory, embedding_function=embeddings)
db.persist()  # I added this line

# Section 3
# Load chat model and question answering chain
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=.6, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

# Section 4
# Run the chain on a sample query
query = "What is Harga transport?"
docs = db.similarity_search(query)
response = chain.run(input_documents=docs, question=query)
print(response)

murasz avatar Apr 20 '23 09:04 murasz

@murasz I started off with the same problem of no index found, but I cannot remember how I got past it. Can you make sure your persist directory is correct? Now I encounter a different problem: when initializing from the index, if I do not pass in an embedding function, it warns

No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction

It then initializes the database, but when trying to retrieve, it reports a dimensionality mismatch in the embeddings:

Dimensionality of (384) does not match with index dimensionality (1024)

So it seems hnswlib.py, which I assume is responsible for the dimensionality check, uses the defaulted embedding_function to get an embedding dimension, which it checks against the embedding database as a constraint before retrieval. Without having had a thorough look at the code, it appears that passing the embedding_function when restoring from an index does not recompute embeddings for the collection (which I understand is your major concern) but merely checks dimensionality as a constraint. It certainly seems that way from how fast the database gets restored when I initialize with the embedding_function.

@jeffchuber Any insights?

EDIT: Now that I think about it, the embedding_function would be used to compute the embedding of the query string so the similarity search (cosine or Euclidean distance, or whatever algorithm it uses) can run, and that is likely the cause of the dimensionality violation in my case: it falls back to the default SentenceTransformerEmbeddingFunction, which I assume produces 384-dimensional embeddings. This raises the question: if you restore a database without the embedding_function, how would duckdb compute the embedding of a query string to perform a similarity search? Perhaps you can manually set the embedding function after index initialization?

KurtFeynmanGodel avatar Apr 20 '23 12:04 KurtFeynmanGodel

@KurtFeynmanGodel yes - I can help with this.

Chroma relies on the user to tell it how to embed things. This is not currently stored within chroma, which is why you have to pass it to Chroma whenever you do create_collection, get_collection, or get_or_create_collection.

What is happening here is that Chroma uses the default embedding function (sentence-transformers) to embed your document/query at 384 dimensions. As this does not match the dimensionality of the embedding model you used (1024), it throws an error.
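To make the mismatch concrete, a sketch of the two cases (384 is the sentence-transformers default; the index dimensionality is whatever model built it):

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# mismatch: no embedding_function, so the query is embedded at 384 dims
# against an index that was built at a different dimensionality
db = Chroma(persist_directory=".chromadb/")

# match: pass the same embedding function that built the index
embeddings = OpenAIEmbeddings()
db = Chroma(persist_directory=".chromadb/", embedding_function=embeddings)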

jeffchuber avatar Apr 21 '23 15:04 jeffchuber

Hi @jeffchuber and team. Not sure if this issue is still being worked on or is considered resolved, but I'm running into the same issues as the rest of the thread.

I'm using an AWS-hosted Chroma instance with persistence, and I can store & retrieve successfully in the same notebook.

But when I switch to a new notebook (eventually to be replaced by a web app), I can instantiate the AWS Chroma instance but receive what appears to be a very similar issue:

Exception: {"error":"NoIndexException('Index not found, please create an instance before querying')"}

I use db.persist() as advised, but it still doesn't seem to work in a separate notebook.
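For context, connecting to a hosted Chroma server from LangChain looks roughly like this (host, port, and collection name are placeholders; this assumes the REST client settings from the chromadb versions of that era):

from chromadb.config import Settings
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

db = Chroma(
    collection_name="my-collection",        # must match the name used when storing
    embedding_function=OpenAIEmbeddings(),  # must match the model used when storing
    client_settings=Settings(
        chroma_api_impl="rest",
        chroma_server_host="<aws-host>",
        chroma_server_http_port="8000",
    ),
)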

saginawj avatar Apr 26 '23 19:04 saginawj

The same issue was resolved on my Ubuntu server by restarting my Python app :smile:

Valdanitooooo avatar Apr 28 '23 07:04 Valdanitooooo

Hi @jeffchuber @murasz, I found the issue. It lies in the uuid creation and fetching: the "index not found" error was only happening with a custom collection_name. The code uses uuid.uuid4() to create the uuids for the saved index files, which generates a new uuid every time it is called.

Reference files: chromadb/db/clickhouse.py, chromadb/db/duckdb.py

(screenshot of the uuid.uuid4() call in those files)

Instead, I changed the uuid mechanism when creating and fetching collections to something like the following, using the collection_name parameter as a seed, which generates a consistent uuid for a fixed collection_name:

rd = random.Random()
rd.seed(collection_name)
collection_uuid = uuid.UUID(int=rd.getrandbits(128))

This generates a consistent UUID for a collection based on the name passed in the params. Do the same in the get_collection_uuid_from_name() function as well.

I tried this in the various scenarios where it was failing for me, and it works in all of them.
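A standalone check that the seeded construction is deterministic (no Chroma needed):

import random
import uuid

def collection_uuid(collection_name: str) -> uuid.UUID:
    # same seed -> same 128-bit value -> same UUID on every call
    rd = random.Random()
    rd.seed(collection_name)
    return uuid.UUID(int=rd.getrandbits(128))

assert collection_uuid("letstry") == collection_uuid("letstry")
assert collection_uuid("letstry") != collection_uuid("other")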

klisfer avatar May 06 '23 14:05 klisfer

Same issue:

raise NoIndexException("Index not found, please create an instance before querying")
chromadb.errors.NoIndexException: Index not found, please create an instance before querying

gustavofelicidade avatar May 19 '23 16:05 gustavofelicidade

I am facing the same issue. I am using a notebook to create a persisted Chroma db instance locally and then copy the data over to the server. My file structure is:

(screenshot of the project file structure)

The file with the following code is at the same level as the db folder:

persist_directory = 'db'
embedding = OpenAIEmbeddings()
chroma_client = Chroma(persist_directory=persist_directory,
                       embedding_function=embedding)

What could be the bug here? My notebook also includes the following lines of code that load the data into the db, but I don't include them in the server file for obvious reasons. I am just trying to perform inference against the DB.

vectordb = Chroma.from_documents(documents=docs, embedding=embedding, persist_directory=persist_directory)
vectordb.persist()

dankolesnikov avatar May 29 '23 23:05 dankolesnikov

@dankolesnikov did you create the Chroma DB outside LangChain, and now you want to pass it to LangChain?

jeffchuber avatar May 30 '23 17:05 jeffchuber

@jeffchuber Sorry, I forgot to update my thread. I was able to resolve this error simply by upgrading Chroma! I am not facing it anymore, but I uncovered another one and created a ticket on Chroma's repo: https://github.com/chroma-core/chroma/issues/640

Please look into it if possible; I owe you and Anton a beer if it is something that I overlooked.

dankolesnikov avatar May 30 '23 17:05 dankolesnikov

@murasz can you try upgrading the chromadb/langchain packages? If it works for you, we can close this issue.

dankolesnikov avatar May 30 '23 17:05 dankolesnikov

@dankolesnikov great! glad to hear it. will take a look at 640

jeffchuber avatar May 30 '23 17:05 jeffchuber

Hey. Thanks for the information. Let me check it again today.

murasz avatar May 31 '23 06:05 murasz

Not sure if it worked for @murasz, but I can report that upgrading langchain (pip install --upgrade langchain) worked for me.

Badrul-Goomblepop avatar Jun 01 '23 15:06 Badrul-Goomblepop

I haven't checked it yet due to my intensive schedule this week. If you'd rather not wait until Monday for me to check, feel free to close the issue.

murasz avatar Jun 01 '23 17:06 murasz