NoIndexException: Index not found when initializing Chroma from a persisted directory
I am running into a problem using the Chroma vector store with a persisted index. I have already loaded a document, created embeddings for it, and saved those embeddings with Chroma. The script ran perfectly with the LLM and created the expected files in the persistence directory (.chroma\index):
chroma-collections.parquet
chroma-embeddings.parquet
id_to_uuid_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl
index_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.bin
index_metadata_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl
uuid_to_id_3508d87c-12d1-4bbe-ae7f-69a0ec3c6616.pkl
However, when I initialize a Chroma instance from that persist_directory to reuse the previously saved embeddings, I get a NoIndexException: "Index not found, please create an instance before querying".
Here is a snippet of the code I am using in a Jupyter notebook:
# Section 1
import os
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain
# Load environment variables
%reload_ext dotenv
%dotenv info.env
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# Section 2 - Initialize Chroma without an embedding function
persist_directory = '.chroma\\index'
db = Chroma(persist_directory=persist_directory)
# Section 3
# Load chat model and question answering chain
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=.5, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")
# Section 4
# Run the chain on a sample query
query = "The Question - Can you also cite the information you give after your answer?"
docs = db.similarity_search(query)
response = chain.run(input_documents=docs, question=query)
print(response)
Please help me understand what might be causing this problem and suggest possible solutions. I am also curious whether these pre-existing embeddings can be reused without incurring the cost of generating the Ada embeddings again, since the documents I am working with have many pages. Thanks in advance!
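For what it's worth, here is a minimal, self-contained sketch of why a persisted vector index still needs an embedding function when it is reloaded. This is not the Chroma API; ToyStore and toy_embed are made-up names. The files on disk hold only vectors and IDs, so the store cannot embed a new query by itself, which mirrors the usual fix of passing an embedding_function when constructing Chroma from a persist_directory:

```python
# Toy vector store (NOT Chroma): persists embeddings to disk, but queries
# still require the embedding function to be supplied on reload.
import math
import pickle
import tempfile
from pathlib import Path

def toy_embed(text: str) -> list[float]:
    # Hypothetical stand-in for a real embedding model (e.g. OpenAI Ada):
    # a normalized letter-frequency vector.
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

class ToyStore:
    def __init__(self, persist_directory: str, embedding_function):
        self.path = Path(persist_directory) / "index.pkl"
        self.embed = embedding_function
        self.docs: list[tuple[str, list[float]]] = []
        if self.path.exists():
            # Reload: only (text, vector) pairs come back from disk.
            self.docs = pickle.loads(self.path.read_bytes())

    def add(self, text: str):
        self.docs.append((text, self.embed(text)))

    def persist(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_bytes(pickle.dumps(self.docs))

    def similarity_search(self, query: str) -> str:
        # The query must be embedded at search time -- this is why the
        # store is useless without an embedding function after reload.
        qv = self.embed(query)
        return max(self.docs, key=lambda d: sum(a * b for a, b in zip(d[1], qv)))[0]

with tempfile.TemporaryDirectory() as tmp:
    store = ToyStore(tmp, toy_embed)
    store.add("chroma stores embeddings")
    store.add("langchain chains llms")
    store.persist()

    # Reload from disk: vectors are reused, no re-embedding of documents,
    # but the embedding function is still a required constructor argument.
    reloaded = ToyStore(tmp, toy_embed)
    print(reloaded.similarity_search("embeddings"))  # -> "chroma stores embeddings"
```

The design point is that persistence saves the document vectors (so they are not paid for again), while the embedding function remains a runtime dependency for embedding incoming queries.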
@murasz would you be open to sharing some code and/or data with our team so we can reproduce this and help debug? if so, please email us at [email protected] (that email goes to me). thanks!
Hi @jeffchuber, I answered the question in the LangChain repo, so I'm copying the same message here. Please let me know if anything else is needed; I'd be glad to cooperate:
Hi Jeff, I noticed your message on Chroma's repository. If you need anything else beyond the code I shared, please let me know. I can also send you the generated .bin and .pkl files via email.
Just to update you: the first script, which converted the PDFs to text, split them into chunks, and created embeddings, worked perfectly, and I was able to get answers without any problems. However, when I tried to reuse the already-embedded data, I encountered the error "NoIndexException: Index not found when initializing Chroma from a persisted directory."
My main concern is that I don't want to re-embed the data every time before answering a question: embedding large PDFs repeatedly would be expensive in the long run and not cost-effective for my purposes. I'm open to any solutions you may have.
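One common pattern for that cost concern is to cache each chunk's embedding on disk, keyed by a hash of its text, so unchanged chunks are never re-embedded. This is a hedged stdlib sketch, not Chroma or LangChain code; cached_embed and fake_embed are hypothetical names, with fake_embed standing in for a paid call such as OpenAI's Ada embeddings:

```python
# Cache embeddings on disk keyed by a hash of the text, so re-running the
# pipeline never re-embeds (and re-bills for) chunks it has already seen.
import hashlib
import json
import tempfile
from pathlib import Path

def cached_embed(text: str, embed_fn, cache_dir: str):
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    entry = cache / f"{key}.json"
    if entry.exists():
        return json.loads(entry.read_text())  # cache hit: no API cost
    vector = embed_fn(text)                   # cache miss: pay exactly once
    entry.write_text(json.dumps(vector))
    return vector

calls = []
def fake_embed(text):
    # Hypothetical stand-in for the real (billed) embedding model.
    calls.append(text)
    return [float(len(text))]

cache_dir = tempfile.mkdtemp()
v1 = cached_embed("same chunk", fake_embed, cache_dir)
v2 = cached_embed("same chunk", fake_embed, cache_dir)  # served from cache
print(len(calls))  # -> 1: the paid function ran only once
```

Hashing the chunk text means the cache stays valid across runs and only genuinely new or edited chunks trigger a billed embedding call.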
Let me know if you need any additional information or if you have any suggestions.
@murasz were you able to solve this? closing this issue, but happy to re-open
Sure, you can. Thanks for your support.