False "Index not found" messages

Open francisjervis opened this issue 1 year ago • 6 comments

System Info

0.1173

Who can help?

No response

Information

  • [ ] The official example notebooks/scripts
  • [X] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [ ] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [ ] Document Loaders
  • [X] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [ ] Tools / Toolkits
  • [ ] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

1: Create Chroma vectorstore
2: Persist vectorstore
3: Use vectorstore once
4: Vectorstore no longer works, says "Index not found"

Expected behavior

It works.

francisjervis avatar May 18 '23 21:05 francisjervis

between step 3 and 4 are you ending one process and starting a new one or is this two sequential calls to the vectorstore

dev2049 avatar May 19 '23 00:05 dev2049

New process. What's messed up is it's a new process between 2 and 3 too lol, the vectorstore exists on the disk but will not load.

francisjervis avatar May 19 '23 04:05 francisjervis

hm i was able to reproduce, and could only fix by specifying anonymized_telemetry=False in client settings (inspired by https://github.com/hwchase17/langchain/issues/2491#issuecomment-1499082189 from @sergerdn)

import chromadb
from langchain.vectorstores import Chroma

# docs and embeddings are assumed to be defined already
db = Chroma.from_documents(docs, embeddings, persist_directory=".chroma_db")
db.persist()

client_settings = chromadb.config.Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=".chroma_db",
    anonymized_telemetry=False,
)
load_db = Chroma(embedding_function=embeddings, client_settings=client_settings, persist_directory=".chroma_db")

@atroyn is this expected behavior? if so we can add docs on how to properly load persisted db. could also probs add a load class method to Chroma for convenience that handles any non-obvious configuration
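A convenience loader along those lines might look roughly like the sketch below. It is not an existing LangChain API: `load_chroma` and `persisted_client_kwargs` are hypothetical names, and the settings are simply the ones from the snippet above (chromadb 0.3.x with the duckdb+parquet backend assumed).

```python
from typing import Any, Dict


def persisted_client_kwargs(persist_directory: str) -> Dict[str, Any]:
    """Collect the client settings used in the snippet above in one place."""
    return {
        "chroma_db_impl": "duckdb+parquet",
        "persist_directory": persist_directory,
        "anonymized_telemetry": False,
    }


def load_chroma(persist_directory: str, embedding_function: Any):
    """Hypothetical helper: reopen a persisted Chroma store with explicit settings."""
    # Imported lazily so this module can be imported without the packages installed.
    from chromadb.config import Settings
    from langchain.vectorstores import Chroma

    settings = Settings(**persisted_client_kwargs(persist_directory))
    return Chroma(
        embedding_function=embedding_function,
        client_settings=settings,
        persist_directory=persist_directory,
    )
```

Usage would then be `load_db = load_chroma(".chroma_db", embeddings)`, which keeps the non-obvious settings out of every script that reopens the store.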

dev2049 avatar May 19 '23 19:05 dev2049

related to #2490, #2491, #3011

dev2049 avatar May 19 '23 19:05 dev2049

oh i take it back, this actually does work for me

db = Chroma.from_documents(docs, embeddings, persist_directory=".chroma_db")
db.persist()
load_db = Chroma(embedding_function=embeddings, persist_directory=".chroma_db")
load_db.similarity_search_with_score("foo bar")

@francisjervis could you share a snippet that i can use to reproduce?

dev2049 avatar May 19 '23 19:05 dev2049

I cannot now even get it to run the first time.

Step 1: create index

from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter

persist_directory = 'chroma_ca_sources'

loader = DirectoryLoader('./catenantrightssources', glob="**/*.txt", loader_cls=TextLoader, show_progress=True)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=500,
    chunk_overlap=100,
    length_function=len,
)

split = text_splitter.split_documents(docs)
for s in split:
    print(s)

embedding = OpenAIEmbeddings(openai_api_key="sk-.........", model="text-embedding-ada-002")
vectordb = Chroma.from_documents(documents=split, embedding=embedding, persist_directory=persist_directory)
vectordb.persist()

Step 2: query

from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings

persist_directory = 'chroma_ca_sources'
embedding = OpenAIEmbeddings(openai_api_key="sk-...", model="text-embedding-ada-002")

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

retriever = vectordb.as_retriever()

query = "what is a lease?"

result = retriever.get_relevant_documents(query=query)

print(result)

The second script fails with raise NoIndexException("Index not found, please create an instance before querying"). No, this is not a path error: it fails with absolute paths too.
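One way to rule out a path error more firmly is to list what the first script actually wrote to the persist directory before the second script tries to load it. The sketch below uses only the standard library; `describe_persist_dir` is a hypothetical helper, and it makes no assumption about which files a complete store should contain, since that depends on the chromadb version.

```python
from pathlib import Path
from typing import List


def describe_persist_dir(persist_directory: str) -> List[str]:
    """Return the relative paths of all files under the persist directory.

    An empty list means the directory is missing or contains no files.
    """
    root = Path(persist_directory)
    if not root.is_dir():
        return []
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    # Directory name taken from the scripts above.
    for name in describe_persist_dir("chroma_ca_sources"):
        print(name)
```

If this prints nothing after Step 1 has run, the store never reached disk, which points at persistence rather than at the query script.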

francisjervis avatar May 20 '23 21:05 francisjervis

For what it's worth, I was seeing the same message (Index not found) even with an empty ChromaDB. I upgraded chromadb from 0.3.23 to 0.3.25 and that fixed the error for me. This commit is probably related.
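A quick way to check whether an environment is still on the older release is to compare the installed version against 0.3.25. This is a minimal sketch using only the standard library; the helper names are made up for illustration, and the parser assumes plain dotted-integer versions such as the two mentioned above.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.3.25' into (0, 3, 25); assumes plain dotted integers."""
    return tuple(int(part) for part in text.split("."))


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def needs_upgrade(current: str, minimum: str = "0.3.25") -> bool:
    """True when the current version is older than the minimum."""
    return parse_version(current) < parse_version(minimum)


if __name__ == "__main__":
    version = installed_version("chromadb")
    if version is None:
        print("chromadb is not installed")
    elif needs_upgrade(version):
        print(f"chromadb {version} is older than 0.3.25; consider upgrading")
    else:
        print(f"chromadb {version} is at or past 0.3.25")
```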

skiyooka avatar May 24 '23 18:05 skiyooka

Hi, @francisjervis. I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue is about false "Index not found" messages when using a persisted Chroma vector store in LangChain. You provided steps to reproduce the issue and expected the vector store to work without any errors. In the comments, there was a discussion about the steps to reproduce the issue and potential fixes, such as specifying anonymized_telemetry=False in client settings or upgrading chromadb from 0.3.23 to 0.3.25.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain repository. Let us know if you have any further questions or concerns.

dosubot[bot] avatar Sep 12 '23 16:09 dosubot[bot]