haystack-core-integrations

Reusing ChromaDocumentStore from disk throws an error

Open savank7 opened this issue 8 months ago • 2 comments

I've stored a ChromaDocumentStore locally using store.py, and it works as expected: the DB is created and persisted.

However, when I try to reuse this persisted ChromaDocumentStore in query.py, I encounter an error.

store.py - Used to create and persist the ChromaDocumentStore

import os
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter

from haystack_integrations.document_stores.chroma import ChromaDocumentStore

file_paths = ["data" / Path(name) for name in os.listdir("data")]

# persist_path makes Chroma write the collection to disk so query.py can reload it later
document_store = ChromaDocumentStore(persist_path="./chroma_db_test", collection_name="my_documents")

indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})

print('done')

query.py - Trying to use the persisted ChromaDocumentStore

from haystack import Pipeline
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage

from haystack_integrations.document_stores.chroma import ChromaDocumentStore

import os

os.environ["OPENAI_API_KEY"] = "my-key"

# Load the document store from persisted DB
# This prevents Chroma from wiping the existing DB and recreating it
document_store = ChromaDocumentStore(persist_path="./chroma_db_test", collection_name="my_documents")

prompt = [
    ChatMessage.from_user(
      """
      According to the contents of this website:
      {% for document in documents %}
        {{document.content}}
      {% endfor %}
      Answer the given question: {{query}}
      Answer:
      """
    )
]

prompt_builder = ChatPromptBuilder(template=prompt)
llm = OpenAIChatGenerator()
retriever = ChromaQueryTextRetriever(document_store)

querying = Pipeline()
querying.add_component("retriever", retriever)
querying.add_component("prompt_builder", prompt_builder)
querying.add_component("llm", llm)

querying.connect("retriever.documents", "prompt_builder.documents")
querying.connect("prompt_builder", "llm")

query = "How to apply discount before tax in POS ?"
results = querying.run(data={"retriever": {"query": query},
                        "prompt_builder": {"query": query}})

print(results["llm"]["replies"][0].text)

Error from the query.py code

haystack.core.errors.PipelineRuntimeError: The following component failed to run:
Component name: 'retriever'
Component type: 'ChromaQueryTextRetriever'
Error: Collection [my_documents] already exists

🔍 Problem

I want to reuse the existing vector DB (chroma_db_test) without recreating it every time. However, the query.py script throws an error when trying to load the stored ChromaDocumentStore.

💬 Request

Can you please help me correctly load and reuse the existing Chroma vector DB? I want to avoid re-indexing or wiping the DB each time I run the query pipeline.

Thanks in advance!

savank7 (Apr 07 '25 10:04)
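
The error text above reads like a strict "create" call issued for a collection name that already exists on disk, although the thread does not confirm the exact cause. The following is only a toy sketch of that distinction: `ToyClient`, `create`, and `get_or_create` are hypothetical names, a plain dict stands in for the persisted DB, and none of this is Chroma's or Haystack's actual code.

```python
# Toy model only: a dict stands in for the collections persisted on disk.
# All names here are hypothetical and exist purely for illustration.
class CollectionExistsError(Exception):
    """Raised when creating a collection whose name is already taken."""


class ToyClient:
    def __init__(self, persisted=None):
        # Collections that a previous run (e.g. store.py) already wrote.
        self.collections = dict(persisted or {})

    def create(self, name):
        # Strict create: refuses to reuse an existing name.
        if name in self.collections:
            raise CollectionExistsError(f"Collection [{name}] already exists")
        self.collections[name] = []
        return self.collections[name]

    def get_or_create(self, name):
        # Reuse-friendly variant: returns existing contents untouched,
        # and only creates an empty collection when the name is new.
        return self.collections.setdefault(name, [])


# Simulate query.py starting after store.py has already persisted documents.
client = ToyClient(persisted={"my_documents": ["doc-1", "doc-2"]})

try:
    client.create("my_documents")
    create_error = None
except CollectionExistsError as exc:
    create_error = str(exc)

reused = client.get_or_create("my_documents")
print(create_error)
print(reused)
```

In this sketch the strict path fails on the second run while the get-or-create path hands back the stored documents, which is the behaviour the original poster is asking for.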

Hello @savank7 and thank you for reporting this issue. Our team most likely won't have capacity to address the issue next week but we'll get to it as soon as we can. In the meantime, if there is any update from your side, please let us know. Thank you for your patience.

julian-risch (Apr 11 '25 13:04)

@julian-risch I was having the same issue described here. However, if you install the current version directly from Git, the problematic code is fixed and loading from disk works fine. Only the version currently on PyPI seems to have this issue, so it would just need to be rebuilt and pushed to PyPI.

jkcarney (Apr 14 '25 15:04)

It was fixed in 3.1.0. (See https://github.com/deepset-ai/haystack/discussions/9183)

anakin87 (May 19 '25 15:05)
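
Since the fix is reported for 3.1.0, one way to check whether an environment already has it is to compare the installed version against that number. This is a minimal sketch using only the standard library. The distribution name `chroma-haystack` is an assumption (the thread never names the PyPI package), and the helpers `parse_release`, `is_at_least`, and `installed_version` are hypothetical names introduced here.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: the integration is published under this distribution name.
DIST_NAME = "chroma-haystack"
FIXED_IN = "3.1.0"  # version reported as fixed in the thread


def parse_release(text):
    """Turn '3.1.0' into (3, 1, 0), stopping at the first non-numeric part."""
    numbers = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def is_at_least(installed, minimum):
    """Compare two dotted versions numerically, padding with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version(DIST_NAME)
if found is None:
    print(f"{DIST_NAME} is not installed in this environment")
elif is_at_least(found, FIXED_IN):
    print(f"{DIST_NAME} {found} should include the fix")
else:
    print(f"{DIST_NAME} {found} is older than {FIXED_IN}; consider upgrading")
```

The comparison is deliberately simple and ignores pre-release suffixes, so it is a rough check and does not replace a full version parser.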