Chroma VectorStore document cannot be updated

Open adieyal opened this issue 1 year ago • 5 comments

System Info

Given how Chroma results are converted to Documents, I don't think it's possible to update those documents, since the id is not stored.

Here is the current implementation

Would it make sense to add the id into the document metadata?

Who can help?

@jeffchuber @claust

Information

  • [ ] The official example notebooks/scripts
  • [ ] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [ ] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [ ] Document Loaders
  • [X] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [ ] Tools / Toolkits
  • [ ] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

This is a design question rather than a bug. Any request such as similarity_search returns List[Document], but these documents don't contain the original Chroma uuid.
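To make the gap concrete, here is a minimal sketch that does not depend on LangChain or Chroma. The `Document` dataclass is a stand-in for `langchain.schema.Document`, `results_to_docs` is a hypothetical conversion, and `sample_results` is made-up data in the nested-list shape that the patch later in this thread indexes into. It only illustrates that a conversion which never reads `results["ids"]` leaves no way to get from a Document back to its stored record.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Stand-in for langchain.schema.Document (assumption: two fields suffice here)."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def results_to_docs(results: dict[str, Any]) -> list[Document]:
    """Hypothetical conversion that reads documents and metadatas but never ids."""
    return [
        Document(page_content=text, metadata=meta or {})
        for text, meta in zip(results["documents"][0], results["metadatas"][0])
    ]


# Made-up query result; the outer list has one entry per query.
sample_results = {
    "ids": [["doc-1", "doc-2"]],
    "documents": [["first text", "second text"]],
    "metadatas": [[{"source": "a.txt"}, {"source": "b.txt"}]],
    "distances": [[0.12, 0.34]],
}

docs = results_to_docs(sample_results)
# Each Document carries only its text and stored metadata;
# "doc-1" and "doc-2" appear nowhere on the returned objects.
```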

Expected behavior

Some way to be able to change the metadata of a document and store the changes in chroma, even if it isn't part of the VectorStore interface.

adieyal avatar May 06 '23 11:05 adieyal

@adieyal - this would be great! It would also allow easy updates, deletions, etc.

@hwchase17 what are your thoughts?

jeffchuber avatar May 13 '23 21:05 jeffchuber

In case it's useful to anyone, I've temporarily patched _results_to_docs_and_scores to include the chroma id, e.g.

from typing import Any
from unittest.mock import patch
from uuid import UUID

from langchain.schema import Document
from langchain.vectorstores import Chroma, chroma

def my_results_to_docs_and_scores(results: Any) -> list[tuple[Document, float]]:
    """
    A function to monkeypatch langchain's _results_to_docs_and_scores function
    so that each Document also carries the original chroma id.
    """

    return [
        (
            Document(
                page_content=result[0],
                metadata={**result[1], "_chroma_id": UUID(result[3])},
            ),
            result[2],
        )
        for result in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
            results["ids"][0],
        )
    ]

and then

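vectorstore = Chroma(...)

def search_with_ids(query, where=None):
    # new= swaps in the function itself; return_value= would make the
    # patched name return the function object instead of calling it.
    with patch.object(
        chroma,
        "_results_to_docs_and_scores",
        new=my_results_to_docs_and_scores,
    ):
        # The patch only takes effect on calls that go through
        # _results_to_docs_and_scores; _collection.get() bypasses it
        # and returns no distances.
        return vectorstore.similarity_search_with_score(query, filter=where)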

The original Chroma id is then stored in mydoc.metadata["_chroma_id"].

adieyal avatar May 15 '23 08:05 adieyal

I'd like to help here!

@adieyal should we store the user ids (e.g. https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/chroma.py#L155) or the Chroma ids? The user ids seem more useful to the user for updating.

jeffchuber avatar May 15 '23 17:05 jeffchuber

Based on that comment, I think I was referring to user-ids - I didn't realise there was a difference.

adieyal avatar May 15 '23 19:05 adieyal

@adieyal there are the ids the user provides to the database (for tracking purposes), and then Chroma also generates a uuid for each embedding - https://github.com/chroma-core/chroma/blob/main/chromadb/db/clickhouse.py#L31

I think the user-provided ids (which, in the LangChain case, are autogenerated if not provided) make the most sense to store on the LangChain side.
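As a rough sketch of what storing the user-provided ids on the LangChain side could look like: the `Document` dataclass again stands in for `langchain.schema.Document`, and the `_id` metadata key, both helper functions, and the sample data are hypothetical names chosen for illustration, not part of LangChain or Chroma. The idea is to copy each id into the Document's metadata on the way out, then split edited Documents back into parallel lists for an update.

```python
from dataclasses import dataclass, field
from typing import Any

ID_KEY = "_id"  # hypothetical metadata key for the user-provided id


@dataclass
class Document:
    """Stand-in for langchain.schema.Document."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def results_to_docs_with_ids(results: dict[str, Any]) -> list[Document]:
    """Attach each result's id to the Document metadata under ID_KEY."""
    return [
        Document(page_content=text, metadata={**(meta or {}), ID_KEY: doc_id})
        for text, meta, doc_id in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["ids"][0],
        )
    ]


def build_update_payload(docs: list[Document]) -> dict[str, list[Any]]:
    """Split Documents into parallel lists, keeping ID_KEY out of stored metadata."""
    return {
        "ids": [d.metadata[ID_KEY] for d in docs],
        "documents": [d.page_content for d in docs],
        "metadatas": [
            {k: v for k, v in d.metadata.items() if k != ID_KEY} for d in docs
        ],
    }


# Made-up query result in the nested-list shape discussed in this thread.
sample_results = {
    "ids": [["doc-1", "doc-2"]],
    "documents": [["first text", "second text"]],
    "metadatas": [[{"source": "a.txt"}, {"source": "b.txt"}]],
    "distances": [[0.12, 0.34]],
}

docs = results_to_docs_with_ids(sample_results)
docs[0].metadata["reviewed"] = True  # edit metadata on the LangChain side
payload = build_update_payload(docs)
```

The resulting `payload` has the ids, documents, and metadatas as parallel lists, which is the form a collection-level update call would presumably accept; whether and how that gets exposed through the VectorStore interface is the open design question here.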

jeffchuber avatar May 16 '23 03:05 jeffchuber

Hi, @adieyal! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue is about the inability to update Chroma VectorStore documents because the document ID is not stored. There has been a discussion with @jeffchuber and @hwchase17, where @jeffchuber offered to help and asked about storing user-ids or chroma ids. You clarified that you were referring to user-ids, and @jeffchuber explained the difference between user-ids and chroma ids. It seems that the resolution to this issue is to store user-ids in the document metadata to enable updates to be stored in Chroma.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

dosubot[bot] avatar Sep 12 '23 16:09 dosubot[bot]