langchain
Chroma VectorStore document cannot be updated
System Info
Given how Chroma results are converted to Documents, I don't think it's possible to update those documents, since the id is not stored.
Here is the current implementation
Would it make sense to add the id into the document metadata?
Who can help?
@jeffchuber @claust
Information
- [ ] The official example notebooks/scripts
- [ ] My own modified scripts
Related Components
- [ ] LLMs/Chat Models
- [ ] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [ ] Document Loaders
- [X] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [ ] Tools / Toolkits
- [ ] Chains
- [ ] Callbacks/Tracing
- [ ] Async
Reproduction
This is a design question rather than a bug. Any request such as similarity_search returns List[Document] but these documents don't contain the original chroma uuid.
Expected behavior
Some way to be able to change the metadata of a document and store the changes in chroma, even if it isn't part of the VectorStore interface.
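A minimal sketch of what such an update could look like, assuming each document's metadata carries its id under a key named "id" (that key name is an assumption, nothing in LangChain sets it today). A chromadb collection exposes update(ids=..., metadatas=...); the FakeCollection below is a hypothetical in-memory stand-in so the example runs without a database, and update_document_metadata is an illustrative helper, not an existing LangChain function.

```python
from typing import Any, Dict, List


class FakeCollection:
    """In-memory stand-in for a Chroma collection (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.metadatas: Dict[str, Dict[str, Any]] = {}

    def update(self, ids: List[str], metadatas: List[Dict[str, Any]]) -> None:
        # Overwrite the stored metadata for each given id.
        for doc_id, metadata in zip(ids, metadatas):
            self.metadatas[doc_id] = metadata


def update_document_metadata(
    collection: Any,
    doc_id: str,
    current: Dict[str, Any],
    changes: Dict[str, Any],
    id_key: str = "id",  # assumed metadata key holding the document id
) -> Dict[str, Any]:
    """Merge `changes` into `current` and write the result back by id."""
    # Drop the bookkeeping key so the id is not duplicated inside the stored metadata.
    merged = {k: v for k, v in {**current, **changes}.items() if k != id_key}
    collection.update(ids=[doc_id], metadatas=[merged])
    return merged


collection = FakeCollection()
collection.metadatas["doc-1"] = {"source": "a.txt"}

# Metadata as it might come back from a search, with the id attached.
doc_metadata = {"source": "a.txt", "id": "doc-1"}
stored = update_document_metadata(
    collection, doc_metadata["id"], doc_metadata, {"reviewed": True}
)
```

With a real vector store, the collection would be the underlying Chroma collection (the thread reaches it through the private vectorstore._collection attribute), so this only works once the id is actually available on the returned documents.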
@adieyal - this would be great! it would also allow easy updates, deletions, etc.
@hwchase17 what are your thoughts?
In case it's useful to anyone, I've temporarily patched _results_to_docs_and_scores to include the chroma id, e.g.
from typing import Any
from unittest.mock import patch
from uuid import UUID

from langchain.schema import Document
from langchain.vectorstores import Chroma, chroma


def my_results_to_docs_and_scores(results: Any) -> list[tuple[Document, float]]:
    """
    Replacement for langchain's _results_to_docs_and_scores that also stores
    the original chroma id in each document's metadata.
    """
    return [
        (
            Document(
                page_content=result[0],
                # Metadata may be None if none was stored; UUID() assumes the id is a uuid string.
                metadata={**(result[1] or {}), "_chroma_id": UUID(result[3])},
            ),
            result[2],
        )
        for result in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
            results["ids"][0],
        )
    ]
and then
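vectorstore = Chroma(...)
with patch.object(
    chroma,
    "_results_to_docs_and_scores",
    # new= swaps in the function itself; return_value= would only make a mock return it uncalled.
    new=my_results_to_docs_and_scores,
):
    # The patch only affects calls that go through _results_to_docs_and_scores,
    # such as similarity_search_with_score; a raw vectorstore._collection.get(...)
    # bypasses it. `query` and `where` are placeholders.
    docs_and_scores = vectorstore.similarity_search_with_score(query, filter=where or None)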
The original chroma id is then stored in mydoc.metadata["_chroma_id"].
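One detail worth checking when monkeypatching this way is which keyword is passed to patch.object. The sketch below uses only the standard library; fake_module is a hypothetical stand-in for a module such as langchain.vectorstores.chroma, and convert and my_convert are illustrative names. It shows that return_value= makes the mock hand back the function object without calling it, while new= installs the function so it actually runs.

```python
from types import SimpleNamespace
from unittest.mock import patch

# Hypothetical stand-in for a module that exposes a conversion function.
fake_module = SimpleNamespace(convert=lambda results: "original")


def my_convert(results):
    return "patched"


with patch.object(fake_module, "convert", return_value=my_convert):
    # The attribute is a MagicMock whose return value is the function object itself.
    via_return_value = fake_module.convert({})

with patch.object(fake_module, "convert", new=my_convert):
    # The attribute is the replacement function, so the call executes it.
    via_new = fake_module.convert({})

# Outside the context managers the original attribute is restored.
after_patch = fake_module.convert({})
```

Passing side_effect=my_convert would also run the function, but new= keeps the replacement a plain function rather than a mock.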
I'd like to help here!
@adieyal should we store the user ids (e.g. https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/chroma.py#L155) or the chroma ids? The user ids seem more useful to the user for updating.
Based on that comment, I think I was referring to user-ids - I didn't realise there was a difference.
@adieyal there are the ids the user provides to the database (for tracking purposes), and then chroma also generates a uuid for each embedding - https://github.com/chroma-core/chroma/blob/main/chromadb/db/clickhouse.py#L31
I think the user-provided ids (which, in the LangChain case, are autogenerated if not provided) make the most sense to store on the LangChain side.
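As a rough sketch of that direction, the conversion step could copy each user-provided id into the document metadata. The nested result layout (ids, documents, metadatas, distances) follows the patch shown earlier in this thread; the Doc dataclass is a hypothetical stand-in for langchain.schema.Document, and the "id" metadata key and the function name are assumptions, not LangChain's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Doc:
    """Minimal stand-in for langchain.schema.Document (hypothetical)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def results_to_docs_with_ids(
    results: Dict[str, Any], id_key: str = "id"  # assumed metadata key
) -> List[Tuple[Doc, float]]:
    """Convert a Chroma-style query result into (Doc, distance) pairs, keeping each id."""
    return [
        # `meta or {}` guards against entries stored without metadata.
        (Doc(page_content=text, metadata={**(meta or {}), id_key: doc_id}), distance)
        for text, meta, distance, doc_id in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
            results["ids"][0],
        )
    ]


# Example result for a single query, in the nested-list layout used above.
sample = {
    "ids": [["doc-1", "doc-2"]],
    "documents": [["first text", "second text"]],
    "metadatas": [[{"source": "a.txt"}, None]],
    "distances": [[0.12, 0.34]],
}
pairs = results_to_docs_with_ids(sample)
```

Storing the id as a plain string avoids assuming that user-provided ids are valid UUIDs.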
Hi, @adieyal! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue is about the inability to update Chroma VectorStore documents because the document ID is not stored. There has been a discussion with @jeffchuber and @hwchase17, where @jeffchuber offered to help and asked about storing user-ids or chroma ids. You clarified that you were referring to user-ids, and @jeffchuber explained the difference between user-ids and chroma ids. It seems that the resolution to this issue is to store user-ids in the document metadata to enable updates to be stored in Chroma.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository!