langchain
Avoiding recomputation of embeddings with Chroma
If I'm reading correctly, this is the function to add_texts to Chroma
def add_texts(
    self,
    texts: Iterable[str],
    metadatas: Optional[List[dict]] = None,
    ids: Optional[List[str]] = None,
    **kwargs: Any,
) -> List[str]:
    """Run more texts through the embeddings and add to the vectorstore.

    Args:
        texts (Iterable[str]): Texts to add to the vectorstore.
        metadatas (Optional[List[dict]], optional): Optional list of metadatas.
        ids (Optional[List[str]], optional): Optional list of IDs.

    Returns:
        List[str]: List of IDs of the added texts.
    """
    # TODO: Handle the case where the user doesn't provide ids on the Collection
    if ids is None:
        ids = [str(uuid.uuid1()) for _ in texts]
    embeddings = None
    if self._embedding_function is not None:
        embeddings = self._embedding_function.embed_documents(list(texts))
    self._collection.add(
        metadatas=metadatas, embeddings=embeddings, documents=texts, ids=ids
    )
    return ids
It does not seem to check if the texts are already inside the database. This means it's very easy to duplicate work when running indexing jobs incrementally. What's more, the Chroma class from langchain.vectorstores does not seem to expose functions to see if some text is already inside the vector store.
What's the preferred way of dealing with this? I can of course set up a separate db that keeps track of hashes of text inside the Chromadb, but this seems unnecessarily clunky and something that you'd expect the db to do for you.
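One way to make repeated indexing runs detectable is to derive each id from the text itself instead of from uuid.uuid1(), and pass the result through the ids argument shown in the signature above. The following is a minimal sketch, not part of LangChain or Chroma: content_ids is a hypothetical helper name, and how the store treats a repeated id depends on the Chroma version, so this only provides a stable key to check against.

```python
import hashlib
from typing import Iterable, List


def content_ids(texts: Iterable[str]) -> List[str]:
    """Return one SHA-256 hex digest per text, for use as a stable id.

    Hypothetical helper: the same text always yields the same id,
    so a second indexing run can recognise what it has already seen.
    """
    return [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in texts]


texts = ["hello world", "hello world", "goodbye"]
ids = content_ids(texts)

# Identical texts share an id; different texts get different ids.
print(ids[0] == ids[1], ids[0] == ids[2])
```

The ids could then be passed as add_texts(texts, ids=ids), after first deciding what to do with texts whose id is already known.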
Not a maintainer myself, but as someone dealing with a similar class of issues with the Chroma integration:
- Checking the contents of a populated Chroma instance:
  - It's ugly, but you can access the underlying _collection property and use its get method to request subsets of the stored data based on id, metadata filtering, etc. I'm assuming metadata filtering is more optimized, but the where_document arg can provide you text search over the stored document contents.
- Enforcing idempotent document addition:
  - Chroma itself states that their datastore will not enforce uniqueness even of the ids you provide to accompany documents. Unfortunately that probably means you're out of luck in terms of the db doing any of your work for you here. Even some scheme using a text hash as the id wouldn't work.
  - You could store a text hash, or some other quick, deterministic identifier for a piece of text, in the metadata and look for it with _collection.get prior to adding new documents. This at least means you don't have to track this information in a secondary database, but I'm not sure of the performance of this approach, so a secondary database may be necessary anyway for your use case.
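The metadata-hash idea above can be sketched roughly as follows. This is an illustration under assumptions, not the library's method: text_hash, split_new_texts, the "text_hash" metadata key, and FakeCollection are all hypothetical names, and the only behaviour assumed of the real collection is that get(where=...) returns a dict with an "ids" list. It issues one lookup per text, so it says nothing about performance on large batches.

```python
import hashlib
from typing import Any, Dict, List, Set, Tuple


def text_hash(text: str) -> str:
    """Deterministic identifier for a piece of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_new_texts(
    collection: Any, texts: List[str]
) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return the texts not yet stored, plus metadata carrying their hash."""
    new_texts: List[str] = []
    new_metadatas: List[Dict[str, str]] = []
    seen: Set[str] = set()
    for text in texts:
        digest = text_hash(text)
        if digest in seen:  # duplicate inside this batch
            continue
        seen.add(digest)
        # Assumption: get(where=...) returns a dict with an "ids" list.
        if not collection.get(where={"text_hash": digest})["ids"]:
            new_texts.append(text)
            new_metadatas.append({"text_hash": digest})
    return new_texts, new_metadatas


class FakeCollection:
    """Hypothetical in-memory stand-in, used only to make the sketch runnable."""

    def __init__(self) -> None:
        self._rows: List[Dict[str, Any]] = []

    def add(self, ids: List[str], documents: List[str], metadatas: List[dict]) -> None:
        for id_, doc, meta in zip(ids, documents, metadatas):
            self._rows.append({"id": id_, "document": doc, "metadata": meta})

    def get(self, where: Dict[str, str]) -> Dict[str, List[str]]:
        matches = [
            row for row in self._rows
            if all(row["metadata"].get(k) == v for k, v in where.items())
        ]
        return {"ids": [row["id"] for row in matches]}


store = FakeCollection()
first, first_meta = split_new_texts(store, ["alpha", "beta"])
store.add(ids=["1", "2"], documents=first, metadatas=first_meta)

# "beta" is already stored and "gamma" appears twice, so only one text remains.
second, second_meta = split_new_texts(store, ["beta", "gamma", "gamma"])
print(first, second)
```

With the LangChain wrapper, the collection argument would be the private _collection attribute mentioned above, and the filtered lists would go to add_texts(new_texts, metadatas=new_metadatas), so only unseen texts are embedded.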
Hi everyone, I work on Chroma.
Lack of uniqueness was a constraint of some previous architectural decisions we have made, but those are being rectified in a large refactor we are working on: https://github.com/chroma-core/chroma/pull/214
Hi, this sounds like a pretty standard use case to me, and the corresponding refactor seems to have stalled or changed. Any ideas other than storing a text hash as metadata?
@murbard @jppaolim would upsert work in this case? Chroma now supports upsert.
cc @ejdb00
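A rough sketch of how upsert could be combined with content-derived ids so that re-running an indexing job does not create duplicate rows. It assumes the collection exposes upsert(ids=..., documents=...) and count(); upsert_texts and FakeCollection are hypothetical names, and the stand-in only mimics the "insert or overwrite by id" behaviour. As far as I understand, upsert prevents duplicate rows but may still embed every document it is given, so skipping known ids beforehand is what would actually save the embedding cost.

```python
import hashlib
from typing import Any, Dict, List


def upsert_texts(collection: Any, texts: List[str]) -> List[str]:
    """Upsert texts keyed by their SHA-256 digest and return the ids used."""
    ids = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    # Collapse duplicate texts within the batch so each id is sent once.
    unique: Dict[str, str] = dict(zip(ids, texts))
    collection.upsert(ids=list(unique.keys()), documents=list(unique.values()))
    return list(unique.keys())


class FakeCollection:
    """Hypothetical stand-in that inserts or overwrites by id."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def upsert(self, ids: List[str], documents: List[str]) -> None:
        for id_, doc in zip(ids, documents):
            self._docs[id_] = doc

    def count(self) -> int:
        return len(self._docs)


store = FakeCollection()
upsert_texts(store, ["alpha", "beta"])
upsert_texts(store, ["beta", "gamma", "gamma"])

# Three distinct texts were seen across both runs, so three rows exist.
total = store.count()
print(total)
```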
Yes! For my use case I feel that it will solve this issue! Now, in the context of document queries, to be honest there is another pb, which is the update of documents, but this goes further than this particular GitHub issue ;) Thanks!
What does "another pb" mean?
Maybe I should open another issue and/or comment on 560?
The use case is that I have a bunch of files I want to do QA over. But this file base evolves over time, either because of new files (the easy case) or because some files are updated. So in order not to calculate all embeddings every time, I need to keep track of which embeddings I have already calculated, remove the embeddings for the "chunks" that don't exist anymore, etc. I wonder if I should start coding all that manually using Chroma metadata or if some other solution can help.
Thanks for the support in any case.
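The bookkeeping described above (embed only new chunks, drop chunks that no longer exist) can be sketched as a set comparison over content-derived ids. This is a minimal illustration under assumptions: chunk_id and plan_sync are hypothetical helpers, and it treats an edited chunk as one deletion plus one addition, since its hash changes.

```python
import hashlib
from typing import Dict, Iterable, List, Set, Tuple


def chunk_id(chunk: str) -> str:
    """Stable id derived from the chunk text."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def plan_sync(
    stored_ids: Set[str], current_chunks: Iterable[str]
) -> Tuple[Dict[str, str], List[str]]:
    """Compare stored ids with the current chunks.

    Returns (to_add, to_delete): to_add maps new ids to their text,
    to_delete lists stored ids whose chunk no longer exists.
    """
    current = {chunk_id(chunk): chunk for chunk in current_chunks}
    to_add = {id_: text for id_, text in current.items() if id_ not in stored_ids}
    to_delete = sorted(stored_ids - set(current))
    return to_add, to_delete


# Ids already in the store, then the chunks produced by the latest file scan.
stored = {chunk_id("old paragraph"), chunk_id("unchanged paragraph")}
to_add, to_delete = plan_sync(stored, ["unchanged paragraph", "new paragraph"])

print(list(to_add.values()), len(to_delete))
```

Only the texts in to_add would need embedding. With a Chroma collection, the stored ids could presumably be read with get() and stale ones removed with delete(ids=to_delete), though I have not measured how either behaves on large collections.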
Gotcha, that makes sense! I opened an issue in Chroma to track this: https://github.com/chroma-core/chroma/issues/560
I think someone on our end is picking up pieces of it, let me know if you'd like to contribute!
Hi, @murbard! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue is about avoiding recomputation of embeddings with Chroma. The current function to add texts to Chroma does not check if the texts are already in the database, leading to duplication of work. The Chroma maintainer acknowledges the issue and mentions that a refactor is being worked on to rectify the lack of uniqueness constraint. They also suggest using the upsert feature as a possible solution. Another user mentions a related issue regarding updating documents and the need to keep track of calculated embeddings. The Chroma maintainer opens a new issue to track this and invites contributions.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your understanding and contributions to the LangChain repository!