langchain Avoiding recomputation of embeddings with Chroma

If I'm reading correctly, this is the add_texts function that adds texts to Chroma:

    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        ids: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> List[str]:
        """Run more texts through the embeddings and add to the vectorstore.

        Args:
            texts (Iterable[str]): Texts to add to the vectorstore.
            metadatas (Optional[List[dict]], optional): Optional list of metadatas.
            ids (Optional[List[str]], optional): Optional list of IDs.

        Returns:
            List[str]: List of IDs of the added texts.
        """
        # TODO: Handle the case where the user doesn't provide ids on the Collection
        if ids is None:
            ids = [str(uuid.uuid1()) for _ in texts]
        embeddings = None
        if self._embedding_function is not None:
            embeddings = self._embedding_function.embed_documents(list(texts))
        self._collection.add(
            metadatas=metadatas, embeddings=embeddings, documents=texts, ids=ids
        )
        return ids

It does not seem to check if the texts are already inside the database. This means it's very easy to duplicate work when running indexing jobs incrementally. What's more, the Chroma class from langchain.vectorstores does not seem to expose functions to see if some text is already inside the vector store.

What's the preferred way of dealing with this? I can of course set up a separate db that keeps track of hashes of text inside the Chromadb, but this seems unnecessarily clunky and something that you'd expect the db to do for you.

Mar 20 '23 17:03 murbard

Not a maintainer myself, but as someone dealing with a similar class of issues with the Chroma integration:

Checking the contents of a populated Chroma instance:
- It's ugly, but you can access the underlying _collection property and use its get method to request subsets of the stored data by id, metadata filter, etc.
- I'm assuming metadata filtering is better optimized, but the where_document arg gives you text search over the stored document contents.
Enforcing idempotent document addition:
- Chroma itself states that its datastore will not enforce uniqueness, even of the ids you provide alongside documents. Unfortunately that probably means the db won't do any of this work for you. Even a scheme that uses the text hash as the id wouldn't work.
- You could store a text hash, or some other quick, deterministic identifier for a piece of text, in the metadata and look for it with _collection.get before adding new documents. That at least means you don't have to track this information in a secondary database, but I'm not sure how this approach performs, so a secondary database may still be necessary for your use case.
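
A minimal sketch of the metadata-hash idea above, not a tested recipe. It assumes the collection behaves like a Chroma collection, i.e. get(where=...) returns a dict with an "ids" list and add(...) accepts ids, embeddings, documents and metadatas. The names text_hash, add_unseen_texts, InMemoryCollection and fake_embed are hypothetical and exist only for illustration; InMemoryCollection is a stand-in so the example runs without Chroma installed.

```python
import hashlib
from typing import Callable, Dict, Iterable, List


def text_hash(text: str) -> str:
    """Deterministic identifier for a piece of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def add_unseen_texts(
    collection,
    embed_documents: Callable[[List[str]], List[List[float]]],
    texts: Iterable[str],
) -> List[str]:
    """Embed and add only the texts whose hash is not already stored."""
    pending: Dict[str, str] = {}  # hash -> text; also dedupes within the batch
    for text in texts:
        digest = text_hash(text)
        if digest in pending:
            continue
        # One metadata lookup per text; performance on large stores is untested.
        if collection.get(where={"text_hash": digest})["ids"]:
            continue
        pending[digest] = text
    if not pending:
        return []
    hashes = list(pending)
    documents = [pending[h] for h in hashes]
    collection.add(
        ids=hashes,
        embeddings=embed_documents(documents),  # only new texts are embedded
        documents=documents,
        metadatas=[{"text_hash": h} for h in hashes],
    )
    return hashes


class InMemoryCollection:
    """Hypothetical stand-in for a Chroma collection, for demonstration only."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def get(self, where: Dict[str, str]) -> dict:
        matches = [
            r for r in self.rows
            if all(r["metadata"].get(k) == v for k, v in where.items())
        ]
        return {"ids": [r["id"] for r in matches]}

    def add(self, ids, embeddings, documents, metadatas) -> None:
        for i, e, d, m in zip(ids, embeddings, documents, metadatas):
            self.rows.append({"id": i, "embedding": e, "document": d, "metadata": m})


embedded: List[str] = []  # records every text that actually gets embedded


def fake_embed(docs: List[str]) -> List[List[float]]:
    embedded.extend(docs)
    return [[float(len(d))] for d in docs]


store = InMemoryCollection()
add_unseen_texts(store, fake_embed, ["alpha", "beta"])
add_unseen_texts(store, fake_embed, ["beta", "gamma"])  # "beta" is skipped
```

With a real store, the langchain wrapper's _collection and the embedding object's embed_documents method would presumably take the place of the stand-ins.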

Mar 21 '23 21:03 ejdb00

Hi everyone, I work on Chroma.

Lack of uniqueness was a constraint of some previous architectural decisions we have made, but those are being rectified in a large refactor we are working on: https://github.com/chroma-core/chroma/pull/214

Mar 29 '23 22:03 jeffchuber

Hi, this sounds like a pretty standard use case to me, and the corresponding refactor seems to have stalled or changed. Any ideas other than storing a text hash as metadata?

May 14 '23 18:05 jppaolim

@murbard @jppaolim would upsert work in this case? Chroma now supports upsert.

cc @ejdb00
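
A rough sketch of how upsert could be combined with content-derived ids so that re-running an indexing job skips texts that are already stored. It assumes a Chroma-like collection where get(ids=...) returns only the ids that exist and upsert(...) accepts ids, embeddings and documents. The names content_id, upsert_texts, DictCollection and counting_embed are hypothetical; DictCollection is an in-memory stand-in so the example runs on its own.

```python
import hashlib
from typing import Callable, Dict, List


def content_id(text: str) -> str:
    """Use a hash of the text as its id, so the same text always maps to the same row."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert_texts(
    collection,
    embed_documents: Callable[[List[str]], List[List[float]]],
    texts: List[str],
) -> List[str]:
    """Embed and upsert only texts whose id is not stored yet; return the new ids."""
    by_id: Dict[str, str] = {content_id(t): t for t in texts}
    ids = list(by_id)
    if not ids:
        return []
    already_stored = set(collection.get(ids=ids)["ids"])
    missing = [i for i in ids if i not in already_stored]
    if missing:
        docs = [by_id[i] for i in missing]
        # upsert rather than add, so a repeated or overlapping run overwrites
        # the row instead of duplicating it.
        collection.upsert(ids=missing, embeddings=embed_documents(docs), documents=docs)
    return missing


class DictCollection:
    """Hypothetical stand-in for a Chroma collection, for demonstration only."""

    def __init__(self) -> None:
        self.rows: Dict[str, tuple] = {}

    def get(self, ids: List[str]) -> dict:
        return {"ids": [i for i in ids if i in self.rows]}

    def upsert(self, ids, embeddings, documents) -> None:
        for i, e, d in zip(ids, embeddings, documents):
            self.rows[i] = (e, d)


embed_calls: List[List[str]] = []  # one entry per call to the embedder


def counting_embed(docs: List[str]) -> List[List[float]]:
    embed_calls.append(list(docs))
    return [[float(len(d))] for d in docs]


coll = DictCollection()
first = upsert_texts(coll, counting_embed, ["one", "two"])
second = upsert_texts(coll, counting_embed, ["two", "three"])  # only "three" is new
```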

May 15 '23 17:05 jeffchuber

Yes! For my use case I think it will solve this issue! To be honest, in the context of querying documents there is another pb, which is updating documents, but that goes beyond this particular GitHub issue ;) Thanks!

May 15 '23 19:05 jppaolim

What does "another pb" mean?

May 16 '23 03:05 jeffchuber

Maybe I should open another issue and/or comment on 560?

The use case is that I have a bunch of files I want to do QA over, but this file base evolves over time, either because new files are added (the easy case) or because some files are updated. So, to avoid recalculating all embeddings every time, I need to keep track of which embeddings I have already calculated, remove the embeddings for the "chunks" that no longer exist, and so on. I wonder whether I should start coding all that manually using Chroma metadata or whether some other solution can help.
Thanks for the support in any case.
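
One way to picture the bookkeeping described above is as a diff between the chunk ids already stored for a file and the chunk ids of its current version. The sketch below is only that diff, under the assumption that each chunk was stored with an id derived from the file path and the chunk text; chunk_id and plan_sync are hypothetical names, not part of Chroma or LangChain.

```python
import hashlib
from typing import Dict, Iterable, List, Set, Tuple


def chunk_id(source: str, chunk: str) -> str:
    """Id tied to both the file and the chunk content."""
    return hashlib.sha256(f"{source}\x00{chunk}".encode("utf-8")).hexdigest()


def plan_sync(
    source: str, chunks: List[str], stored_ids: Iterable[str]
) -> Tuple[Dict[str, str], Set[str]]:
    """Return (chunks that still need embedding, stored ids that are now stale)."""
    stored = set(stored_ids)
    current = {chunk_id(source, c): c for c in chunks}
    to_embed = {i: c for i, c in current.items() if i not in stored}
    to_delete = stored - set(current)
    return to_embed, to_delete


# Version 1 of a file was indexed with two chunks.
path = "notes.md"
stored_v1 = [chunk_id(path, "intro"), chunk_id(path, "old section")]

# Version 2 keeps "intro" and replaces the other chunk.
to_embed, to_delete = plan_sync(path, ["intro", "new section"], stored_v1)
```

Here only "new section" would need an embedding, and the id of "old section" would be removed. With Chroma, stored_ids could plausibly come from a metadata query such as get(where={"source": path}) if a source key was stored with each chunk (an assumption), and the stale ids could be passed to the collection's delete(ids=...).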

May 17 '23 10:05 jppaolim

Gotcha, that makes sense! I opened an issue in Chroma to track this: https://github.com/chroma-core/chroma/issues/560

I think someone on our end is picking up pieces of it, let me know if you'd like to contribute!

May 17 '23 16:05 jeffchuber

Hi, @murbard! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue is about avoiding recomputation of embeddings with Chroma. The current function to add texts to Chroma does not check if the texts are already in the database, leading to duplication of work. The Chroma maintainer acknowledges the issue and mentions that a refactor is being worked on to rectify the lack of uniqueness constraint. They also suggest using the upsert feature as a possible solution. Another user mentions a related issue regarding updating documents and the need to keep track of calculated embeddings. The Chroma maintainer opens a new issue to track this and invites contributions.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contributions to the LangChain repository!

Sep 21 '23 16:09 dosubot[bot]
