Issue: Chroma.from_documents does not save metadata properly

Open YanaSSS opened this issue 8 months ago • 5 comments

Issue you'd like to raise.

I create my vector db using the following code:

db = Chroma.from_documents(
    chunked_documents,
    embeddings,
    persist_directory=db_path + '/' + db_type,
    client_settings=chroma_settings,
)

chunked_documents is a list of elements of type Document. I have added metadata which is a simple numerical id: {'id': 1}

embeddings = HuggingFaceInstructEmbeddings(
    model_name=args.embedding_model,
    model_kwargs={"device": args.device},
)
CHROMA_SETTINGS = Settings(
    anonymized_telemetry=False,
    is_persistent=True,
)

When I run db.similarity_search(query, k=3), the metadata dict is empty for some of the returned results. Has anyone encountered this issue?

Note that when I create the db with the from_texts() method, passing the raw texts and the metadata separately, I do not encounter the issue: the results returned by db.similarity_search() contain the respective metadata.

Suggestion:

No response

YanaSSS avatar Nov 28 '23 12:11 YanaSSS
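
One way to narrow the report above down is to check, before indexing, whether every chunk still carries its metadata, and to build the parallel texts/metadatas lists that from_texts() takes. The following is only a minimal sketch: the Document class is a local stand-in for LangChain's Document, and find_missing_metadata and split_documents are hypothetical helper names, not LangChain APIs.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Local stand-in for LangChain's Document (assumption: text plus a metadata dict)."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def find_missing_metadata(docs: list[Document]) -> list[int]:
    """Return the positions of documents whose metadata dict is empty."""
    return [i for i, doc in enumerate(docs) if not doc.metadata]


def split_documents(docs: list[Document]) -> tuple[list[str], list[dict[str, Any]]]:
    """Build parallel texts/metadatas lists, the input shape from_texts() takes."""
    texts = [doc.page_content for doc in docs]
    metadatas = [doc.metadata for doc in docs]
    return texts, metadatas


# Hypothetical sample data: the second chunk has lost its metadata.
chunked_documents = [
    Document("first chunk", {"id": 1}),
    Document("second chunk"),
    Document("third chunk", {"id": 3}),
]

missing = find_missing_metadata(chunked_documents)
texts, metadatas = split_documents(chunked_documents)
print(missing)
```

If the list of missing positions is non-empty, the empty metadata was already present in the input, which would point at the chunking step rather than at the vector store.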

🤖

Hello @YanaSSS!

I'm Dosu, a bot here to assist you with any questions, issues, or contributions you might have in relation to LangChain. While we're waiting for a human maintainer to respond to your issue, I'm here to help guide you through the process. Nice to meet you!

Based on the information you've provided, it seems like the issue you're encountering is related to how the Chroma.from_documents method in LangChain handles metadata.

In the Chroma.from_documents method, if the metadatas argument is provided, the method checks for any discrepancies in the length between uris (images) and metadatas. If there is a difference, it fills the metadatas list with empty dictionaries to match the length of uris. Then, it separates the indices of empty and non-empty metadata into empty_ids and non_empty_ids respectively. For non-empty metadata, it performs an upsert operation to add the images, embeddings, and metadata to the collection. However, for the images with empty metadata, it only upserts the images and embeddings, but not the metadata. This is why you are seeing an empty metadata dict for some answers when running db.similarity_search(query, k=3).

On the other hand, the from_texts() method does not have this issue because it does not separate the metadata into empty and non-empty, and it upserts the metadata regardless of whether it is empty or not.

To fix this issue, you should modify the from_documents method to also upsert the empty metadata, similar to how the from_texts() method does it. You can find the relevant code in the libs/langchain/langchain/vectorstores/chroma.py file in the LangChain repository.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

Best, Dosu

dosubot[bot] avatar Nov 28 '23 12:11 dosubot[bot]
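
The partitioning step described in the reply above (pad the metadata list, then split indices into empty and non-empty) can be sketched in plain Python. This is illustrative only and is not taken from the LangChain source; partition_by_metadata is a hypothetical name, and whether the library behaves this way in a given version is an assumption to verify against the installed code.

```python
from typing import Any


def partition_by_metadata(
    metadatas: list[dict[str, Any]], n_items: int
) -> tuple[list[int], list[int]]:
    """Pad metadatas with empty dicts up to n_items, then split the indices.

    Returns (non_empty_ids, empty_ids) as lists of positions.
    """
    # Pad with fresh empty dicts so every item has a metadata entry.
    padding = [{} for _ in range(max(0, n_items - len(metadatas)))]
    padded = list(metadatas) + padding

    non_empty_ids = [i for i, meta in enumerate(padded) if meta]
    empty_ids = [i for i, meta in enumerate(padded) if not meta]
    return non_empty_ids, empty_ids


# Hypothetical input: three metadata dicts for four items.
non_empty_ids, empty_ids = partition_by_metadata([{"id": 1}, {}, {"id": 3}], 4)
print(non_empty_ids, empty_ids)
```

Under this scheme, an item only ends up in the empty group if its metadata dict was empty or missing on input, so a populated {'id': 1} would not be dropped by the split itself.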

@YanaSSS please specify your chromadb, langchain, python versions

umair313 avatar Nov 29 '23 12:11 umair313

chromadb==0.4.6
langchain==0.0.267
Python 3.10.8

YanaSSS avatar Dec 01 '23 12:12 YanaSSS

I am dealing with the opposite problem somehow.

from_documents() preserves any metadata I add, but from_texts(), which has a parameter for metadata, doesn't persist it correctly: when I instantiate a Chroma client afterwards, nothing is retrieved.

ccmilne avatar Dec 01 '23 22:12 ccmilne

Shouldn't the metadata value type be str, not int?

i-mprovising avatar Jan 25 '24 04:01 i-mprovising
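
On the question of value types, a quick check of the metadata values before indexing can rule the type out as a cause. The sketch below assumes the store accepts scalar values (str, int, float, bool); that set is an assumption to confirm against the installed chromadb version, and non_scalar_keys and stringify_values are hypothetical helper names.

```python
from typing import Any

# Assumed set of accepted metadata value types; verify for your chromadb version.
SCALAR_TYPES = (str, int, float, bool)


def non_scalar_keys(metadata: dict[str, Any]) -> list[str]:
    """Return the keys whose values are not plain scalars (e.g. lists or dicts)."""
    return [key for key, value in metadata.items() if not isinstance(value, SCALAR_TYPES)]


def stringify_values(metadata: dict[str, Any]) -> dict[str, str]:
    """Convert every value to str, as a fallback if only strings turn out to work."""
    return {key: str(value) for key, value in metadata.items()}


# Hypothetical metadata: an integer id plus a list, which is not a scalar.
sample = {"id": 1, "tags": ["a", "b"]}

print(non_scalar_keys(sample))
print(stringify_values({"id": 1}))
```

If the integer id passes the scalar check and the metadata is still missing after indexing, the value type is unlikely to be the cause.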