
ChromaDB alters the order of metadata text compared to the original

Open manitadayon opened this issue 1 year ago • 3 comments

What happened?

I am trying to embed a series of documents with ChromaDB. For each document I add an id, an embedding, and metadata (the original text) to the vector database. Assume the original documents are:

  1. The first car is blue.
  2. The second car is red.
  3. The third car is purple.

I save the embeddings, ids, and metadata (the original text above). However, the first row of embeddings and metadata corresponds to 2), not 1). Does ChromaDB reorder the rows/metadata while saving?

Versions

ChromaDB version: 0.5.3, OS: Linux, Python: 3.10

Relevant log output

No response

manitadayon avatar Jul 21 '24 03:07 manitadayon

@manitadayon, we haven't seen this behaviour. Let me confirm your use case is similar to the code below:

import chromadb


client = chromadb.PersistentClient('test_2552')

collection = client.get_or_create_collection("test_2552")

collection.add(
    ids=["blue","red","purple"],
    documents=["The first car is blue","The second car is red","The third car is purple"],
    metadatas=[{"color":"blue"},{"color":"red"},{"color":"purple"}]
)

res=collection.query(query_texts=["Which car is red?"])

print(res)

And what you are experiencing is that the metadata, e.g., the color for the red car, somehow ends up on the blue one?

tazarov avatar Jul 21 '24 05:07 tazarov

No, it is more like this: you pass documents=["The first car is blue","The second car is red","The third car is purple"] as the input for the embedding and immediately save the result in ChromaDB using collection.add, similar to what you have. The embeddings themselves look good, meaning each document is correctly embedded. However, the order of the metadata, and consequently of the embeddings, differs from the input, as if ChromaDB changes the ordering during insertion.

So let's say I pass the following as input, in this order: documents=["The first car is blue","The second car is red","The third car is purple"]

What I get for the metadata is like this:

metadatas=[{"text":"The first car is red"},{"text":"The first car is purple"},{"text":"The first car is blue"}]

What I add to ChromaDB is id, embeddings and metadata as follows:

import chromadb


client = chromadb.PersistentClient('test_2552')

collection = client.get_or_create_collection("test_2552")

collection.add(
    ids=["blue","red","purple"],
    embeddings=embeddings,  # precomputed vectors, one per id
    metadatas=[{"text":"document"},{"text":"document"},{"text":"document"}]
)

manitadayon avatar Jul 21 '24 06:07 manitadayon

@manitadayon,

Ok so in theory the following code should be what you have:



import chromadb
from chromadb.utils.embedding_functions import DefaultEmbeddingFunction

ef = DefaultEmbeddingFunction()

documents=["The first car is blue","The second car is red","The third car is purple"]
embeddings = ef(documents)

client = chromadb.PersistentClient('test_2552_v2')

collection = client.get_or_create_collection("test_2552")

collection.add(
    ids=["blue","red","purple"],
    embeddings=embeddings,
    metadatas=[{"text":"The first car is blue"},{"text":"The second car is red"},{"text":"The third car is purple"}]
)

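Running get() and query() on this collection returns each metadata entry aligned with its id in both cases. Note that the order of the results can differ from the insertion order, but ids and metadatas stay paired position by position: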

collection.get()

{'ids': ['blue', 'purple', 'red'],
 'embeddings': None,
 'metadatas': [{'text': 'The first car is blue'},
  {'text': 'The third car is purple'},
  {'text': 'The second car is red'}],
 'documents': [None, None, None],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

collection.query(query_texts=["Which car is red?"])

{'ids': [['red', 'purple', 'blue']],
 'distances': [[0.4380978786045734, 0.7631484670171297, 0.7817838278402404]],
 'metadatas': [[{'text': 'The second car is red'},
   {'text': 'The third car is purple'},
   {'text': 'The first car is blue'}]],
 'embeddings': None,
 'documents': [[None, None, None]],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}
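
If the insertion order matters downstream, one option is to re-align a get() result to your original id list on the client side. This is a minimal sketch, assuming the result is a plain dict shaped like the get() output above. `reorder_by_ids` is a hypothetical helper written for illustration and is not part of the chromadb API:

```python
def reorder_by_ids(result, wanted_ids):
    """Return ids and metadatas from a get()-style result in wanted_ids order."""
    # Map each returned id to its position in the result lists.
    position = {id_: i for i, id_ in enumerate(result["ids"])}
    missing = [id_ for id_ in wanted_ids if id_ not in position]
    if missing:
        raise KeyError(f"ids not present in result: {missing}")
    order = [position[id_] for id_ in wanted_ids]
    return {
        "ids": [result["ids"][i] for i in order],
        "metadatas": [result["metadatas"][i] for i in order],
    }


# Result copied from the get() output above (only the fields used here).
result = {
    "ids": ["blue", "purple", "red"],
    "metadatas": [
        {"text": "The first car is blue"},
        {"text": "The third car is purple"},
        {"text": "The second car is red"},
    ],
}

# Re-align to the order used in collection.add().
ordered = reorder_by_ids(result, ["blue", "red", "purple"])
print(ordered["metadatas"])
```

The helper looks records up by id rather than by list position, so it does not depend on the order in which the results come back.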

tazarov avatar Jul 22 '24 18:07 tazarov

Closing due to inactivity for some time. @manitadayon if this is still a problem in Chroma v0.6.0 or later, feel free to open a new issue! We would need as much information as possible to reproduce the error.

itaismith avatar Jan 02 '25 22:01 itaismith