ChromaDB alters the order of metadata text compared to the original
What happened?
I am currently trying to embed a series of documents with ChromaDB. For each document I add an id, an embedding, and metadata (the original text) to the vector database. Assume the original documents are:
- The first car is blue.
- The second car is red.
- The third car is purple.
I save the embeddings, ids, and metadata (the original text above). However, the first row of embeddings and metadata corresponds to the second document, not the first. Do you know whether ChromaDB alters the order of rows or metadata during the saving process?
Versions
ChromaDB version: 0.5.3, OS: Linux, Python: 3.10
Relevant log output
No response
@manitadayon, we haven't seen this behaviour. Let me confirm your use case is similar to the code below:
import chromadb
client = chromadb.PersistentClient('test_2552')
collection = client.get_or_create_collection("test_2552")
collection.add(
    ids=["blue","red","purple"],
    documents=["The first car is blue","The second car is red","The third car is purple"],
    metadatas=[{"color":"blue"},{"color":"red"},{"color":"purple"}]
)
res=collection.query(query_texts=["Which car is red?"])
print(res)
And what you are experiencing is that the metadata, e.g., color for red car, somehow ends up in blue?
No, it is more like this: you pass documents=["The first car is blue","The second car is red","The third car is purple"] as the input for the embedding and immediately save the result in ChromaDB using collection.add, similar to what you have. The embeddings themselves look good, meaning each document is embedded correctly. However, the order of the metadata, and consequently of the embeddings, differs from the input order, as if ChromaDB changes the ordering during insertion.
So let's say I pass the following as input, in this order:
documents=["The first car is blue","The second car is red","The third car is purple"]
What I get for the metadata is like this:
metadatas=[{"text":"The first car is red"},{"text":"The first car is purple"},{"text":"The first car is blue"}]
What I add to ChromaDB is the ids, embeddings, and metadata, as follows:
import chromadb
client = chromadb.PersistentClient('test_2552')
collection = client.get_or_create_collection("test_2552")
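collection.add(
    ids=["blue","red","purple"],
    # embeddings must be passed by keyword: one precomputed vector per id, in the same order
    embeddings=embeddings,
    metadatas=[{"text":"document"},{"text":"document"},{"text":"document"}]
)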
@manitadayon,
OK, so in theory the following code should match what you have:
import chromadb
from chromadb.utils.embedding_functions import DefaultEmbeddingFunction
ef = DefaultEmbeddingFunction()
documents=["The first car is blue","The second car is red","The third car is purple"]
embeddings = ef(documents)
client = chromadb.PersistentClient('test_2552_v2')
collection = client.get_or_create_collection("test_2552")
collection.add(
    ids=["blue","red","purple"],
    embeddings=embeddings,
    metadatas=[{"text":"The first car is blue"},{"text":"The second car is red"},{"text":"The third car is purple"}]
)
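Running both get() and query() on this collection returns each metadata entry alongside its correct id. The rows do not necessarily come back in insertion order (get() returns 'blue', 'purple', 'red' below), but ids and metadatas stay aligned position by position: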
collection.get()
{'ids': ['blue', 'purple', 'red'],
'embeddings': None,
'metadatas': [{'text': 'The first car is blue'},
{'text': 'The third car is purple'},
{'text': 'The second car is red'}],
'documents': [None, None, None],
'uris': None,
'data': None,
'included': ['metadatas', 'documents']}
collection.query(query_texts=["Which car is red?"])
{'ids': [['red', 'purple', 'blue']],
'distances': [[0.4380978786045734, 0.7631484670171297, 0.7817838278402404]],
'metadatas': [[{'text': 'The second car is red'},
{'text': 'The third car is purple'},
{'text': 'The first car is blue'}]],
'embeddings': None,
'documents': [[None, None, None]],
'uris': None,
'data': None,
'included': ['metadatas', 'documents', 'distances']}
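Because the results come back aligned by position but not necessarily in insertion order, it may be safer to pair rows by id instead of by index. The following is a minimal sketch that works on a plain dictionary shaped like the get() output shown above; metadata_by_id and in_insertion_order are hypothetical helper names, not part of the ChromaDB API.

```python
from typing import Any, Dict, List, Sequence


def metadata_by_id(result: Dict[str, Any]) -> Dict[str, dict]:
    """Map each id in a get()-style result to the metadata at the same position."""
    return dict(zip(result["ids"], result["metadatas"]))


def in_insertion_order(result: Dict[str, Any], inserted_ids: Sequence[str]) -> List[dict]:
    """Reorder the returned metadata to follow the order the ids were inserted in."""
    lookup = metadata_by_id(result)
    return [lookup[doc_id] for doc_id in inserted_ids]


# The get() output from this thread, trimmed to the two fields used here.
get_result = {
    "ids": ["blue", "purple", "red"],
    "metadatas": [
        {"text": "The first car is blue"},
        {"text": "The third car is purple"},
        {"text": "The second car is red"},
    ],
}

ordered = in_insertion_order(get_result, ["blue", "red", "purple"])
print(ordered)
```

Note that query() nests its lists one level deeper (one inner list per query text), so the same pairing would apply to result["ids"][0] and result["metadatas"][0] for the first query.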
Closing due to inactivity for some time. @manitadayon if this is still a problem in Chroma v0.6.0 or later, feel free to open a new issue! We would need as much information as possible to reproduce the error.