Update with duplicate IDs fails
`add` allows duplicate / redundant IDs. When `update` is then called and finds redundant IDs, it fails with an obtuse error. This is an artifact of allowing redundant IDs; we shouldn't do this.
Clarification - is this the case of a single `add()` not checking for duplicates internal to a call, or separate `add()` calls each with the same id?
Not requiring unique IDs during `add`/`update` definitely leads to some initially confusing results. Is there some documentation on the rationale for allowing duplicate IDs?
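For what it's worth, both cases appear to go through, per the reports below. Here is a minimal sketch of the distinction (duplicates within one `add()` call versus the same ID across two calls), using the default in-memory client; the collection name is arbitrary:

```python
import chromadb

collection = chromadb.Client().create_collection(name="dup_demo")

# Case 1: duplicate IDs within a single add() call.
collection.add(embeddings=[[1.0, 2.0], [3.0, 4.0]], ids=["id1", "id1"])

# Case 2: the same ID across two separate add() calls.
collection.add(embeddings=[[5.0, 6.0]], ids=["id2"])
collection.add(embeddings=[[7.0, 8.0]], ids=["id2"])

# Neither case raises; the collection now holds two records per ID.
print(collection.get()["ids"])  # e.g. ['id1', 'id1', 'id2', 'id2']
```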
Here is a simple repro to illustrate the current behavior:
```python
from pprint import pprint

import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection", get_or_create=True)

collection.add(
    documents=["First doc (id1)", "Second doc (id2)", "Third doc (id1 - duplicate ID)"],
    metadatas=[
        {"source": "my_source"},
        {"source": "my_source"},
        {"source": "my_source"},
    ],
    ids=["id1", "id2", "id1"],
)

print('\ncollection.get(limit=3) -- duplicate IDs and unique docs are preserved')
results = collection.get(limit=3)
pprint(results)

print('\nquery: "My Docs (3 results)" -- duplicate IDs and unique docs are preserved')
results = collection.query(query_texts=["My docs"], n_results=3)
pprint(results)

collection.update(
    documents=["First doc (id1)", "Second doc (id2)", "Third doc (id1 - duplicate ID)"],
    metadatas=[
        {"source": "new_source1"},
        {"source": "new_source2"},
        {"source": "new_source3"},
    ],
    ids=["id1", "id2", "id1"],
)

print('\ncollection.get(limit=3) (3 documents, but First doc (id1) is gone because of update)')
results = collection.get(limit=3)
pprint(results)

print('\nquery: "My Docs (3 distances, but 2 documents)"')
results = collection.query(query_texts=["My docs"], n_results=3)
pprint(results)
```
Output:
```
collection.get(limit=3) -- duplicate IDs and unique docs are preserved
{'documents': ['First doc (id1)',
               'Second doc (id2)',
               'Third doc (id1 - duplicate ID)'],
 'embeddings': None,
 'ids': ['id1', 'id2', 'id1'],
 'metadatas': [{'source': 'my_source'},
               {'source': 'my_source'},
               {'source': 'my_source'}]}

query: "My Docs (3 results)" -- duplicate IDs and unique docs are preserved
{'distances': [[1.1522818803787231, 1.246869444847107, 1.5678625106811523]],
 'documents': [['First doc (id1)',
                'Second doc (id2)',
                'Third doc (id1 - duplicate ID)']],
 'embeddings': None,
 'ids': [['id1', 'id2', 'id1']],
 'metadatas': [[{'source': 'my_source'},
                {'source': 'my_source'},
                {'source': 'my_source'}]]}

collection.get(limit=3) (3 documents, but First doc (id1) is gone because of update)
{'documents': ['Second doc (id2)',
               'Third doc (id1 - duplicate ID)',
               'Third doc (id1 - duplicate ID)'],
 'embeddings': None,
 'ids': ['id2', 'id1', 'id1'],
 'metadatas': [{'source': 'new_source2'},
               {'source': 'new_source3'},
               {'source': 'new_source3'}]}

query: "My Docs (3 distances, but 2 documents)"
{'distances': [[1.1522818803787231, 1.1522818803787231, 1.246869444847107]],
 'documents': [['Third doc (id1 - duplicate ID)', 'Second doc (id2)']],
 'embeddings': None,
 'ids': [['id1', 'id2']],
 'metadatas': [[{'source': 'new_source3'}, {'source': 'new_source2'}]]}
```
`update()` appears to delete items with duplicate IDs. Is this intended behavior?
```python
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection")

collection.add(
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id1"],
)

results = collection.query(
    query_embeddings=[1.2, 2.3, 4.5],
    n_results=2,
)
print(results)

collection.update(
    embeddings=[[1.2, 2.3, 4.6], [6.7, 8.2, 9.2]],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "n_my_source"}, {"source": "n_my_source"}],
    ids=["id1", "id1"],
)

results = collection.query(
    query_embeddings=[1.2, 2.3, 4.5],
    n_results=2,
)
print(results)
```
Output:
```
Using embedded DuckDB without persistence: data will be transient
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
{'ids': [['id1', 'id1']], 'embeddings': None, 'documents': [['This is a document', 'This is another document']], 'metadatas': [[{'source': 'my_source'}, {'source': 'my_source'}]], 'distances': [[0.0, 87.14999389648438]]}
{'ids': [['id1']], 'embeddings': None, 'documents': [['This is another document']], 'metadatas': [[{'source': 'n_my_source'}]], 'distances': [[0.0, 0.00999998115003109]]}
```
@atroyn - I came to report the same thing as a bug:

> This is an artifact of allowing redundant IDs, we shouldn't do this.

In my testing, `add()` operations do not trigger any error when duplicates are added with the same key. For my use case, volumes are not high and I'm only inserting one row at a time. I'd love to see any one of these as methods I can use to make "adds" idempotent:
- `add()` raises an exception if the `id` value already exists. (In which case, my code would catch the exception and then send an `update()` call.)
- `update()` has an option for `create_if_missing`. (In which case, I'll just send everything to "update".)
- A new method `add_or_update()` (or `upsert()`) that handles checking if something exists before adding. (Same as above, but can be added while keeping existing methods unchanged.)
These are not mutually exclusive options - but any one of them would be sufficient for my immediate use case. (Today, I'm nuking the whole database and reloading because I don't have any good way to prevent duplicates or remove them once they're in the vectorstore.)
The impact for me is that I'm getting very bad retrievals for my LLM, since the top answer is returned several times instead of once, and my LLM can't benefit from the contexts of the 3rd or 4th query results, which get suppressed by multiple copies of the first and second.
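For illustration, here is a minimal sketch of the third option above, implemented client-side with the existing `get()`/`add()`/`update()` calls. The `add_or_update()` helper is hypothetical, and the check-then-write sequence is not atomic, so it only makes single-writer workloads like the one described above idempotent:

```python
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection")

def add_or_update(collection, id_, embedding, document, metadata):
    # Hypothetical helper: get() with an explicit ID returns only records
    # that already exist, so an empty result means it is safe to add().
    existing = collection.get(ids=[id_])
    if existing["ids"]:
        collection.update(ids=[id_], embeddings=[embedding],
                          documents=[document], metadatas=[metadata])
    else:
        collection.add(ids=[id_], embeddings=[embedding],
                       documents=[document], metadatas=[metadata])

# Calling twice with the same ID leaves exactly one record instead of two.
add_or_update(collection, "id1", [1.2, 2.3, 4.5], "This is a document", {"source": "my_source"})
add_or_update(collection, "id1", [1.2, 2.3, 4.6], "This is a document", {"source": "new_source"})
print(collection.get(ids=["id1"]))  # one record, carrying the updated metadata
```

A server-side `upsert` would avoid both the extra round trip and the read-then-write race.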
@aaronsteers, `upsert` and additional validations are currently in progress. We're updating our test suite to make those changes robust, coming soon!
@atroyn - fantastic news! Thanks for the update!
Thanks for this! Been struggling with duplicates in our data store for a while. Happy to help test if needed.
Hi everyone,
- `add` will now fail with duplicate IDs (pending today's release)
- `upsert` is now also added
closing this, thanks for the help everyone
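For reference, a hedged sketch of what the announced behavior looks like. The exact exception type `add` raises on duplicate IDs is an assumption here, and `upsert` is assumed to mirror `add`'s signature:

```python
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection")

# With the fix, duplicate IDs within a single add() call are rejected.
try:
    collection.add(embeddings=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], ids=["id1", "id1"])
except Exception as err:  # the concrete exception class is an assumption
    print(f"add() rejected duplicate IDs: {err}")

# upsert() inserts new IDs and updates existing ones in a single call.
collection.add(embeddings=[[1.0, 2.0, 3.0]], ids=["id1"])
collection.upsert(
    embeddings=[[1.5, 2.5, 3.5], [4.0, 5.0, 6.0]],
    ids=["id1", "id2"],
)
print(collection.get()["ids"])  # one record per ID, e.g. ['id1', 'id2']
```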