chroma Update with duplicate IDs fails

Because add allows duplicate / redundant IDs. When update gets called and finds redundant IDs, it fails with an obtuse error.

This is an artifact of allowing redundant IDs, we shouldn't do this.

Apr 05 '23 19:04 atroyn

Clarification - is this the case of a single add() not checking for duplicates internal to a call, or separate adds each with the same id?

Apr 05 '23 19:04 HammadB

Not requiring unique IDs during add/update definitely leads to some initially confusing results. Is there some documentation on the rationale for allowing duplicate IDs?

Here is a simple repro to illustrate the current behavior:

from pprint import pprint
import chromadb

chroma_client = chromadb.Client()

collection = chroma_client.create_collection(name="my_collection", get_or_create=True)

collection.add(
    documents=["First doc (id1)", "Second doc (id2)", "Third doc (id1 - duplicate ID)"],
    metadatas=[
        {"source": "my_source"},
        {"source": "my_source"},
        {"source": "my_source"},
    ],
    ids=["id1", "id2", "id1"],
)

print('\ncollection.get(limit=3) -- duplicate IDs and unique docs are preserved')
results = collection.get(limit=3)
pprint(results)

print('\nquery: "My Docs (3 results)" -- duplicate IDs and unique docs are preserved')
results = collection.query(query_texts=["My docs"], n_results=3)
pprint(results)

collection.update(
    documents=["First doc (id1)", "Second doc (id2)", "Third doc (id1 - duplicate ID)"],
    metadatas=[
        {"source": "new_source1"},
        {"source": "new_source2"},
        {"source": "new_source3"},
    ],
    ids=["id1", "id2", "id1"],
)

print('\ncollection.get(limit=3) (3 documents, but First doc (id1) is gone because of update)')
results = collection.get(limit=3)
pprint(results)

print('\nquery: "My Docs (3 distances, but 2 documents)"')
results = collection.query(query_texts=["My docs"], n_results=3)
pprint(results)

Output:

collection.get(limit=3) -- duplicate IDs and unique docs are preserved
{'documents': ['First doc (id1)',
               'Second doc (id2)',
               'Third doc (id1 - duplicate ID)'],
 'embeddings': None,
 'ids': ['id1', 'id2', 'id1'],
 'metadatas': [{'source': 'my_source'},
               {'source': 'my_source'},
               {'source': 'my_source'}]}

query: "My Docs (3 results)" -- duplicate IDs and unique docs are preserved
{'distances': [[1.1522818803787231, 1.246869444847107, 1.5678625106811523]],
 'documents': [['First doc (id1)',
                'Second doc (id2)',
                'Third doc (id1 - duplicate ID)']],
 'embeddings': None,
 'ids': [['id1', 'id2', 'id1']],
 'metadatas': [[{'source': 'my_source'},
                {'source': 'my_source'},
                {'source': 'my_source'}]]}

collection.get(limit=3) (3 documents, but First doc (id1) is gone because of update)
{'documents': ['Second doc (id2)',
               'Third doc (id1 - duplicate ID)',
               'Third doc (id1 - duplicate ID)'],
 'embeddings': None,
 'ids': ['id2', 'id1', 'id1'],
 'metadatas': [{'source': 'new_source2'},
               {'source': 'new_source3'},
               {'source': 'new_source3'}]}

query: "My Docs (3 distances, but 2 documents)"
{'distances': [[1.1522818803787231, 1.1522818803787231, 1.246869444847107]],
 'documents': [['Third doc (id1 - duplicate ID)', 'Second doc (id2)']],
 'embeddings': None,
 'ids': [['id1', 'id2']],
 'metadatas': [[{'source': 'new_source3'}, {'source': 'new_source2'}]]}

Apr 05 '23 22:04 PaulMest

update() appears to delete items with duplicate ids. Is this intended behavior?

import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection")

collection.add(
embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
documents=["This is a document", "This is another document"],
metadatas=[{"source": "my_source"}, {"source": "my_source"}],
ids=["id1", "id1"]
)

results = collection.query(
query_embeddings=[1.2, 2.3, 4.5],
n_results=2
)
print(results)

collection.update(
embeddings=[[1.2, 2.3, 4.6], [6.7, 8.2, 9.2]],
documents=["This is a document", "This is another document"],
metadatas=[{"source": "n_my_source"}, {"source": "n_my_source"}],
ids=["id1", "id1"]
)

results = collection.query(
query_embeddings=[1.2, 2.3, 4.5],
n_results=2
)

print(results)

output:

Using embedded DuckDB without persistence: data will be transient
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
{'ids': [['id1', 'id1']], 'embeddings': None, 'documents': [['This is a document', 'This is another document']], 'metadatas': [[{'source': 'my_source'}, {'source': 'my_source'}]], 'distances': [[0.0, 87.14999389648438]]}
{'ids': [['id1']], 'embeddings': None, 'documents': [['This is another document']], 'metadatas': [[{'source': 'n_my_source'}]], 'distances': [[0.0, 0.00999998115003109]]}

Apr 11 '23 18:04 theoskille

@atroyn - I came to report the same thing as a bug:

This is an artifact of allowing redundant IDs, we shouldn't do this.

In my testing, add() operations do not trigger any error when duplicates are added with the same key. For my use case, volumes are not high and I'm only inserting one row at a time. I'd love to see any one of these as methods I can use to make "adds" idempotent:

add() raises an exception if the id value already exists. (In which case, my code would catch the exception and then send an update() call.)
update() has an option for create_if_missing. (In which case, I'll just send everything to "update".)
A new method add_or_update() (or upsert()) that handles checking if something exists before adding. (Same as above, but can be added while keeping existing methods unchanged.)

These are not mutually exclusive options - but any one of them would be sufficient for my immediate use case. (Today, I'm nuking the whole database and reloading because I don't have any good way to prevent duplicates or remove them once they're in the vectorstore.)

The impact for me is that I'm getting very bad retrievals for my LLM, since the top answer is returned several times instead of once, and my LLM can't benefit from contexts of the 3rd or 4th query result which might be suppressed by multiple copies of the first and second.