Update with duplicate IDs fails

atroyn opened this issue 1 year ago • 7 comments

add allows duplicate / redundant IDs, so when update gets called and encounters redundant IDs, it fails with an obtuse error.

This is an artifact of allowing redundant IDs; we shouldn't do this.

atroyn commented Apr 05 '23 19:04

Clarification - is this the case of a single add() not checking for duplicates internal to a call, or separate adds each with the same id?
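
For concreteness, the two cases would look like this (a minimal sketch against the Python client; the collection name and documents are made up):

import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="dup_demo")

# Case A: duplicate ids internal to a single add() call
collection.add(documents=["doc a", "doc b"], ids=["id1", "id1"])

# Case B: separate add() calls that reuse the same id
collection.add(documents=["doc c"], ids=["id2"])
collection.add(documents=["doc d"], ids=["id2"])

Per the reports in this thread, neither case currently raises; the repro below exercises the single-call variant.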

HammadB commented Apr 05 '23 19:04

Not requiring unique IDs during add/update definitely leads to some initially confusing results. Is there some documentation on the rationale for allowing duplicate IDs?

Here is a simple repro to illustrate the current behavior:

from pprint import pprint
import chromadb

chroma_client = chromadb.Client()

collection = chroma_client.create_collection(name="my_collection", get_or_create=True)

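# Note: "id1" appears twice in ids below; add() currently accepts this silently.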
collection.add(
    documents=["First doc (id1)", "Second doc (id2)", "Third doc (id1 - duplicate ID)"],
    metadatas=[
        {"source": "my_source"},
        {"source": "my_source"},
        {"source": "my_source"},
    ],
    ids=["id1", "id2", "id1"],
)

print('\ncollection.get(limit=3) -- duplicate IDs and unique docs are preserved')
results = collection.get(limit=3)
pprint(results)

print('\nquery: "My Docs (3 results)" -- duplicate IDs and unique docs are preserved')
results = collection.query(query_texts=["My docs"], n_results=3)
pprint(results)

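# update() applies the last value supplied for a repeated id, so both stored
# "id1" rows end up with the third document's content (see output below).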
collection.update(
    documents=["First doc (id1)", "Second doc (id2)", "Third doc (id1 - duplicate ID)"],
    metadatas=[
        {"source": "new_source1"},
        {"source": "new_source2"},
        {"source": "new_source3"},
    ],
    ids=["id1", "id2", "id1"],
)

print('\ncollection.get(limit=3) (3 documents, but First doc (id1) is gone because of update)')
results = collection.get(limit=3)
pprint(results)

print('\nquery: "My Docs (3 distances, but 2 documents)"')
results = collection.query(query_texts=["My docs"], n_results=3)
pprint(results)

Output:

collection.get(limit=3) -- duplicate IDs and unique docs are preserved
{'documents': ['First doc (id1)',
               'Second doc (id2)',
               'Third doc (id1 - duplicate ID)'],
 'embeddings': None,
 'ids': ['id1', 'id2', 'id1'],
 'metadatas': [{'source': 'my_source'},
               {'source': 'my_source'},
               {'source': 'my_source'}]}

query: "My Docs (3 results)" -- duplicate IDs and unique docs are preserved
{'distances': [[1.1522818803787231, 1.246869444847107, 1.5678625106811523]],
 'documents': [['First doc (id1)',
                'Second doc (id2)',
                'Third doc (id1 - duplicate ID)']],
 'embeddings': None,
 'ids': [['id1', 'id2', 'id1']],
 'metadatas': [[{'source': 'my_source'},
                {'source': 'my_source'},
                {'source': 'my_source'}]]}

collection.get(limit=3) (3 documents, but First doc (id1) is gone because of update)
{'documents': ['Second doc (id2)',
               'Third doc (id1 - duplicate ID)',
               'Third doc (id1 - duplicate ID)'],
 'embeddings': None,
 'ids': ['id2', 'id1', 'id1'],
 'metadatas': [{'source': 'new_source2'},
               {'source': 'new_source3'},
               {'source': 'new_source3'}]}

query: "My Docs (3 distances, but 2 documents)"
{'distances': [[1.1522818803787231, 1.1522818803787231, 1.246869444847107]],
 'documents': [['Third doc (id1 - duplicate ID)', 'Second doc (id2)']],
 'embeddings': None,
 'ids': [['id1', 'id2']],
 'metadatas': [[{'source': 'new_source3'}, {'source': 'new_source2'}]]}

PaulMest commented Apr 05 '23 22:04

update() appears to delete items with duplicate ids. Is this intended behavior?

import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="my_collection")

collection.add(
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id1"],
)

results = collection.query(
    query_embeddings=[[1.2, 2.3, 4.5]],
    n_results=2,
)
print(results)

collection.update(
    embeddings=[[1.2, 2.3, 4.6], [6.7, 8.2, 9.2]],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "n_my_source"}, {"source": "n_my_source"}],
    ids=["id1", "id1"],
)

results = collection.query(
    query_embeddings=[[1.2, 2.3, 4.5]],
    n_results=2,
)

print(results)

Output:

Using embedded DuckDB without persistence: data will be transient
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
{'ids': [['id1', 'id1']], 'embeddings': None, 'documents': [['This is a document', 'This is another document']], 'metadatas': [[{'source': 'my_source'}, {'source': 'my_source'}]], 'distances': [[0.0, 87.14999389648438]]}
{'ids': [['id1']], 'embeddings': None, 'documents': [['This is another document']], 'metadatas': [[{'source': 'n_my_source'}]], 'distances': [[0.0, 0.00999998115003109]]}
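
Until validation lands, one workaround is to deduplicate inputs client-side before calling add() or update(). A minimal sketch, assuming the last value supplied for a repeated id should win (the helper name dedupe_by_id is hypothetical):

def dedupe_by_id(ids, documents, metadatas):
    # Keep the last value seen for each id; dicts preserve insertion order.
    merged = {id_: (doc, meta) for id_, doc, meta in zip(ids, documents, metadatas)}
    deduped_ids = list(merged)
    return (
        deduped_ids,
        [merged[i][0] for i in deduped_ids],
        [merged[i][1] for i in deduped_ids],
    )

ids, docs, metas = dedupe_by_id(
    ["id1", "id1"],
    ["This is a document", "This is another document"],
    [{"source": "n_my_source"}, {"source": "n_my_source"}],
)
# ids == ["id1"], docs == ["This is another document"]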

theoskille commented Apr 11 '23 18:04

@atroyn - I came to report the same thing as a bug:

This is an artifact of allowing redundant IDs, we shouldn't do this.

In my testing, add() does not raise any error when an entry is added with an id that already exists. For my use case, volumes are not high and I'm only inserting one row at a time. I'd love to see any one of the following methods so I can make "adds" idempotent:

  1. add() raises an exception if the id value already exists. (In which case, my code would catch the exception and then send an update() call.)
  2. update() has an option for create_if_missing. (In which case, I'll just send everything to "update".)
  3. A new method add_or_update() (or upsert()) that handles checking if something exists before adding. (Same as above, but can be added while keeping existing methods unchanged.)

These are not mutually exclusive options - but any one of them would be sufficient for my immediate use case. (Today, I'm nuking the whole database and reloading because I don't have any good way to prevent duplicates or remove them once they're in the vectorstore.)

The impact for me is very bad retrievals for my LLM: the top answer is returned several times instead of once, and the 3rd or 4th query results, which might add useful context, are crowded out by the duplicate copies of the first and second.
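
In the meantime, option 3 can be approximated client-side with a check-then-write helper. A sketch, assuming low volume and a single writer (the helper name add_or_update is hypothetical, and the pattern is not atomic):

def add_or_update(collection, id, document, metadata):
    # Route to update() if the id already exists, otherwise add().
    existing = collection.get(ids=[id])
    if existing["ids"]:
        collection.update(ids=[id], documents=[document], metadatas=[metadata])
    else:
        collection.add(ids=[id], documents=[document], metadatas=[metadata])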

aaronsteers commented Apr 12 '23 17:04

@aaronsteers, upsert and additional validations are currently in progress. We're updating our test suite to make those changes robust; coming soon!

atroyn commented Apr 12 '23 20:04

@atroyn - fantastic news! Thanks for the update!

aaronsteers commented Apr 12 '23 20:04

Thanks for this! Been struggling with duplicates in our data store for a while. Happy to help test if needed.

timothyasp commented Apr 13 '23 16:04

Hi everyone,

  • add will now fail with duplicate ids (pending today's release)
  • upsert is now also added (minimal usage sketch below)
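
A minimal usage sketch of the new call, assuming upsert takes the same arguments as add and overwrites any existing rows with the same id:

collection.upsert(
    ids=["id1"],
    documents=["Replacement doc for id1"],
    metadatas=[{"source": "new_source"}],
)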

closing this, thanks for the help everyone

jeffchuber commented May 08 '23 17:05