[Bug]: Re-inserting records leads to log messages on every subsequent operation

Open mattpovey opened this issue 1 year ago • 13 comments

What happened?

Re-inserting records without embeddings (i.e. requiring Chroma to generate the embeddings) causes them to be held in the embeddings_queue table of chromadb.sqlite3. On every subsequent operation, log messages appear as Chroma (presumably) attempts to insert the already existing records:

Add of existing embedding ID: 21
Insert of existing embedding ID: 21

To replicate, attempt to re-insert a record keeping the id, documents and metadata identical. If done n times (e.g. via a broken loop that increments record numbers incorrectly, which is how I did it), the row is added n times to the embeddings_queue table.

The issue is easily worked-around by deleting the records from embeddings_queue:

cur.execute("DELETE FROM embeddings_queue;")
conn.commit()
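
Before deleting anything, it may be worth checking which IDs are actually queued more than once. The following is only a sketch and is not taken from the Chroma code base: it assumes the queue table is called embeddings_queue (as above) and has a column named id holding the record ID, and the helper name find_duplicate_queue_ids is hypothetical. The demonstration runs against an in-memory stand-in table rather than a real Chroma database.

```python
import sqlite3


def find_duplicate_queue_ids(conn):
    """Return {record_id: row_count} for IDs that are queued more than once.

    Assumes a table `embeddings_queue` with an `id` column. This is an
    assumption: check your own schema first, for example with
    `PRAGMA table_info(embeddings_queue)`.
    """
    rows = conn.execute(
        "SELECT id, COUNT(*) FROM embeddings_queue "
        "GROUP BY id HAVING COUNT(*) > 1 ORDER BY id"
    ).fetchall()
    return {record_id: count for record_id, count in rows}


# Stand-in table for demonstration only (not the real Chroma schema).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE embeddings_queue (seq_id INTEGER PRIMARY KEY, id TEXT NOT NULL)"
)
demo.executemany(
    "INSERT INTO embeddings_queue (id) VALUES (?)",
    [("21",), ("21",), ("21",), ("22",)],
)

duplicates = find_duplicate_queue_ids(demo)
print(duplicates)
```

To run the same query against a real database, pass a connection opened on a copy of the sqlite file, so that the original is left untouched while you inspect it.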

The records appear to be added to the table in this file, https://github.com/chroma-core/chroma/blob/3ed229ccb9280ca2569525bf65f4bc218c36a745/chromadb/db/mixins/embeddings_queue.py#L88

The warnings are raised in:

https://github.com/chroma-core/chroma/blob/3ed229ccb9280ca2569525bf65f4bc218c36a745/chromadb/segment/impl/vector/local_hnsw.py#L303

and

https://github.com/chroma-core/chroma/blob/3ed229ccb9280ca2569525bf65f4bc218c36a745/chromadb/segment/impl/metadata/sqlite.py#L216

and

https://github.com/chroma-core/chroma/blob/3ed229ccb9280ca2569525bf65f4bc218c36a745/chromadb/segment/impl/vector/local_persistent_hnsw.py#L242

If there are data-loss risks associated with just dropping the duplicate queue entries, perhaps add an explanation of how to delete the offending records to the warnings?

Versions

Chroma v0.4.2 MacOS Ventura.

Relevant log output

RECEIVED WHEN QUERYING ETC.

Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 5
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 9
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 17
Add of existing embedding ID: 18
Add of existing embedding ID: 19
Add of existing embedding ID: 20
Add of existing embedding ID: 21
Add of existing embedding ID: 22
Add of existing embedding ID: 23
Add of existing embedding ID: 24
Add of existing embedding ID: 25
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 29
Add of existing embedding ID: 30
Add of existing embedding ID: 31
Add of existing embedding ID: 32
Add of existing embedding ID: 33
Add of existing embedding ID: 34
Add of existing embedding ID: 35
Add of existing embedding ID: 36
Add of existing embedding ID: 37
Add of existing embedding ID: 38
Add of existing embedding ID: 39
Add of existing embedding ID: 40
Add of existing embedding ID: 41
Add of existing embedding ID: 42
Add of existing embedding ID: 43
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 46
Add of existing embedding ID: 47
Add of existing embedding ID: 48
Add of existing embedding ID: 49
Add of existing embedding ID: 50
Add of existing embedding ID: 51

RECEIVED DURING THE ADD OPERATION WHICH CAUSED THE PROBLEM:

Add of existing embedding ID: 1
Insert of existing embedding ID: 1
Add of existing embedding ID: 2
Insert of existing embedding ID: 2
Add of existing embedding ID: 3
Insert of existing embedding ID: 3
Add of existing embedding ID: 4
Insert of existing embedding ID: 4
Add of existing embedding ID: 5
Insert of existing embedding ID: 5
Add of existing embedding ID: 6
Insert of existing embedding ID: 6
Add of existing embedding ID: 7
Insert of existing embedding ID: 7
Add of existing embedding ID: 8
Insert of existing embedding ID: 8
Add of existing embedding ID: 9
Insert of existing embedding ID: 9
Add of existing embedding ID: 10
Insert of existing embedding ID: 10
Add of existing embedding ID: 11
Insert of existing embedding ID: 11
Add of existing embedding ID: 12
Insert of existing embedding ID: 12
Add of existing embedding ID: 13
Insert of existing embedding ID: 13
Add of existing embedding ID: 14
Insert of existing embedding ID: 14
Add of existing embedding ID: 15
Insert of existing embedding ID: 15
Add of existing embedding ID: 16
Insert of existing embedding ID: 16
Add of existing embedding ID: 17
Insert of existing embedding ID: 17
Add of existing embedding ID: 18
Insert of existing embedding ID: 18
Add of existing embedding ID: 19
Insert of existing embedding ID: 19
Add of existing embedding ID: 20
Insert of existing embedding ID: 20
Add of existing embedding ID: 21
Insert of existing embedding ID: 21
Add of existing embedding ID: 22
Insert of existing embedding ID: 22
Add of existing embedding ID: 23
Insert of existing embedding ID: 23
Add of existing embedding ID: 24
Insert of existing embedding ID: 24
Add of existing embedding ID: 25
Insert of existing embedding ID: 25
Add of existing embedding ID: 26
Insert of existing embedding ID: 26
Add of existing embedding ID: 27
Insert of existing embedding ID: 27
Add of existing embedding ID: 28
Insert of existing embedding ID: 28
Add of existing embedding ID: 29
Insert of existing embedding ID: 29
Add of existing embedding ID: 30
Insert of existing embedding ID: 30
Add of existing embedding ID: 31
Insert of existing embedding ID: 31
Add of existing embedding ID: 32
Insert of existing embedding ID: 32
Add of existing embedding ID: 33
Insert of existing embedding ID: 33
Add of existing embedding ID: 34
Insert of existing embedding ID: 34
Add of existing embedding ID: 35
Insert of existing embedding ID: 35
Add of existing embedding ID: 36
Insert of existing embedding ID: 36
Add of existing embedding ID: 37
Insert of existing embedding ID: 37

mattpovey avatar Jul 24 '23 09:07 mattpovey

@HammadB can you take a look at this?

jeffchuber avatar Jul 24 '23 13:07 jeffchuber

Hi @mattpovey,

The embeddings queue stores all operations because it's designed to be an event log of user operations. I think we could definitely purge duplicate entries, but that's not something we intend to take on now.

The reason we blindly store all operations is that, in the distributed architecture of Chroma, we plan to back the embeddings_queue implementation with Pulsar, a proper message queue. However, we don't want to check entries for duplication before putting them on the queue, since this would negatively affect speed. The design, then, is to put things on the queue and let downstream indexing nodes decide whether or not there is a duplicate. In order to preserve API behavior, the local mode logs a warning if you add an existing embedding. However, it should only log when that specific ID is being added. Are you seeing the warning on any add?

Also, it should not log on query, I am unable to reproduce this behavior, can you share a reproduction?
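
The design described above (append to the queue without checking, let the downstream index detect duplicates) can be illustrated with a toy model. This is a sketch under stated assumptions, not Chroma's implementation: ToyQueue and ToyIndex are hypothetical names, and the warning text is simply copied from the log output in this issue.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("toy_queue")


@dataclass
class ToyIndex:
    """Stand-in for a downstream index that detects duplicate IDs."""

    vectors: dict = field(default_factory=dict)
    warnings: list = field(default_factory=list)

    def apply(self, record_id, vector):
        if record_id in self.vectors:
            # The duplicate is only noticed here, after it was already queued.
            message = f"Add of existing embedding ID: {record_id}"
            logger.warning(message)
            self.warnings.append(message)
            return
        self.vectors[record_id] = vector


@dataclass
class ToyQueue:
    """Stand-in for an append-only event log of user operations."""

    entries: list = field(default_factory=list)

    def submit(self, record_id, vector):
        # Deliberately no duplicate check: every operation is stored.
        self.entries.append((record_id, vector))

    def replay(self, index):
        for record_id, vector in self.entries:
            index.apply(record_id, vector)


queue = ToyQueue()
for _ in range(3):  # the same record is submitted three times
    queue.submit("21", [0.1, 0.2])

first_index = ToyIndex()
queue.replay(first_index)

# Replaying the same log into a fresh index repeats the same warnings.
second_index = ToyIndex()
queue.replay(second_index)
```

In this model the duplicates stay in the log, so any consumer that replays the log sees them again. Whether that is what happens in the reported case is not confirmed in this thread.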

HammadB avatar Jul 24 '23 22:07 HammadB

I think it is related https://github.com/chroma-core/chroma/issues/969

andrewshvv avatar Aug 12 '23 07:08 andrewshvv

@mattpovey any chance you have a repro here?

Also, it should not log on query, I am unable to reproduce this behavior, can you share a reproduction?

jeffchuber avatar Sep 06 '23 03:09 jeffchuber

127.0.0.1 - - [14/Sep/2023 00:42:15] "GET /post_url?url=https://alcova.com/4-simple-home-security-hacks/&tenantID=050C7CB8-1DD5-430E-AF6D-67F2B5161E0B HTTP/1.1" 200 -
Insert of existing embedding ID: 801
Add of existing embedding ID: 801
Insert of existing embedding ID: 802
Add of existing embedding ID: 802
Insert of existing embedding ID: 803
Add of existing embedding ID: 803
Insert of existing embedding ID: 804
Add of existing embedding ID: 804
Insert of existing embedding ID: 805
Add of existing embedding ID: 805
127.0.0.1 - - [14/Sep/2023 00:42:17] "GET /post_url?url=https://alcova.com/4-points-to-know-about-home-inspections/&tenantID=050C7CB8-1DD5-430E-AF6D-67F2B5161E0B HTTP/1.1" 200 -
Insert of existing embedding ID: 801
Add of existing embedding ID: 801
Insert of existing embedding ID: 802
Add of existing embedding ID: 802
Insert of existing embedding ID: 803
Add of existing embedding ID: 803
Insert of existing embedding ID: 804
Add of existing embedding ID: 804
Insert of existing embedding ID: 805
Add of existing embedding ID: 805
127.0.0.1 - - [14/Sep/2023 00:42:19] "GET /post_url?url=https://alcova.com/tips-to-pet-proof-your-home/&tenantID=050C7CB8-1DD5-430E-AF6D-67F2B5161E0B HTTP/1.1" 200 -
Insert of existing embedding ID: 801
Add of existing embedding ID: 801
Insert of existing embedding ID: 802
Add of existing embedding ID: 802
Insert of existing embedding ID: 803
Add of existing embedding ID: 803
Insert of existing embedding ID: 804
Add of existing embedding ID: 804
Insert of existing embedding ID: 805
Add of existing embedding ID: 805

simulanics avatar Sep 14 '23 04:09 simulanics

This is a tough bug to reproduce because it only seems to happen when items get stuck in the embedding_queue. I ran into this when I noticed that my documents were not being returned when using "query_texts" but were getting returned when using "where_document={"$contains": text}". I learned that the reason was that the embeddings were never getting added to the documents because the same document embedding was stuck in the embedding_queue. This resulted in a weird silent behavior where records stopped showing up in results because there were no embeddings.

collection.delete(where={'doc_id': doc_id})

collection.add(
    documents=contents,
    ids=ids,
    metadatas=metadatas,
)

results = collection.get(
    where={'doc_id': {"$in": [doc_id]}},
    include = ['embeddings', 'metadatas', 'documents'],
)

print('input id count: ', len(ids))
print('results embeddings count: ', len(results.get('embeddings', [])))
print('results ids count: ', len(results.get('ids', [])))

Output:

input id count: 10
results embedding count: 5
results ids count: 10

This would result in:

  • the documents being deleted successfully
  • the documents being added (with "Add of existing embedding ID" warnings shown on the Chroma Docker instance)
  • the get call returning all the documents, but many of them without any embedding.

Switching to a new collection and rerunning the script showed that all the embeddings were added correctly.

From what I've read, this is because the embedding_queue is "stuck or full", so when you try to re-embed the same document (with the same hash?), the embedding is never added because it's still stuck in the queue.

It would be helpful to have an error thrown when this happens, so I know the embeddings are not being added before I try to retrieve them. It would also be helpful to have some way to remove or retry the stuck items in the queue.
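
Until something like that exists, a client-side check along these lines could at least turn the silent mismatch into an error. This is a sketch only: check_embeddings_complete is a hypothetical helper, and it assumes the result is a dict with parallel "ids" and "embeddings" lists, as in the collection.get call above. The demonstration uses hand-made data mirroring the counts reported above (10 ids, 5 embeddings).

```python
def check_embeddings_complete(expected_ids, results):
    """Raise RuntimeError if `results` lacks an embedding for any expected ID.

    `results` is assumed to be shaped like the dict returned by the
    `collection.get(...)` call above: parallel "ids" and "embeddings" lists.
    """
    ids = results.get("ids")
    embeddings = results.get("embeddings")
    # Compare against None explicitly so array-like values are handled too.
    ids = [] if ids is None else list(ids)
    embeddings = [] if embeddings is None else list(embeddings)

    missing_ids = sorted(set(expected_ids) - set(ids))
    if missing_ids or len(embeddings) != len(ids):
        raise RuntimeError(
            f"expected {len(expected_ids)} ids, got {len(ids)} ids and "
            f"{len(embeddings)} embeddings; missing ids: {missing_ids}"
        )


# Hand-made data reproducing the mismatch: 10 ids but only 5 embeddings.
ids = [f"doc-{n}" for n in range(10)]
stale_results = {"ids": list(ids), "embeddings": [[0.0, 0.0]] * 5}

try:
    check_embeddings_complete(ids, stale_results)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(error_message)
```

A count comparison cannot tell which documents are missing their embeddings, only that some are, so it is a detection aid rather than a fix.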

wroscoe avatar Oct 06 '23 17:10 wroscoe

@wroscoe, thank you for the detailed analysis. This behaviour you describe was fixed in version 0.4.11+ https://github.com/chroma-core/chroma/releases/tag/0.4.11 (PR). I tried to reproduce it with the latest Chroma version but couldn't.

The issue you describe was due to a lack of checks in the BF (bruteforce index).

For completeness, here is a diagram explaining how Chroma WAL (write-ahead log or, as you referred to it, embedding queue) works.

image

For each collection, Chroma maintains two binary indices: Bruteforce (in-memory, fast) and HNSW lib (persisted to disk, slow when adding new vectors and persisting). As you can imagine, the BF index serves as a buffer that holds the portion of the WAL not yet committed to the persisted HNSW index. The HNSW index itself has a max sequence ID counter, stored in a metadata file, that indicates the position in the WAL from which buffering to the BF index should begin. This buffering usually happens when the collection is first accessed.

There are two transfer points (in the diagram, sync threshold) for BF to HNSW:

  • hnsw:batch_size - forces the BF vectors to be added to HNSW in-memory (this is a slow operation)
  • hnsw:sync_threshold - forces Chroma to dump the HNSW in-memory index to disk (this is a slow operation)

Both of the above sync points are controlled via collection-level metadata, using parameters with the same names. It is customary to set hnsw:sync_threshold > hnsw:batch_size.
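
The flow described above can be sketched as a toy model. This is not Chroma's code: ToyCollection and all of its attributes are hypothetical, plain dicts stand in for both indices, and the restart logic only follows the description given here (replay the WAL from the persisted max sequence ID into the BF buffer).

```python
class ToyCollection:
    """Toy model of the WAL -> BF buffer -> HNSW (memory) -> HNSW (disk) flow."""

    def __init__(self, batch_size, sync_threshold):
        self.batch_size = batch_size          # models hnsw:batch_size
        self.sync_threshold = sync_threshold  # models hnsw:sync_threshold
        self.wal = []          # every write in order; seq ID = position + 1
        self.bf_buffer = {}    # in-memory vectors not yet moved to HNSW
        self.hnsw_memory = {}  # stand-in for the in-memory HNSW index
        self.hnsw_disk = {}    # stand-in for the persisted HNSW index
        self.unsynced = 0      # vectors moved to HNSW since the last persist
        self.max_seq_id = 0    # WAL position covered by the persisted index

    def add(self, record_id, vector):
        self.wal.append((record_id, vector))
        self.bf_buffer[record_id] = vector
        if len(self.bf_buffer) >= self.batch_size:
            self._flush_batch()

    def _flush_batch(self):
        # First transfer point: BF buffer -> in-memory HNSW.
        self.hnsw_memory.update(self.bf_buffer)
        self.unsynced += len(self.bf_buffer)
        self.bf_buffer.clear()
        if self.unsynced >= self.sync_threshold:
            self._persist()

    def _persist(self):
        # Second transfer point: in-memory HNSW -> disk. The buffer is empty
        # here, so the persisted index covers the whole WAL written so far.
        self.hnsw_disk = dict(self.hnsw_memory)
        self.max_seq_id = len(self.wal)
        self.unsynced = 0

    def restart(self):
        # Simulate a fresh process: load the persisted index, then buffer the
        # WAL entries written after max_seq_id into the BF index.
        self.hnsw_memory = dict(self.hnsw_disk)
        self.bf_buffer = dict(self.wal[self.max_seq_id:])
        self.unsynced = 0


collection = ToyCollection(batch_size=2, sync_threshold=4)
for n in range(1, 6):
    collection.add(f"id{n}", [float(n)])

state_before = (
    len(collection.bf_buffer),
    len(collection.hnsw_memory),
    len(collection.hnsw_disk),
    collection.max_seq_id,
)
collection.restart()
buffered_after_restart = sorted(collection.bf_buffer)
print(state_before, buffered_after_restart)
```

With batch_size=2 and sync_threshold=4, the fifth record is still only in the WAL and the BF buffer when the model "restarts", so it is the one entry replayed from the WAL.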

tazarov avatar Jan 18 '24 15:01 tazarov

Can confirm that this issue still exists on 0.4.24. Is there any way to clear the embedding queue?

mcflem06 avatar Apr 01 '24 18:04 mcflem06

@mcflem06, we've found a bug and are working to fix it ASAP.

tazarov avatar Apr 24 '24 12:04 tazarov

Still getting "Add of existing embedding ID: xx" when querying a collection for the first time on chromadb==0.5.5 and 0.5.3.

Bourhano avatar Aug 29 '24 11:08 Bourhano

Yeah, what's up with this? What is the behavior that makes this happen?

rkrishnasanka avatar Sep 20 '24 06:09 rkrishnasanka

For completeness, here is a diagram explaining how Chroma WAL (write-ahead log or, as you referred to it, embedding queue) works.

@tazarov Hi, following the flow in the diagram, does it mean that each vector will be stored in two copies, one in the embedding_queue and one on disk via a persistent index? Are the vectors in the embedding_queue useless after adding the vectors to the index in memory?

Amphetaminewei avatar Sep 25 '24 09:09 Amphetaminewei

Any workaround to solve this issue? We are facing this on Chroma version 0.5.6.

RashmiSutrave avatar Sep 26 '24 11:09 RashmiSutrave