[Bug]: Embeddings Deletion Causes "Delete of nonexisting embedding ID"

Open mickey-lyx opened this issue 2 years ago • 15 comments

What happened?

Hi there, I uploaded two PDF files to a persistent collection and then deleted one of them. The delete produced warning messages: "Delete of nonexisting embedding ID". The warning only appears when I upload multiple files and then delete one of them. Here are my test files and code.

alphabet-2023-q1-10q.pdf Apple Inc.-10K.pdf

from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction


def main():
    # create collection
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection(name="test", embedding_function=OpenAIEmbeddingFunction())
    text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=100)

    # load document_1
    loader_1 = PyPDFLoader("alphabet-2023-q1-10q.pdf")
    documents_1 = loader_1.load()
    docs_1 = text_splitter.split_documents(documents_1)
    ids_1 = [str(i) for i in range(1, len(docs_1) + 1)]
    texts_1 = [split.page_content for split in docs_1]
    metadatas_1 = [split.metadata for split in docs_1]
    collection.add(ids=ids_1, metadatas=metadatas_1, documents=texts_1)

    # load document_2
    loader_2 = PyPDFLoader("Apple Inc.-10K.pdf")
    documents_2 = loader_2.load()
    docs_2 = text_splitter.split_documents(documents_2)
    ids_2 = [str(i) for i in range(47, len(docs_2) + 47)]
    texts_2 = [split.page_content for split in docs_2]
    metadatas_2 = [split.metadata for split in docs_2]
    collection.add(ids=ids_2, metadatas=metadatas_2, documents=texts_2)

    print(f"ids_1: {ids_1}")
    print(f"ids_2: {ids_2}")

    print("count before", collection.count())
    # delete document_1
    collection.delete(ids_1)
    print("count after", collection.count())


if __name__ == '__main__':
    main()

Versions

chromadb==0.4.5 langchain==0.0.264 python==3.10.12 MacOS==13.3.1

Relevant log output

ids_1: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46']
ids_2: ['47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107']
count before 107
Delete of nonexisting embedding ID: 1
Delete of nonexisting embedding ID: 2
Delete of nonexisting embedding ID: 3
Delete of nonexisting embedding ID: 4
Delete of nonexisting embedding ID: 5
Delete of nonexisting embedding ID: 6
Delete of nonexisting embedding ID: 7
Delete of nonexisting embedding ID: 8
Delete of nonexisting embedding ID: 9
Delete of nonexisting embedding ID: 10
Delete of nonexisting embedding ID: 11
Delete of nonexisting embedding ID: 12
Delete of nonexisting embedding ID: 13
Delete of nonexisting embedding ID: 14
Delete of nonexisting embedding ID: 15
Delete of nonexisting embedding ID: 16
Delete of nonexisting embedding ID: 17
Delete of nonexisting embedding ID: 18
Delete of nonexisting embedding ID: 19
Delete of nonexisting embedding ID: 20
Delete of nonexisting embedding ID: 21
Delete of nonexisting embedding ID: 22
Delete of nonexisting embedding ID: 23
Delete of nonexisting embedding ID: 24
Delete of nonexisting embedding ID: 25
Delete of nonexisting embedding ID: 26
Delete of nonexisting embedding ID: 27
Delete of nonexisting embedding ID: 28
Delete of nonexisting embedding ID: 29
Delete of nonexisting embedding ID: 30
Delete of nonexisting embedding ID: 31
Delete of nonexisting embedding ID: 32
Delete of nonexisting embedding ID: 33
Delete of nonexisting embedding ID: 34
Delete of nonexisting embedding ID: 35
Delete of nonexisting embedding ID: 36
Delete of nonexisting embedding ID: 37
Delete of nonexisting embedding ID: 38
Delete of nonexisting embedding ID: 39
Delete of nonexisting embedding ID: 40
Delete of nonexisting embedding ID: 41
Delete of nonexisting embedding ID: 42
Delete of nonexisting embedding ID: 43
Delete of nonexisting embedding ID: 44
Delete of nonexisting embedding ID: 45
Delete of nonexisting embedding ID: 46
count after 61

Process finished with exit code 0

mickey-lyx avatar Aug 15 '23 05:08 mickey-lyx

I have the problem too

qyzhizi avatar Aug 19 '23 18:08 qyzhizi

@tazarov Hi, could you please look at this problem? Thank you for your time!

mickey-lyx avatar Aug 21 '23 21:08 mickey-lyx

@mickey-lyx, thanks for reporting this. I'll take a look at this soon. At a glance, the code looks fine, and the actual result seems to be fine: you have 61 docs once you remove 46 from the starting 107. All in all, this seems like a warning, not an actual bug. I will have a look and let you know.

tazarov avatar Aug 21 '23 21:08 tazarov

@tazarov Really appreciate it. The result is right. I'm just wondering why there are warnings about deleting nonexisting embeddings. Is it because the embeddings were deleted multiple times?

mickey-lyx avatar Aug 22 '23 01:08 mickey-lyx

I have the same issue, and running queries on the db triggers this warning every time. What I did was select items with a where statement (no ID was given) and remove them one by one:

my_collection.delete(
            where={"file_id": str(file_id)}
        )

Since then the warning is shown every time I query it.

guyko81 avatar Aug 31 '23 23:08 guyko81

I'm having the same issue. This seems to occur even when an empty list is passed as ids to Collection.delete.

becklabs avatar Sep 03 '23 23:09 becklabs

We'd love to get this fixed - is anyone able to help post a minimal repro?

jeffchuber avatar Sep 06 '23 03:09 jeffchuber

@jeffchuber

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction


def main():
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection(name="test", embedding_function=OpenAIEmbeddingFunction())

    num_1 = 47
    num_2 = 70

    texts_1 = [f"text_1.{i}" for i in range(num_1)]
    ids_1 = [f"1.{i}" for i in range(num_1)]
    texts_2 = [f"text_2.{i}" for i in range(num_2)]
    ids_2 = [f"2.{i}" for i in range(num_2)]

    collection.add(ids=ids_1, documents=texts_1)
    collection.add(ids=ids_2, documents=texts_2)

    print("count before", collection.count())
    collection.delete(ids_1)
    print("count after", collection.count())


if __name__ == '__main__':
    main()

mickey-lyx avatar Sep 07 '23 17:09 mickey-lyx

I'm seeing similar warnings, but I'm unsure whether I should be concerned, since it's only a warning. It would be good to get some insight into why this occurs after uploading just a few PDF files, and why the log keeps filling with these lines even while the FastAPI server is idle.

112-49d5-a776-2c02c03897e8:77661df1-86bc-4f33-9119-a90d77f7c24e
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:484a228b-de38-4674-8f14-078f4f218afd
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:51c75801-6ecd-4490-941e-8ee6f2229476
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:282cb350-257b-49ef-ae55-ab3997099d58
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:fe9d8119-b72a-44c1-9bc5-f5c173621a4b
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:c92f759d-f0e7-46e9-9156-e5c47e917de7
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:5be4bf1c-7c02-4815-9c25-de4463b0231f
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:32500766-ceb7-4b12-8e8d-04b34306f30f
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:7e5d60fd-cb8a-4ecf-adf3-8d86694458e8
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:5cfbdc44-cc08-4749-8d5d-d628f6aa4676
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-
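
For readers who decide these lines are noise and only want them out of the log, here is a minimal sketch using Python's standard `logging` module. The logger name is taken from the output above; the class and function names are hypothetical. It assumes the code runs in the same process that emits the warning, so for a Docker deployment it would have to be applied inside the server process, not in the client. It hides the message and does nothing about the cause.

```python
import logging

# Logger name copied from the log output above.
NOISY_LOGGER = "chromadb.segment.impl.vector.brute_force_index"


class DropNonexistingDelete(logging.Filter):
    """Drop only the 'Delete of nonexisting embedding ID' records."""

    def filter(self, record):
        # Returning False discards the record; everything else passes through.
        return "Delete of nonexisting embedding ID" not in record.getMessage()


def quiet_benign_delete_warnings():
    """Attach the filter to the noisy logger and return that logger."""
    log = logging.getLogger(NOISY_LOGGER)
    log.addFilter(DropNonexistingDelete())
    return log
```

Filtering on the message text, instead of raising the logger's level, keeps any other warnings from the same module visible.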

package versions

chromadb==0.4.10 langchain==0.0.225

Running chroma client server with the latest Docker version

  chroma:
    container_name: chroma
    image: ghcr.io/chroma-core/chroma:latest
    volumes:
      - index_data:/chroma/chroma
    environment:
      - IS_PERSISTENT=true
      - CHROMA_SERVER_HTTP_PORT=8000
    restart: unless-stopped
    ports:
      - '8000:8000'
    networks:
      - mynetwork

timothymugayi avatar Sep 16 '23 15:09 timothymugayi

I have the same issue, and running queries on the db triggers this warning every time. What I did was select items with a where statement (no ID was given) and remove them one by one:

my_collection.delete(
            where={"file_id": str(file_id)}
        )

Since then the warning is shown every time I query it.

I am having this exact issue too

chrispangg avatar Sep 17 '23 21:09 chrispangg

@jeffchuber, @chrispangg, @timothymugayi, @mickey-lyx, as I mentioned above, the issue is benign. Chroma keeps a temporary index of embeddings and flushes it to disk once it reaches a certain threshold. In your example the threshold (100) is reached, so the temporary index is flushed and cleared, and subsequent entries are appended to it. When a delete comes right after the add, Chroma attempts to remove any and all of the requested embeddings from the temporary index, whether or not they are still in it, which leads to the warning you see. I have made a fix that checks whether the IDs to be removed are part of the temporary index; if they are not, Chroma will not attempt the deletion.
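
The mechanism described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Chroma's actual implementation: the class `ToyBufferedIndex`, its attributes, and the per-item flush are all hypothetical, and the threshold of 100 is simply the number mentioned in the comment.

```python
class ToyBufferedIndex:
    """Toy model of a temporary index that is flushed to a persisted store.

    Hypothetical illustration only; names and flush behaviour are assumptions.
    """

    def __init__(self, flush_threshold=100, check_membership=False):
        self.flush_threshold = flush_threshold
        self.check_membership = check_membership  # True models the described fix
        self.buffer = set()      # stands in for the temporary index
        self.persisted = set()   # stands in for the on-disk index
        self.warnings = []       # collected here instead of being logged

    def add(self, ids):
        for id_ in ids:
            self.buffer.add(id_)
            if len(self.buffer) >= self.flush_threshold:
                # Threshold reached: flush the buffer and clear it.
                self.persisted |= self.buffer
                self.buffer.clear()

    def delete(self, ids):
        for id_ in ids:
            if id_ in self.buffer:
                self.buffer.remove(id_)
            elif not self.check_membership:
                # Without the membership check, a blind delete against the
                # buffer produces the warning for already-flushed IDs.
                self.warnings.append(f"Delete of nonexisting embedding ID: {id_}")
            self.persisted.discard(id_)

    def count(self):
        return len(self.buffer) + len(self.persisted)


ids_1 = [str(i) for i in range(1, 47)]     # 46 IDs, as in the original report
ids_2 = [str(i) for i in range(47, 108)]   # 61 IDs

index = ToyBufferedIndex()
index.add(ids_1)
index.add(ids_2)
print("count before", index.count())
index.delete(ids_1)
print("warnings", len(index.warnings))
print("count after", index.count())
```

In this model the 100th added ID triggers the flush, so all of `ids_1` has left the buffer by the time the delete runs. Each of those IDs produces a warning while the final count is still correct, and constructing the index with `check_membership=True` gives the same count with no warnings.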

PR's on the way.

tazarov avatar Sep 18 '23 16:09 tazarov

@HammadB I think we can close this now.

tazarov avatar Oct 25 '23 09:10 tazarov

I think this issue is still present. I've just stumbled upon it in my application, and I'm using the latest version of Chroma (0.4.24), so the fix from #1150 should presumably already be included.

s-peryt avatar Apr 14 '24 05:04 s-peryt

I updated to chromadb==0.5.0, but the problem is still there. I run the update in a thread:

t = threading.Thread(target=mydb.add_collection_from_file, args=[local_f], daemon=True)
t.start()

running-frog avatar May 09 '24 21:05 running-frog

@running-frog, @s-peryt, we have a bug in the HNSW binary index that, under certain conditions, can result in the above errors. There is a PR, #2062, that should resolve this.

tazarov avatar May 12 '24 07:05 tazarov