[Bug]: Document deletion leaves records in database & folders on disk

Open dieharders opened this issue 10 months ago • 5 comments

What happened?

To remove documents from a collection I call collection.delete(ids=ids), but I still see data in the db, as well as the folder that is created when embeddings are made (it contains files like data_level0.bin and link_lists.bin; I don't know what these are).

I am noticing that embedding data in embedding_fulltext_search (and other records), including plain text, is being left behind. All the other document data has been removed, but this still remains. The folder with the db files also remains in my project's /chromadb dir.

Everything else in the db seems to be removed successfully except these two things. My concern is that over time, with multiple tenants, my server will run out of space, because text data is never removed from the db or the hard disk.

I am using PersistentClient not the HttpClient FYI. I am also using this in conjunction with Llama-index if that makes any diff.
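
One way to check what is left behind after a delete is to inspect the persist directory directly. The sketch below uses only the Python standard library. It assumes the directory holds a SQLite file (chroma.sqlite3 in the Chroma versions discussed here, which is an assumption about the on-disk layout rather than something stated in this report) next to one folder per index. The helper names table_row_counts and subdir_sizes are hypothetical.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite file.

    The file is opened read-only so a live database is not modified.
    A count of None means the table could not be read (for example a
    virtual table whose module is missing from this SQLite build).
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    counts = {}
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        ]
        for name in names:
            quoted = '"' + name.replace('"', '""') + '"'
            try:
                row = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
                counts[name] = row[0]
            except sqlite3.Error:
                counts[name] = None
    return counts


def subdir_sizes(persist_dir):
    """Return {folder_name: total_bytes} for each folder directly under persist_dir."""
    sizes = {}
    for child in Path(persist_dir).iterdir():
        if child.is_dir():
            sizes[child.name] = sum(
                f.stat().st_size for f in child.rglob("*") if f.is_file()
            )
    return sizes
```

Running both helpers before and after collection.delete(ids=ids), for example table_row_counts("chromadb/chroma.sqlite3") and subdir_sizes("chromadb") with paths adjusted to the actual project, shows which tables and folders do not shrink.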

Versions

Chroma 0.4.24, Python 3.12 but also seen on 3.11, also using llama-index 0.10.0

Relevant log output

No response

dieharders • Apr 25 '24 20:04

I know of this other issue: https://github.com/chroma-core/chroma/issues/1987. However, I am not getting any error message and I am still able to add new documents to the store.

dieharders • Apr 25 '24 20:04

After deleting all collections, I also see that collection_metadata is still in the database, with the data from all my previously deleted collections still there.

However, the collections record is wiped clean, and all other associated data (embeddings, etc.) seems to be deleted successfully.

dieharders • Apr 25 '24 21:04

Ok, so I noticed Chroma released version 0.5.0 a few days ago, so I decided to try it to see if it fixes this, and... it... does. The only catch is that llama-index doesn't support this version yet: llama-index-vector-stores-chroma 0.1.6 requires chromadb<0.5.0,>=0.4.22, but you have chromadb 0.5.0 which is incompatible.

But from what I can tell, when I delete a document the embedding_fulltext_search records are now gone! I suppose all there is to do now is wait for llama-index to support v0.5.0.

In the meantime, does anyone else who has this issue want to verify on v0.5.0?

dieharders • Apr 25 '24 21:04

hey @dieharders, thanks for the detailed description.

There are a whole bunch of issues you refer to in this, so let's break them down:

  • FTS (embedding_fulltext_search) index residuals: these were addressed in #1664 and #1689 (the latter in 0.5.0)
  • the .bin files are part of the HNSW binary index, and removing them requires client.delete_collection("collection_name"). Check the docs.
  • collection_metadata issue - yes, this is a known bug that is addressed in #1666 (hasn't been merged yet)

I'm making a PR for llama-index to support Chroma 0.5.0, as the new version does not seem to break the tests.

tazarov • Apr 29 '24 12:04

  • 👍
  • Ah yes you are right, when I delete all collections they are gone.
  • I will 👀 https://github.com/chroma-core/chroma/pull/1666

Good to hear thank you! Thank you for your hard work, I'm enjoying using the db :)

dieharders • Apr 30 '24 02:04

@dieharders sounds like I can close this issue. Please re-open if I've missed something. And glad we got you sorted :^)

beggers • May 01 '24 21:05