chroma
[Bug]: Document deletion leaves records in database & folders on disk
What happened?
To remove my documents from a collection I call: collection.delete(ids=ids)
but I am still seeing DB data as well as the folder that is created when embeddings are made (it has files like data_level0.bin and link_lists.bin; I don't know what these are).
I am noticing embedding data in embedding_fulltext_search (and other records) with plain text being left behind. All the other document data has been removed, but this still remains. Also, the folder with DB files remains in my project's /chromadb dir.
Everything else in the db seems to be removed successfully except these two things. My concern is over time with multiple tenants my server will run out of space as text data is never removed from the db or the hard disk.
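To put a number on the disk-space concern, a small standard-library sketch like the one below could list the folders that hold index files and their sizes. It only assumes that each such folder contains a data_level0.bin file, as described above; the function name index_dirs and the ./chromadb path in the usage comment are illustrative, not part of Chroma.

```python
import os

def index_dirs(persist_dir):
    """Map each folder containing data_level0.bin to the total size (bytes) of its files."""
    found = {}
    for root, _dirs, files in os.walk(persist_dir):
        # Assumption: a folder is an index folder if it holds data_level0.bin.
        if "data_level0.bin" in files:
            found[root] = sum(
                os.path.getsize(os.path.join(root, name)) for name in files
            )
    return found

# Hypothetical usage against the project's persist directory:
# for folder, size in index_dirs("./chromadb").items():
#     print(folder, size)
```

Running this before and after a delete shows whether any index folder actually went away.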
I am using PersistentClient not the HttpClient FYI. I am also using this in conjunction with Llama-index if that makes any diff.
Versions
Chroma 0.4.24, Python 3.12 but also seen on 3.11, also using llama-index 0.10.0
Relevant log output
No response
I know of this other issue: https://github.com/chroma-core/chroma/issues/1987 however I am not getting any error message and I am still able to add new documents to the store.
After deleting all collections I am also seeing that the collection_metadata table is still populated, with all the data from my previously deleted collections still there.
However, the collections table is wiped clean and all other associated data (embeddings, etc.) seems to be deleted successfully.
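For anyone who wants to reproduce this check, here is a minimal sketch that counts rows in the tables mentioned in this thread using Python's built-in sqlite3 module. It assumes the persistent store is a SQLite file (for PersistentClient this is usually chroma.sqlite3 inside the persist directory, but treat that path as an assumption); leftover_rows is a made-up helper name.

```python
import sqlite3

# Table names taken from this thread.
TABLES = ("embedding_fulltext_search", "collection_metadata", "collections")

def leftover_rows(db_path, tables=TABLES):
    """Return {table: row_count}, or None for a table that does not exist."""
    counts = {}
    con = sqlite3.connect(db_path)
    try:
        existing = {
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        }
        for table in tables:
            if table in existing:
                counts[table] = con.execute(
                    f'SELECT COUNT(*) FROM "{table}"'
                ).fetchone()[0]
            else:
                counts[table] = None
    finally:
        con.close()
    return counts

# Hypothetical usage (path is an assumption):
# print(leftover_rows("./chromadb/chroma.sqlite3"))
```

The helper only reads from the database, but it is safer to run it while nothing else is writing to the store.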
Ok, so I noticed Chroma released version 0.5.0 a few days ago, so I decided to try it to see if it fixes this and... it does, except that llama-index doesn't yet support this version: llama-index-vector-stores-chroma 0.1.6 requires chromadb<0.5.0,>=0.4.22, but you have chromadb 0.5.0 which is incompatible.
But from what I can tell, when I delete a document the embedding_fulltext_search records are now gone! I suppose all there is to do now is wait for llama-index to support v0.5.0.
In the meantime anyone else having this issue that wants to verify on v0.5.0 ?
hey @dieharders, thanks for the detailed description.
There are a whole bunch of issues you refer to in this, so let's break them down:
- FTS (embedding_fulltext_search) index residuals: these were addressed in #1664 and #1689 (the latter in 0.5.0)
- the .bin files are part of the HNSW binary index, and they require client.delete_collection("collection_name") to be removed. Check docs here
- collection_metadata issue: yes, this is a known bug that is addressed in #1666 (hasn't been merged yet)
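The difference between deleting documents and deleting the collection can be sketched roughly as follows. This is an illustration, not the recommended workflow: delete_documents and the drop_if_empty flag are made up for this example, and the code relies only on the get_collection, delete, count and delete_collection calls of the Chroma client API.

```python
def delete_documents(client, collection_name, ids, drop_if_empty=False):
    """Delete documents by id; optionally drop the collection once it is empty.

    Returns True if the collection itself was deleted.
    """
    collection = client.get_collection(collection_name)
    # Removes the documents, but the collection's HNSW files stay on disk.
    collection.delete(ids=ids)
    if drop_if_empty and collection.count() == 0:
        # Dropping the collection is the step that removes the .bin index files.
        client.delete_collection(collection_name)
        return True
    return False

# Hypothetical usage (path and names are assumptions):
# import chromadb
# client = chromadb.PersistentClient(path="./chromadb")
# delete_documents(client, "my_collection", ["doc-1"], drop_if_empty=True)
```

If another library such as llama-index holds a reference to the collection, dropping it may require recreating that library's vector store afterwards, so the flag is off by default.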
I'm making a PR for llama-index to support Chroma 0.5.0, as the new version does not seem to break the tests.
- 👍
- Ah yes you are right, when I delete all collections they are gone.
- I will 👀 https://github.com/chroma-core/chroma/pull/1666
Good to hear thank you! Thank you for your hard work, I'm enjoying using the db :)
@dieharders sounds like I can close this issue. Please re-open if I've missed something. And glad we got you sorted :^)