[Bug]: Upserting the same data causes the SQLite db to grow by 50-100%
What happened?
I'm using Chroma in a Python chat-style app to store what could be considered entities and to do RAG on a few hundred documents. This data is mostly static: it updates very rarely, and when it does, by very little. Think a few new entities/keywords every hour and/or a couple more articles for RAG per day. However, every time I run the import scripts, even at one-minute intervals, the SQLite DB grows by 50-100%. For example:
- run 1: from empty db to 35 MB
- run 2 (a few minutes later): 62 MB
- run 3 (a few mins later): 89 MB
- run 4 (a few mins later): 113 MB
I haven't diffed the data as it's coming from multiple sources, but I expect it was 99.99% identical on every import.
The issue is that the db grows very fast (it reached 3 GB in production after a few days), at which point Chroma becomes unusable: it saturates all CPU cores and never returns the data.
PS - looking at the expansion rate, the db seems to grow by roughly the initial 35 MB on each run.
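To narrow down which tables are responsible for the growth, one option is to inspect per-table row counts in `chroma.sqlite3` between runs using only the standard library. This is a diagnostic sketch; the file path shown in the usage comment assumes Chroma's default persistence layout and may differ in your setup:

```python
import sqlite3


def table_row_counts(db_path: str) -> dict[str, int]:
    """Return row counts per table, to see which tables drive the growth."""
    con = sqlite3.connect(db_path)
    try:
        tables = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )]
        # Quoting guards against table names that clash with SQL keywords.
        return {
            t: con.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
            for t in tables
        }
    finally:
        con.close()


# Usage (adjust the path to your persist directory):
# print(table_row_counts("chroma/chroma.sqlite3"))
```

Running this before and after an import makes it obvious whether the row count of some internal table grows on every run even though the logical data is unchanged.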
Versions
chromadb 0.4.24
python 3.10.10
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009
Relevant log output
No response
Code:
```python
for article in fetch_content_articles(content_type):
    sections = []
    try:
        sections = json.loads(article[3])
    except Exception:
        pass
    content_article = {
        "id": str(article[0]),  # chromadb expects a string, not an integer
        "documents": "".join([
            markdownify.markdownify('<h2>' + doc['sections_title'] + '</h2>' + doc['sections_content'])
            for doc in sections
        ]),
        "metadata": {
            "title": article[1],
            "slug": article[2],
            "image": article[4] or "",
            "updated_at": article[5].timestamp(),  # chromadb expects a timestamp, not a datetime object
            "article_preview": article[7] or "",
            "type": CONTENT_ARTICLES_TYPES[article[8]],
            "geo": "ie" if article[9] == 2 else "uk"
        }
    }
    embeddings.add(
        entity=collection_name,
        ids=[content_article['id']],
        items=[content_article['documents']],
        metadata=[content_article['metadata']]
    )
```
And `embeddings.add()` is defined as:
```python
def add(
    entity: str, ids: list[str], items: list[str], metadata: list[dict] | None = None
):
    try:
        get_collection(entity).upsert(
            ids=ids,
            documents=items,
            metadatas=metadata,
        )
    except Exception as ex:
        print(ex)
```
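Since the data is reported to be ~99.99% identical between runs, one workaround is to skip records that haven't changed before calling `upsert` at all. The sketch below keeps a local hash cache between runs; the function name, cache file name, and record shape (`id`/`documents`/`metadata` keys, as in the import loop above) are illustrative assumptions, not part of the original code:

```python
import hashlib
import json
from pathlib import Path


def filter_changed(records: list[dict], cache_path: str = "import_hashes.json") -> list[dict]:
    """Return only the records whose content changed since the last run.

    Each record is assumed to have 'id', 'documents', and 'metadata' keys.
    Hashes from previous runs are persisted in a small JSON cache file.
    """
    cache_file = Path(cache_path)
    seen = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    changed = []
    for rec in records:
        # Hash the document text plus a canonical form of the metadata.
        digest = hashlib.sha256(
            (rec["documents"] + json.dumps(rec["metadata"], sort_keys=True)).encode()
        ).hexdigest()
        if seen.get(rec["id"]) != digest:
            changed.append(rec)
            seen[rec["id"]] = digest
    cache_file.write_text(json.dumps(seen))
    return changed
```

With this in front of the import loop, a re-run over unchanged source data issues no upserts, so nothing new is appended to Chroma's internal write-ahead log.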
I suspect that most of the expansion here is coming from the WAL. Unfortunately, we don't have first-party support for cleaning the WAL right now, but @tazarov has some community-supported tools.
We hope to add this to the core API.
@essenciary this is an explanation of how the WAL works - https://cookbook.chromadb.dev/core/advanced/wal/
And here's the explanation of how to prune (clean) it up: https://cookbook.chromadb.dev/core/advanced/wal-pruning/. The tooling is here: https://github.com/amikos-tech/chromadb-ops.
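For reference, the final SQL step of the pruning described in that cookbook page amounts to emptying Chroma's internal WAL table and reclaiming the file space. This is a minimal sketch only: the `embeddings_queue` table name is a Chroma 0.4.x internal, and it must only be run against a backup, with Chroma stopped, and only once the HNSW indexes are fully persisted (the linked tooling handles those preconditions for you):

```python
import sqlite3


def prune_wal(db_path: str) -> int:
    """Empty Chroma's internal WAL table and reclaim the space via VACUUM.

    WARNING: run only against a backup, with Chroma stopped, and only if the
    HNSW indexes are fully persisted -- otherwise unflushed embeddings are lost.
    'embeddings_queue' is an internal table name from Chroma 0.4.x.
    """
    con = sqlite3.connect(db_path)
    try:
        deleted = con.execute("SELECT COUNT(*) FROM embeddings_queue").fetchone()[0]
        con.execute("DELETE FROM embeddings_queue")
        con.commit()
        con.execute("VACUUM")  # rewrite the file so the freed pages are released
        return deleted
    finally:
        con.close()
```

The `VACUUM` is what actually shrinks the file on disk; deleting rows alone only marks pages as free inside the database.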
⚠️ ALWAYS make backups 😄