
[Bug]: Upserting the same data causes the SQLite db to grow by 50-100%

Open essenciary opened this issue 1 year ago • 3 comments

What happened?

I'm using Chroma in a Python chat-style app to store what are essentially entities and to do RAG over a few hundred documents. The data is mostly static: it changes rarely, and when it does, by very little (think a few new entities/keywords per hour and a couple more RAG articles per day). However, every time I run the import scripts, even at 1-minute intervals, the SQLite DB grows by 50-100%. For example:

  • run 1: from empty db to 35 MB
  • run 2 (a few minutes later): 62 MB
  • run 3 (a few mins later): 89 MB
  • run 4 (a few mins later): 113 MB

I haven't diffed the data as it's coming from multiple sources, but I expect the data was 99.99% identical on every import.

The issue is that the db grows very fast (it reached 3 GB in production after a few days), at which point Chroma becomes impossible to use (it saturates all the CPU cores and never returns the data).


PS: judging by the expansion rate, the db seems to grow by roughly the initial 35 MB on each run.
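
(Editor's note: a minimal sketch of a reproduction along these lines; the path, ids, and documents are made up for illustration, and the collection falls back to Chroma's default embedding function.)

import os

import chromadb

# Hypothetical reproduction: upsert the exact same records several times
# and watch chroma.sqlite3 grow on every pass.
client = chromadb.PersistentClient(path="./chroma_data")
col = client.get_or_create_collection("articles")

ids = [str(i) for i in range(500)]
docs = [f"identical document {i}" for i in range(500)]

for run in range(1, 5):
    col.upsert(ids=ids, documents=docs)
    size_mb = os.path.getsize("./chroma_data/chroma.sqlite3") / 1_000_000
    print(f"run {run}: {size_mb:.1f} MB")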

Versions

chromadb 0.4.24

python 3.10.10

LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009

Relevant log output

No response

essenciary avatar May 06 '24 19:05 essenciary

Code:

import json

import markdownify

for article in fetch_content_articles(content_type):
    sections = []
    try:
        sections = json.loads(article[3])
    except (TypeError, json.JSONDecodeError):
        pass  # article has no parsable sections; index it without a body

    content_article = {
        "id": str(article[0]),  # chromadb expects a string, not an integer
        "documents": "".join(
            markdownify.markdownify(
                "<h2>" + doc["sections_title"] + "</h2>" + doc["sections_content"]
            )
            for doc in sections
        ),
        "metadata": {
            "title": article[1],
            "slug": article[2],
            "image": article[4] or "",
            "updated_at": article[5].timestamp(),  # chromadb expects a timestamp, not a datetime object
            "article_preview": article[7] or "",
            "type": CONTENT_ARTICLES_TYPES[article[8]],
            "geo": "ie" if article[9] == 2 else "uk",
        },
    }

    embeddings.add(
        entity=collection_name,
        ids=[content_article["id"]],
        items=[content_article["documents"]],
        metadata=[content_article["metadata"]],
    )

where embeddings.add() is defined as:

def add(
    entity: str, ids: list[str], items: list[str], metadata: list[dict] | None = None
):
    try:
        # upsert inserts new ids and overwrites existing ones in place
        get_collection(entity).upsert(
            ids=ids,
            documents=items,
            metadatas=metadata,
        )
    except Exception as ex:
        print(ex)  # log and keep going so one bad record doesn't stop the import
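
(Editor's sketch, not part of the original report: one way to stop identical data from ever reaching the WAL is to skip records that haven't changed since the last import. The add_if_changed helper below is hypothetical and compares the stored updated_at metadata before upserting.)

def add_if_changed(
    entity: str, ids: list[str], items: list[str], metadata: list[dict]
):
    col = get_collection(entity)
    # Look up what is already stored for these ids.
    existing = col.get(ids=ids, include=["metadatas"])
    seen = dict(zip(existing["ids"], existing["metadatas"]))

    # Keep only records that are new or whose updated_at has changed.
    fresh = [
        (i, doc, meta)
        for i, doc, meta in zip(ids, items, metadata)
        if i not in seen
        or (seen[i] or {}).get("updated_at") != meta.get("updated_at")
    ]
    if fresh:
        new_ids, new_docs, new_metas = map(list, zip(*fresh))
        col.upsert(ids=new_ids, documents=new_docs, metadatas=new_metas)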

essenciary avatar May 06 '24 20:05 essenciary

I suspect that most of the expansion here is coming from the WAL. Unfortunately we don't have first-party support for cleaning the WAL right now, but @tazarov has some community-supported tools.

We hope to add this to the core API.

HammadB avatar May 06 '24 22:05 HammadB

@essenciary this is an explanation of how the WAL works - https://cookbook.chromadb.dev/core/advanced/wal/

And here's an explanation of how to prune (clean) it: https://cookbook.chromadb.dev/core/advanced/wal-pruning/. The tooling is here: https://github.com/amikos-tech/chromadb-ops.
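
(Editor's sketch of the stop-prune-vacuum flow those links describe, assuming the WAL lives in the embeddings_queue table as per the cookbook. The chromadb-ops tool also verifies that every WAL entry has been applied to the segments before deleting, which this bare-bones version does not, so prefer the tool.)

import sqlite3

# Stop Chroma and back up chroma.sqlite3 before running this.
db_path = "/path/to/persist_dir/chroma.sqlite3"  # hypothetical path

conn = sqlite3.connect(db_path)
conn.execute("DELETE FROM embeddings_queue")  # drop already-applied WAL rows
conn.commit()
conn.execute("VACUUM")  # reclaim the freed pages so the file shrinks on disk
conn.close()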

⚠️ ALWAYS make backups 😄

tazarov avatar May 07 '24 16:05 tazarov