chroma
[Bug]: sqlite3.OperationalError: database or disk is full
What happened?
What Happened:
- Encountered an error with a SQLite database in a Docker container environment.
- The error message was sqlite3.OperationalError: database or disk is full.
- This issue occurred despite the host machine having sufficient disk space.
- The SQLite database file size was found to be approximately 4.1 GB.
- The Docker container settings and host machine settings were checked for potential causes of the error.
Expected Behavior:
- The SQLite database should operate without encountering a 'disk is full' error, especially considering that the host machine had adequate disk space.
- Given the size of the SQLite file (4.1 GB) and the typical capabilities of SQLite and the Docker environment, normal database operations such as data insertion, updating, and querying were expected to occur without errors related to disk space.
- The expectation was that the Docker container's configuration and the host system's file system would support the operation of a database of this size without triggering disk space-related errors.
Versions
ChromaDB 0.4.9, Python 3.10
Relevant log output
sqlite3.OperationalError: database or disk is full
INFO: [02-02-2024 04:10:33] 3.131.62.47:40862 - "POST /api/v1/collections/559a54f0-9471-48af-98af-4d19c5fbd2db/add HTTP/1.1" 500
INFO: [02-02-2024 04:10:33] 3.131.62.47:40862 - "POST /api/v1/collections/559a54f0-9471-48af-98af-4d19c5fbd2db/query HTTP/1.1" 200
ERROR: [02-02-2024 04:10:34] database or disk is full
@sachinchawla, you are using a relatively old version of Chroma in which Chroma data was stored inside the container unless you have a custom docker compose file or docker command with mounts. If you are running on Linux, this might not be a problem, but on Windows and Mac Docker runs in a VM, so the space available to the container can be much smaller than the free space on the host.
Traceback (most recent call last):
File "/home/richard/book-mentat/src/chroma_info_custom.py", line 43, in <module>
batch = collection.get()
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 211, in get
get_results = self._client._get(
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
return f(*args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/rate_limiting/__init__.py", line 45, in wrapper
return f(self, *args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/segment.py", line 517, in _get
records = metadata_segment.get_metadata(
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
return f(*args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 216, in get_metadata
return list(self._records(cur, q))
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 225, in _records
cur.execute(sql, params)
sqlite3.OperationalError: database or disk is full
The database is 37 GB and sits on a drive with 2 TB free, so there is plenty of space available. Is there some sort of temp space problem?
chroma 0.2.0 pypi_0 pypi
chroma-hnswlib 0.7.3 pypi_0 pypi
chromadb 0.5.0 pypi_0 pypi
python 3.10.14 hd12c33a_0_cpython conda-forge
This is on trying to query - database is still allowing data to go in.
@RichardScottOZ, if you are running in a container, can you run:
docker exec -it <container_name_or_id> df -h /chroma/chroma
Let's see what your container reports as spare disk size.
Hi, thanks. Not running in a container, just installed it on an Ubuntu server.
A note - I thought it could have been the size of the get, so I tried this:
Traceback (most recent call last):
File "/home/richard/book-mentat/src/chroma_info_custom_loop.py", line 46, in <module>
ids_only_result = collection.get(include=[])
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 211, in get
get_results = self._client._get(
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
return f(*args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/rate_limiting/__init__.py", line 45, in wrapper
return f(self, *args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/api/segment.py", line 517, in _get
records = metadata_segment.get_metadata(
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 143, in wrapper
return f(*args, **kwargs)
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 216, in get_metadata
return list(self._records(cur, q))
File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/chromadb/segment/impl/metadata/sqlite.py", line 225, in _records
cur.execute(sql, params)
sqlite3.OperationalError: database or disk is full
Is there some sort of integer limit or anything this might hit? It is late, I have not looked at the repo code as yet to try and work it out, will do tomorrow.
I can query a model using an index fine - so it seems like it is a collection information issue, not a db issue.
Hey @RichardScottOZ, thanks for confirming. Let's do the following:
See how much space you have in persist dir:
df -h /path/to/chroma_persist
Let's check how much space you have in your /tmp although I'm skeptical sqlite3 uses it:
df -h /tmp
Check the max_page_count of the SQLite:
sqlite3 /path/to/chroma_persist/chroma.sqlite3 "PRAGMA max_page_count;"
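The two `df` checks above can also be run from Python when shell access is awkward. This is a minimal sketch: the function name `report_free_space` is hypothetical and not part of Chroma, and it reports Python's default temp directory, which follows `TMPDIR` much as SQLite does on Unix but is not guaranteed to be the exact directory SQLite picks.

```python
import shutil
import tempfile


def report_free_space(persist_dir: str) -> dict:
    """Return free bytes for the persist dir and for the default temp dir."""
    # gettempdir() honours TMPDIR and usually falls back to /tmp on Linux.
    tmp_dir = tempfile.gettempdir()
    return {
        "persist_dir": shutil.disk_usage(persist_dir).free,
        "tmp_dir": shutil.disk_usage(tmp_dir).free,
    }


if __name__ == "__main__":
    # "." is a placeholder; pass the real Chroma persist directory instead.
    for name, free in report_free_space(".").items():
        print(f"{name}: {free / 1e9:.1f} GB free")
```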
The disk Chroma is on has 2.5 TB free; /tmp has 8 GB.
On the page count: do I run that through sqlite3 in Python?
@RichardScottOZ, if you are on Linux you can install the sqlite3 command-line tool; on Debian-based distros, for example, run sudo apt update && sudo apt install sqlite3. The sqlite3 executable will then be on your path. Once it is installed, you can copy and paste the example above (adjust the path).
yeah, had never needed it - will take a look
$ sqlite3 /mnt/usb_mount/chroma/Calibre\ Books/chroma.sqlite3 "PRAGMA max_page_count;"
1073741823
quite a big number
@RichardScottOZ, you are right. 1073741823 pages * 4096 bytes per page ~ 4.4TB max size of the sqlite3 file. So the size of your sqlite3 file (37GB) is not a problem and we can rule it out.
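The same limit can be read from Python's built-in `sqlite3` module without installing the CLI. A small sketch, assuming the database path is supplied by the caller (the helper name `max_db_size_bytes` is hypothetical). Note that `max_page_count` is a per-connection limit, so the value may reflect the SQLite build linked into Python rather than the one the CLI uses.

```python
import sqlite3


def max_db_size_bytes(db_path: str) -> int:
    """Upper bound on the database file size: max_page_count * page_size."""
    conn = sqlite3.connect(db_path)
    try:
        max_pages = conn.execute("PRAGMA max_page_count").fetchone()[0]
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    finally:
        conn.close()
    return max_pages * page_size


# The arithmetic from the comment above, with the reported page count
# and a 4096-byte page: roughly 4.4e12 bytes, i.e. about 4.4 TB.
reported_limit = 1073741823 * 4096
print(f"{reported_limit / 1e12:.1f} TB")
```

Calling `max_db_size_bytes("/path/to/chroma_persist/chroma.sqlite3")` with the real path should give a figure far above 37 GB, which is why the file size limit can be ruled out.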
Let's examine the nature of your workload now. You said that ingestion is fine, but the query causes an issue. Can you elaborate on your query? Can you share a snippet + how many results do you expect it to return?
When it stopped working, the collection likely had about 7000 books. I was trying to get the names of all of them and list them in alphabetical order, to see where things were up to.
This is a bit convoluted, but it was working previously:
batch = collection.get()
print(len(batch))
for b in batch:
    print(b)
count = 0
file_dict = {}
for x in range(len(batch["documents"])):
    doc = batch["metadatas"][x]
    print(doc['file_name'])
    count += 1
    file_dict[doc['file_name']] = 1
print(count)
print(file_dict)
print(len(file_dict))
sorted_dict = dict(sorted(file_dict.items()))
for key in sorted_dict:
    print(key)
print(len(sorted_dict))
@RichardScottOZ, ok I think I understand now what might be the culprit here. SQLite uses temp storage for large result sets. In your case it ends up in /tmp (see https://www.sqlite.org/tempfiles.html). On a 37GB DB, there is a good chance that your collection.get() returns a huge number of results, thus overflowing /tmp storage capacity (hence the error). It is possible to specify the temp path via PRAGMA, but that is a code change in Chroma that we need to consider further.
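Until such a change is made, one possible workaround is to redirect SQLite's temp files through the environment. On Unix, SQLite consults the `SQLITE_TMPDIR` and then `TMPDIR` environment variables when choosing where to create temporary files, so setting one of them before the database is first opened may be enough. This is only a sketch and has not been confirmed in this thread; the helper name `use_sqlite_tmpdir` is hypothetical.

```python
import os


def use_sqlite_tmpdir(tmp_dir: str) -> None:
    """Ask SQLite (Unix builds) to create its temp files under tmp_dir."""
    os.makedirs(tmp_dir, exist_ok=True)
    # Set this before SQLite creates its first temp file in this process,
    # i.e. before the Chroma client is constructed.
    os.environ["SQLITE_TMPDIR"] = tmp_dir
```

The directory should live on a volume with more free space than `/tmp`, for example somewhere on the same drive as the persist directory.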
In the meantime, can I ask you to try and paginate your collection.get() (see this code snippet for inspiration - https://cookbook.chromadb.dev/core/collections/#cloning-a-collection). Let me know the results.
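A paginated read might look like the sketch below. It relies on the `limit`, `offset`, and `include` parameters of `collection.get()`; the generator name `iter_metadatas` and the page size of 1000 are assumptions chosen for illustration, not values from this thread.

```python
from typing import Any, Dict, Iterator


def iter_metadatas(collection: Any, page_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield metadata dicts page by page instead of loading the whole collection."""
    offset = 0
    while True:
        # Request only metadata; ids are always returned alongside.
        page = collection.get(include=["metadatas"], limit=page_size, offset=offset)
        ids = page["ids"]
        if not ids:
            break
        for meta in page["metadatas"]:
            if meta is not None:  # records stored without metadata come back as None
                yield meta
        offset += len(ids)
```

The sorted list of file names from the original script then becomes `sorted({m["file_name"] for m in iter_metadatas(collection)})`, with each SQLite query returning only one page of rows.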
So temp space as considered above. Will try the above tomorrow thanks.
splitting into sizeable chunks worked for the above use anyway, thanks
Hi @tazarov, I am facing the same issue with the code below. Is this fixed yet, or what is the current workaround?
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_nomic.embeddings import NomicEmbeddings
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=NomicEmbeddings(model="nomic-embed-text-v1.5", inference_mode="local"),
)
retriever = vectorstore.as_retriever()
Here is the output error:
File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py:146, in trace_method.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
144 global tracer, granularity
145 if trace_granularity < granularity:
--> 146 return f(*args, **kwargs)
147 if not tracer:
148 return f(*args, **kwargs)
File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/api/segment.py:445, in SegmentAPI._upsert(self, collection_id, ids, embeddings, metadatas, documents, uris)
434 records_to_submit = list(
435 _records(
436 t.Operation.UPSERT,
(...)
442 )
443 )
444 self._validate_embedding_record_set(coll, records_to_submit)
--> 445 self._producer.submit_embeddings(collection_id, records_to_submit)
447 return True
File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/telemetry/opentelemetry/__init__.py:146, in trace_method.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
144 global tracer, granularity
145 if trace_granularity < granularity:
--> 146 return f(*args, **kwargs)
147 if not tracer:
148 return f(*args, **kwargs)
File /srv/data/anaconda3/envs/chask/lib/python3.10/site-packages/chromadb/db/mixins/embeddings_queue.py:239, in SqlEmbeddingsQueue.submit_embeddings(self, collection_id, embeddings)
236 # The returning clause does not guarantee order, so we need to do reorder
237 # the results. https://www.sqlite.org/lang_returning.html
238 sql = f"{sql} RETURNING seq_id, id" # Pypika doesn't support RETURNING
--> 239 results = cur.execute(sql, params).fetchall()
240 # Reorder the results
241 seq_ids = [cast(SeqId, None)] * len(
242 results
243 ) # Lie to mypy: https://stackoverflow.com/questions/76694215/python-type-casting-when-preallocating-list
OperationalError: database or disk is full
Here is my /tmp space allocation: