[Bug]: chromadb.api.configuration.InvalidConfigurationError: batch_size must be less than or equal to sync_threshold
What happened?
from typing import List

import chromadb
from chromadb.api.configuration import HNSWConfiguration
from chromadb.api.models.Collection import Collection
from chromadb.utils.embedding_functions.sentence_transformer_embedding_function import (
    SentenceTransformerEmbeddingFunction,
)

from read_word import extract_titles


class EmbeddingDB:
    def __init__(self, db, embedding_function=None):
        """
        docker pull chromadb/chroma
        docker run -p 8000:8000 chromadb/chroma

        m3_model = "D:/models/BGE_models"
        model = SentenceTransformer(m3_model)
        client = chromadb.HttpClient(host='localhost', port=8000)

        :param db:
        :param embedding_function:
        """
        self.db = db
        self.embedding_function = embedding_function

    def get_or_create_collection(self, name) -> Collection:
        # Currently unused: passing it below is commented out.
        configuration = HNSWConfiguration(batch_size=100, sync_threshold=100)
        if self.embedding_function:
            collection = self.db.get_or_create_collection(
                name=name,
                # embedding_function=self.embedding_function,
                # configuration=configuration
            )
        else:
            collection = self.db.get_or_create_collection(name=name)
        return collection

    def add(self, collection_name: str, string: List[str]):
        """
        :param collection_name: name of the collection
        :param string:
        :return:
        """
        collection = self.get_or_create_collection(collection_name)
        collection.add(
            embeddings=self.embedding_function(string),
            documents=string,
            ids=[f"id{num}" for num in range(len(string))]
        )
        return collection

    def delete_collection(self, name: str) -> None:
        self.db.delete_collection(name=name)


# Model path, as given in the docstring above.
m3_model = "D:/models/BGE_models"
embedding_function1 = SentenceTransformerEmbeddingFunction(model_name=m3_model)
client = chromadb.HttpClient(host='xx.xx.xx.xx', port=8000)
eDB = EmbeddingDB(client, embedding_function1)
titles, docs = extract_titles('wt.docx')


def load_data():
    # eDB.delete_collection('docs')
    # eDB.delete_collection('titles')
    eDB.add("docs", docs)
    eDB.add("titles", titles)


if __name__ == '__main__':
    load_data()
The error occurs on Ubuntu, but it does not occur on Windows.
Versions
v0.5.4, Ubuntu 22 (or CentOS 7.9), Python 3.11.9
Relevant log output
File "/root/proj/datautils.py", line 72, in load_data
eDB.add("docs", docs)
File "/root/proj/datautils.py", line 47, in add
collection = self.get_or_create_collection(collection_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/proj/datautils.py", line 30, in get_or_create_collection
collection = self.db.get_or_create_collection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/client.py", line 166, in get_or_create_collection
model = self._server.get_or_create_collection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 146, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/fastapi.py", line 247, in get_or_create_collection
return self.create_collection(
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 146, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/fastapi.py", line 206, in create_collection
model = CollectionModel.from_json(resp_json)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/types.py", line 139, in from_json
configuration = CollectionConfigurationInternal.from_json(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 217, in from_json
return cls(parameters=parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 115, in __init__
parameter.value = child_type.from_json(parameter.value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 217, in from_json
return cls(parameters=parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 130, in __init__
self.configuration_validator()
File "/root/miniconda3/envs/chroma/lib/python3.11/site-packages/chromadb/api/configuration.py", line 286, in configuration_validator
raise InvalidConfigurationError(
chromadb.api.configuration.InvalidConfigurationError: batch_size must be less than or equal to sync_threshold
I've just spent three evenings tracking down the same bug and have managed to figure this out in the last half hour or so.
I think this is a regression introduced by https://github.com/chroma-core/chroma/pull/2526/files
I'm still figuring out the reproduction steps, but I think the process is
- Deploy chroma and create a collection using <=0.5.4 with metadata={"hnsw:space": "cosine"} or similar. Specifically, for me:
  self.collection = self.vdb.get_or_create_collection(
      name=collection_name,
      embedding_function=self.embedding_function,
      metadata={"hnsw:space": "cosine"},
  )
  This will create the collection with the defaults in 0.5.4, where sync_threshold=100 and batch_size=1000.
- Upgrade your client to 0.5.5.
- The client now checks sync_threshold and batch_size against the existing defaults and throws the error.
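To illustrate why the stored defaults trip the check, here is a minimal sketch of the rule stated in the error message. It does not use chromadb: `validate_hnsw` and the stand-in exception class are hypothetical names for illustration only, and the values are the ones quoted in the steps above.

```python
class InvalidConfigurationError(ValueError):
    """Stand-in for chromadb.api.configuration.InvalidConfigurationError."""


def validate_hnsw(batch_size: int, sync_threshold: int) -> None:
    """Hypothetical re-statement of the rule from the error message."""
    if batch_size > sync_threshold:
        raise InvalidConfigurationError(
            "batch_size must be less than or equal to sync_threshold"
        )


def is_valid(batch_size: int, sync_threshold: int) -> bool:
    """Return True if the pair passes the check, False otherwise."""
    try:
        validate_hnsw(batch_size, sync_threshold)
    except InvalidConfigurationError:
        return False
    return True


# Values stored by 0.5.4 according to the steps above: rejected.
stored_ok = is_valid(batch_size=1000, sync_threshold=100)
# Swapped values: accepted.
swapped_ok = is_valid(batch_size=100, sync_threshold=1000)
```

With these inputs, the pair stored by 0.5.4 fails and the swapped pair passes, which matches the traceback ending in configuration_validator.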
I haven't read through all of the other changes to the HNSW work in 0.5.5, but it looks like there are some changes to persistent properties and similar. I was actually trying to change the configured properties with different metadata definitions, but was having a lot of trouble. Specifically, this was not fixed by changing that code to
self.collection = self.vdb.get_or_create_collection(
    name=collection_name,
    embedding_function=self.embedding_function,
    metadata={"hnsw:space": "cosine", "sync_threshold":1000, "batch_size":100},
)
As a short-term workaround, I would suggest a downgrade to 0.5.4 (this has worked for me) and waiting for a patch, as 0.5.5 is still in pre-release.
@dddxst and @mikethemerry, thanks for reporting and investigating this. Indeed, it was a bug (#2338) released with 0.5.4 which was fixed (#2526) in 0.5.5. The issue is that any DB created with 0.5.4 results in the validation error you reported.
To fix the problem (ideally, we should've added a migration script to do that, but alas):
If in docker:
Connect to your docker container and run:
apt update && apt install sqlite3
sqlite3 /chroma/chroma/chroma.sqlite3 "update collections set config_json_str=json_set(config_json_str,'$.hnsw_configuration.batch_size',100,'$.hnsw_configuration.sync_threshold',1000) where name='test';"
# you don't have to run the below, but for consistency reasons:
sqlite3 /chroma/chroma/chroma.sqlite3 "update collection_metadata set int_value = 100 where key='hnsw:batch_size' and collection_id in (select id from collections where name='test');"
sqlite3 /chroma/chroma/chroma.sqlite3 "update collection_metadata set int_value = 1000 where key='hnsw:sync_threshold' and collection_id in (select id from collections where name='test');"
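If installing the sqlite3 CLI is not convenient, the first update can also be sketched with Python's standard-library sqlite3 module. This is only a sketch under assumptions: the table and column names (`collections`, `name`, `config_json_str`) and the `hnsw_configuration` key are taken from the commands above, `fix_hnsw_config` is a hypothetical helper name, and the JSON is edited in Python instead of with json_set. It is probably wise to try it on a copy of chroma.sqlite3 first.

```python
import json
import sqlite3


def fix_hnsw_config(
    conn: sqlite3.Connection,
    collection_name: str,
    batch_size: int = 100,
    sync_threshold: int = 1000,
) -> bool:
    """Rewrite batch_size/sync_threshold in a collection's config_json_str.

    Returns True if a row was updated, False if the collection was not
    found or has no stored configuration.
    """
    row = conn.execute(
        "SELECT config_json_str FROM collections WHERE name = ?",
        (collection_name,),
    ).fetchone()
    if row is None or row[0] is None:
        return False

    config = json.loads(row[0])
    # Assumed layout: {"hnsw_configuration": {"batch_size": ..., "sync_threshold": ...}}
    hnsw = config.setdefault("hnsw_configuration", {})
    hnsw["batch_size"] = batch_size
    hnsw["sync_threshold"] = sync_threshold

    conn.execute(
        "UPDATE collections SET config_json_str = ? WHERE name = ?",
        (json.dumps(config), collection_name),
    )
    conn.commit()
    return True
```

Usage would look like `fix_hnsw_config(sqlite3.connect("/chroma/chroma/chroma.sqlite3"), "test")`, with the path and collection name from the commands above.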
@mikethemerry, thanks to you it took me only three minutes to solve my problem instead of three evenings...
Thanks, it works after updating to 0.5.5, but now the error occurs on Windows ...
Thanks.
Can you share the error you get on Windows?
Hey everyone--I believe this is caused by a version mismatch; this shouldn't happen if your client and server are on the same version. Please make sure that your server and client are both on 0.5.5 and let us know if this is still happening.
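A quick way to check for such a mismatch, as a minimal sketch: compare the installed client version with the version the server reports. This assumes `chromadb.__version__` and `client.get_version()` behave as I understand them in 0.5.x; `versions_match` and `check_chroma_versions` are hypothetical helper names, and the host and port are placeholders.

```python
def versions_match(client_version: str, server_version: str) -> bool:
    """Return True when both version strings are identical after trimming."""
    return client_version.strip() == server_version.strip()


def check_chroma_versions(host: str, port: int = 8000) -> bool:
    """Compare the installed chromadb client with a running server.

    chromadb is imported lazily so versions_match can be used without it.
    """
    import chromadb  # assumed to be installed in the client environment

    client = chromadb.HttpClient(host=host, port=port)
    # Assumption: get_version() returns the server's version as a string.
    return versions_match(chromadb.__version__, client.get_version())
```

For example, `check_chroma_versions("localhost")` should return False for a 0.5.5 client talking to a 0.5.4 server, if the assumptions above hold.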