[Bug]: Getting KeyError on using filter
What happened?
Seeing a KeyError when using a filter:

```python
self._index.similarity_search_by_vector_with_relevance_scores(
    embedding=np.array(OpenAIEmbeddings().embed_documents(["door"]))[0].tolist(),
    k=num_records_to_retrieve,
    filter={'finish': {"$ne": ""}},
)
```

I get:

```
KeyError: 'bd294fd5f8a044e8bebf67e005b102f3'
```
If I do:

```python
self._index.similarity_search_by_vector_with_relevance_scores(
    embedding=np.array(OpenAIEmbeddings().embed_documents(["door"]))[0].tolist(),
    k=num_records_to_retrieve,
    filter={},
)
```

I get:

```
[(Document(page_content='nan', metadata={'finish': 'Sealant-Coated', 'size': '28.58 x 51.99 x 7.95'}), 0.3449656069278717)]
```
Versions
chromadb==0.3.25, Python==3.9.15, langchain==0.0.352
Relevant log output
No response
Here, `_index` is:

```python
from langchain.vectorstores import Chroma

_index: Chroma
```
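As a stopgap until the filtered path works, the `$ne` condition can be applied client-side after querying with `filter={}`. The helper below is a hypothetical sketch (not a LangChain or Chroma API) that operates on the `(Document, score)` tuples returned by `similarity_search_by_vector_with_relevance_scores`; note that its treatment of documents missing the key may not match Chroma's exact `$ne` semantics:

```python
# Stopgap: apply a Chroma-style "$ne" metadata filter client-side.
# `results` is a list of (document, score) tuples; only each document's
# `.metadata` attribute is used. Documents missing the key are dropped,
# which may differ from Chroma's exact "$ne" semantics.

def post_filter_ne(results, field, value):
    """Keep results whose metadata[field] exists and is not equal to value."""
    kept = []
    for doc, score in results:
        meta = getattr(doc, "metadata", {}) or {}
        if field in meta and meta[field] != value:
            kept.append((doc, score))
    return kept
```

For the query above, `post_filter_ne(results, "finish", "")` approximates `filter={'finish': {"$ne": ""}}`.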
Facing the same issue here: without a filter I am able to query the vector DB, but with a filter I get a KeyError.
@wk-vaid, what version of Langchain🦜🔗 are you using and can you share a snippet of your code that results in the error?
I am facing the same issue, and I can see that the problem is not with LangChain but with Chroma itself. The same error is thrown when using the Chroma function directly rather than the LangChain one:

```python
index._collection.query(query_embeddings=[embeddings], where={'filter': 'this_filter_finds_a_match'})
```

When using a filter that I know matches nothing, there is no error and the return is an empty list, as expected. When filtering on metadata that does find a match, the error is thrown.
I am using chromadb==0.5.0, langchain-community==0.0.36, and HuggingFaceInstructEmbeddings with InstructorEmbedding==1.0.1.
@thomaspile, do you mind sharing the stack trace?
@tazarov

```
KeyError: '1844cf0e8c1d466fb619be74873831ca'

KeyError                                  Traceback (most recent call last)
File

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/api/models/Collection.py:223, in Collection.query(self, query_embeddings, query_texts, n_results, where, where_document, include)
    220 if where_document is None:
    221     where_document = {}
--> 223 return self._client._query(
    224     collection_id=self.id,
    225     query_embeddings=query_embeddings,
    226     n_results=n_results,
    227     where=where,
    228     where_document=where_document,
    229     include=include,
    230 )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/api/local.py:457, in LocalAPI._query(self, collection_id, query_embeddings, n_results, where, where_document, include)
    447 @override
    448 def _query(
    449     self,
   (...)
    455     include: Include = ["documents", "metadatas", "distances"],
    456 ) -> QueryResult:
--> 457 uuids, distances = self._db.get_nearest_neighbors(
    458     collection_uuid=collection_id,
    459     where=where,
    460     where_document=where_document,
    461     embeddings=query_embeddings,
    462     n_results=n_results,
    463 )
    465 include_embeddings = "embeddings" in include
    466 include_documents = "documents" in include

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/clickhouse.py:613, in Clickhouse.get_nearest_neighbors(self, collection_uuid, where, embeddings, n_results, where_document)
    610 ids = None
    612 index = self._index(collection_uuid)
--> 613 uuids, distances = index.get_nearest_neighbors(embeddings, n_results, ids)
    615 return uuids, distances

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/index/hnswlib.py:285, in Hnswlib.get_nearest_neighbors(self, query, k, ids)
    283 labels: Set[int] = set()
    284 if ids is not None:
--> 285     labels = {self._id_to_label[hexid(id)] for id in ids}
    286 if len(labels) < k:
    287     k = len(labels)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/index/hnswlib.py:285, in
```
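The last frame is where the lookup fails: the ids pre-selected by the metadata filter are mapped to HNSW labels via `self._id_to_label[hexid(id)]`, and the KeyError means the metadata store returned an id that the persisted index does not know about. A purely illustrative, self-contained sketch of that failure mode (the dict contents are made up, and this is not Chroma's actual code):

```python
# Illustrative sketch of the 0.3.x failure mode (made-up data): the metadata
# store matches some ids against the filter, but the persisted HNSW index's
# id->label map is out of sync, so the dict lookup raises KeyError.
id_to_label = {"aaa": 0, "bbb": 1}  # ids the HNSW index knows about
ids_from_filter = ["aaa", "1844cf0e8c1d466fb619be74873831ca"]  # ids the filter matched

error = None
try:
    # mirrors: labels = {self._id_to_label[hexid(id)] for id in ids}
    labels = {id_to_label[i] for i in ids_from_filter}
except KeyError as exc:
    error = exc  # carries the unknown id, just like the reported KeyError
```

This is why an unfiltered query succeeds (no id lookup happens) while any filter that finds a match fails.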
@thomaspile, this frame:

```
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/clickhouse.py:613
```

tells me that you are using an old version of Chroma, possibly in the 0.3.x range; Chroma dropped ClickHouse support in 0.4.0. Is it possible for you to upgrade to 0.5.0?
@tazarov
Ah yes, I had to downgrade Chroma because version 0.4.0 and above results in `OperationalError: disk I/O error`. Do you have any idea why that might be?
Here is the stack:
```
File <command-410812424293831>, line 69, in Embedder.import_index(self)
     65 def import_index(self):
     67     print('Importing index')
---> 69     self.index = Chroma(persist_directory=f'{self.directory}',
     70                         embedding_function=self.embedding_service,
     71                         collection_metadata={"hnsw:space": "cosine"})
     73     self.index_ids = self.index.get()['ids']

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/langchain_community/vectorstores/chroma.py:121, in Chroma.__init__(self, collection_name, embedding_function, persist_directory, client_settings, collection_metadata, client, relevance_score_fn)
    119 _client_settings = chromadb.config.Settings()
    120 self._client_settings = _client_settings
--> 121 self._client = chromadb.Client(_client_settings)
    122 self._persist_directory = (
    123     _client_settings.persist_directory or persist_directory
    124 )
    126 self._embedding_function = embedding_function

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/__init__.py:145, in Client(settings)
    142 telemetry_client = system.instance(Telemetry)
    143 api = system.instance(API)
--> 145 system.start()
    147 # Submit event for client start
    148 telemetry_client.capture(ClientStartEvent())

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/config.py:268, in System.start(self)
    266 super().start()
    267 for component in self.components():
--> 268     component.start()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:93, in SqliteDB.start(self)
     91 cur.execute("PRAGMA foreign_keys = ON")
     92 cur.execute("PRAGMA case_sensitive_like = ON")
---> 93 self.initialize_migrations()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/migrations.py:128, in MigratableDB.initialize_migrations(self)
    125 self.validate_migrations()
    127 if migrate == "apply":
--> 128     self.apply_migrations()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/migrations.py:147, in MigratableDB.apply_migrations(self)
    145 def apply_migrations(self) -> None:
    146     """Validate existing migrations, and apply all new ones."""
--> 147     self.setup_migrations()
    148     for dir in self.migration_dirs():
    149         db_migrations = self.db_migrations(dir)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:149, in SqliteDB.setup_migrations(self)
    147 @override
    148 def setup_migrations(self) -> None:
--> 149     with self.tx() as cur:
    150         cur.execute(
    151             """
    152             CREATE TABLE IF NOT EXISTS migrations (
   (...)
    160             """
    161         )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:47, in TxWrapper.__exit__(self, exc_type, exc_value, traceback)
     45 if len(self._tx_stack.stack) == 0:
     46     if exc_type is None:
---> 47         self._conn.commit()
     48     else:
     49         self._conn.rollback()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/impl/sqlite_pool.py:31, in Connection.commit(self)
     30 def commit(self) -> None:
---> 31     self._conn.commit()

OperationalError: disk I/O error
```
@thomaspile, there is a bit of a migration procedure if you are starting from 0.3.x (https://docs.trychroma.com/migration#migration-from-040-to-040---july-17-2023).

The steps for migration would be as follows:

- Start with the above link, but install Chroma 0.4.15
- Once you have successfully upgraded from 0.3.x to 0.4.15, upgrade Chroma to 0.5.0 and try to access your DB

If the above seems too complex, you can also export your data and import it again; a CSV or similar format should do fine.
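The export/import route can be sketched roughly as follows. The function names and CSV layout are my own choices; the collections are assumed to expose Chroma's `Collection.get()`/`Collection.add()` API, and embeddings are assumed to come back as plain Python lists:

```python
import csv
import json

# Rough sketch of "export, then re-import into a fresh collection".
# Assumptions: `collection` exposes Chroma's Collection.get()/add() API,
# get() returns embeddings as plain lists, and the CSV layout
# (id, document, metadata-as-JSON, embedding-as-JSON) is just one choice.

def export_to_csv(collection, path):
    data = collection.get(include=["documents", "metadatas", "embeddings"])
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "document", "metadata", "embedding"])
        for i, doc_id in enumerate(data["ids"]):
            writer.writerow([
                doc_id,
                data["documents"][i],
                json.dumps(data["metadatas"][i]),
                json.dumps(data["embeddings"][i]),
            ])

def import_from_csv(collection, path):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    collection.add(
        ids=[r["id"] for r in rows],
        documents=[r["document"] for r in rows],
        metadatas=[json.loads(r["metadata"]) for r in rows],
        embeddings=[json.loads(r["embedding"]) for r in rows],
    )
```

Exporting the embeddings alongside the documents avoids having to re-embed everything on import.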
Thanks @tazarov, however I am just looking to start a new index from scratch, so no migration is necessary. The error happens regardless.
I think the error you are encountering has to do with your storage medium.

Looking at `/local_disk0/.ephemeral_nfs`, I can assume this is some sort of block storage. Can you elaborate on how you are running Chroma (container, `chroma run`, etc.)?
@tazarov I am running on Azure Databricks and importing Chroma with `from langchain_community.vectorstores import Chroma`.
If my assumption is correct, you are using some sort of shared storage that relies on NFS. NFS is inherently not a good fit for Chroma workloads and will occasionally result in the I/O error above.

Can you point the persist directory to another location?
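If a node-local, non-NFS path is available, one option is a small helper that picks the first writable candidate directory and falls back to a temp dir. The candidate paths below are assumptions about a typical Databricks node layout, not verified defaults:

```python
import os
import tempfile

# Hypothetical helper: pick a node-local directory for Chroma's
# persist_directory, falling back to a fresh temp dir when none of the
# candidates is creatable/writable. The candidate paths are assumptions
# about a typical Databricks node, not verified defaults.

def pick_persist_dir(candidates=("/local_disk0/chroma", "/tmp/chroma")):
    for path in candidates:
        try:
            os.makedirs(path, exist_ok=True)
            if os.access(path, os.W_OK):
                return path
        except OSError:
            continue
    return tempfile.mkdtemp(prefix="chroma-")

# The result can then be handed to the LangChain wrapper, e.g.:
#   index = Chroma(persist_directory=pick_persist_dir(),
#                  embedding_function=embedding_service)
```

Note that a temp-dir fallback means the index will not survive cluster restarts; it only sidesteps the NFS I/O error.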
Ah right, I see. Unfortunately I can't use any other location, so I've decided to use Cosmos DB for now instead. Thanks for your help anyway, much appreciated.