
[Bug]: Getting KeyError on using filter

Open shivamtd opened this issue 1 year ago • 13 comments

What happened?

Seeing a KeyError when using a filter

```python
self._index.similarity_search_by_vector_with_relevance_scores(
    embedding=np.array(OpenAIEmbeddings().embed_documents(["door"]))[0].tolist(),
    k=num_records_to_retrieve,
    filter={"finish": {"$ne": ""}},
)
```

I get `KeyError: 'bd294fd5f8a044e8bebf67e005b102f3'`.

If I do

```python
self._index.similarity_search_by_vector_with_relevance_scores(
    embedding=np.array(OpenAIEmbeddings().embed_documents(["door"]))[0].tolist(),
    k=num_records_to_retrieve,
    filter={},
)
```

I get:

```python
[(Document(page_content='nan', metadata={'finish': 'Sealant-Coated', 'size': '28.58 x 51.99 x 7.95'}), 0.3449656069278717)]
```
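For reference, here is a minimal pure-Python sketch of the `$ne` filter semantics the call above relies on. The `matches` helper is hypothetical, written only to illustrate the expected behavior; it is not Chroma's implementation.

```python
# Hypothetical helper mimicking the {'$ne': value} metadata filter semantics.
# Not Chroma's actual code; just the behavior the query above expects.
def matches(metadata: dict, where: dict) -> bool:
    for key, cond in where.items():
        if isinstance(cond, dict) and "$ne" in cond:
            if metadata.get(key) == cond["$ne"]:
                return False
        elif metadata.get(key) != cond:
            return False
    return True

records = [
    {"finish": "Sealant-Coated", "size": "28.58 x 51.99 x 7.95"},
    {"finish": ""},  # should be excluded by the $ne filter
]
hits = [m for m in records if matches(m, {"finish": {"$ne": ""}})]
# hits keeps only the record with a non-empty 'finish'
```

With a healthy index, the filtered query should therefore return the Sealant-Coated record rather than raise a KeyError.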

Versions

Chromadb==0.3.25, Python==3.9.15, langchain==0.0.352

Relevant log output

No response

shivamtd avatar Feb 01 '24 17:02 shivamtd

Here

```python
from langchain.vectorstores import Chroma
_index: Chroma
```

shivamtd avatar Feb 01 '24 18:02 shivamtd

Facing the same issue here... without a filter I'm able to query the vector DB; with a filter I get the KeyError.

wk-vaid avatar Feb 02 '24 04:02 wk-vaid

@wk-vaid, what version of LangChain🦜🔗 are you using, and can you share a snippet of the code that results in the error?

tazarov avatar May 08 '24 12:05 tazarov

I am facing the same issue. I can see that the problem is not with LangChain but with Chroma itself: the same error is thrown by `index._collection.query(query_embeddings=[embeddings], where={'filter':'this_filter_finds_a_match'})`, which is the Chroma function rather than the LangChain one.

When using a filter that I know yields no matches, there is no error and the return is an empty list, as expected. When filtering on metadata that does find a match, the error is thrown.

I am using chromadb==0.5.0, langchain-community==0.0.36, HuggingFaceInstructEmbeddings, and InstructorEmbedding==1.0.1.

thomaspile avatar May 08 '24 14:05 thomaspile

@thomaspile, do you mind sharing the stack trace?

tazarov avatar May 08 '24 15:05 tazarov

@tazarov

```
KeyError: '1844cf0e8c1d466fb619be74873831ca'

KeyError                                  Traceback (most recent call last)
File , line 3
      1 query = 'good document'
      2 embedded = embedding_service.embed_query(query)
----> 3 index._collection.query(query_embeddings=[embedded], where={"author": "author1"})

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/api/models/Collection.py:223, in Collection.query(self, query_embeddings, query_texts, n_results, where, where_document, include)
    220 if where_document is None:
    221     where_document = {}
--> 223 return self._client._query(
    224     collection_id=self.id,
    225     query_embeddings=query_embeddings,
    226     n_results=n_results,
    227     where=where,
    228     where_document=where_document,
    229     include=include,
    230 )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/api/local.py:457, in LocalAPI._query(self, collection_id, query_embeddings, n_results, where, where_document, include)
    447 @override
    448 def _query(
    449     self,
        (...)
    455     include: Include = ["documents", "metadatas", "distances"],
    456 ) -> QueryResult:
--> 457 uuids, distances = self._db.get_nearest_neighbors(
    458     collection_uuid=collection_id,
    459     where=where,
    460     where_document=where_document,
    461     embeddings=query_embeddings,
    462     n_results=n_results,
    463 )
    465 include_embeddings = "embeddings" in include
    466 include_documents = "documents" in include

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/clickhouse.py:613, in Clickhouse.get_nearest_neighbors(self, collection_uuid, where, embeddings, n_results, where_document)
    610 ids = None
    612 index = self._index(collection_uuid)
--> 613 uuids, distances = index.get_nearest_neighbors(embeddings, n_results, ids)
    615 return uuids, distances

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/index/hnswlib.py:285, in Hnswlib.get_nearest_neighbors(self, query, k, ids)
    283 labels: Set[int] = set()
    284 if ids is not None:
--> 285     labels = {self._id_to_label[hexid(id)] for id in ids}
    286 if len(labels) < k:
    287     k = len(labels)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/index/hnswlib.py:285, in <setcomp>(.0)
    283 labels: Set[int] = set()
    284 if ids is not None:
--> 285     labels = {self._id_to_label[hexid(id)] for id in ids}
    286 if len(labels) < k:
    287     k = len(labels)
```
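The failing frame is the id-to-label lookup in `hnswlib.py`: when the metadata filter returns an id that the HNSW index does not know about (for example, because the index and metadata store are out of sync), the set comprehension raises a KeyError carrying the id's hex string. A stdlib-only sketch of that failure mode, where `hexid` is a hedged stand-in for Chroma's internal helper:

```python
# Sketch of the failing line in chromadb/db/index/hnswlib.py (0.3.x):
#     labels = {self._id_to_label[hexid(id)] for id in ids}
import uuid

def hexid(value):
    # assumed behavior: normalize a UUID to its 32-char hex string
    return value.hex if isinstance(value, uuid.UUID) else value

id_to_label = {uuid.uuid4().hex: 0}   # the HNSW index knows exactly one id
stale_id = uuid.uuid4()               # id from the metadata filter, unknown to the index

try:
    labels = {id_to_label[hexid(i)] for i in [stale_id]}
except KeyError as err:
    # same shape as the reported error: KeyError('<32-char hex id>')
    print(f"KeyError: {err}")
```

This matches the observation above: an empty filter result never touches the lookup, while a match forces it and exposes the stale id.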

thomaspile avatar May 08 '24 15:05 thomaspile

@thomaspile,

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/chromadb/db/clickhouse.py:613

This tells me you are using an old version of Chroma, probably in the 0.3.x range; Chroma dropped ClickHouse support in 0.4.0. Is it possible for you to upgrade to 0.5.0?

tazarov avatar May 08 '24 16:05 tazarov

@tazarov

Ah yes, I had to downgrade Chroma because version 0.4.0 and above results in an `OperationalError: disk I/O error`. Do you have any idea why that might be?

Here is the stack:

File <command-410812424293831>, line 69, in Embedder.import_index(self)
     65 def import_index(self):
     67     print('Importing index')
---> 69     self.index = Chroma(persist_directory=f'{self.directory}',
     70                         embedding_function=self.embedding_service,
     71                         collection_metadata={"hnsw:space": "cosine"}) 
     73     self.index_ids = self.index.get()['ids']

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/langchain_community/vectorstores/chroma.py:121, in Chroma.__init__(self, collection_name, embedding_function, persist_directory, client_settings, collection_metadata, client, relevance_score_fn)
    119         _client_settings = chromadb.config.Settings()
    120     self._client_settings = _client_settings
--> 121     self._client = chromadb.Client(_client_settings)
    122     self._persist_directory = (
    123         _client_settings.persist_directory or persist_directory
    124     )
    126 self._embedding_function = embedding_function

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/__init__.py:145, in Client(settings)
    142 telemetry_client = system.instance(Telemetry)
    143 api = system.instance(API)
--> 145 system.start()
    147 # Submit event for client start
    148 telemetry_client.capture(ClientStartEvent())

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/config.py:268, in System.start(self)
    266 super().start()
    267 for component in self.components():
--> 268     component.start()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:93, in SqliteDB.start(self)
     91     cur.execute("PRAGMA foreign_keys = ON")
     92     cur.execute("PRAGMA case_sensitive_like = ON")
---> 93 self.initialize_migrations()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/migrations.py:128, in MigratableDB.initialize_migrations(self)
    125     self.validate_migrations()
    127 if migrate == "apply":
--> 128     self.apply_migrations()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/migrations.py:147, in MigratableDB.apply_migrations(self)
    145 def apply_migrations(self) -> None:
    146     """Validate existing migrations, and apply all new ones."""
--> 147     self.setup_migrations()
    148     for dir in self.migration_dirs():
    149         db_migrations = self.db_migrations(dir)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:149, in SqliteDB.setup_migrations(self)
    147 @override
    148 def setup_migrations(self) -> None:
--> 149     with self.tx() as cur:
    150         cur.execute(
    151             """
    152              CREATE TABLE IF NOT EXISTS migrations (
   (...)
    160              """
    161         )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/impl/sqlite.py:47, in TxWrapper.__exit__(self, exc_type, exc_value, traceback)
     45 if len(self._tx_stack.stack) == 0:
     46     if exc_type is None:
---> 47         self._conn.commit()
     48     else:
     49         self._conn.rollback()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-b60488cf-2c0f-4e55-aa5e-7e10b2665d31/lib/python3.10/site-packages/chromadb/db/impl/sqlite_pool.py:31, in Connection.commit(self)
     30 def commit(self) -> None:
---> 31     self._conn.commit()

OperationalError: disk I/O error

thomaspile avatar May 09 '24 10:05 thomaspile

@thomaspile, there is a bit of a migration procedure if you start with 0.3.x (https://docs.trychroma.com/migration#migration-from-040-to-040---july-17-2023).

The steps for migration would be as follows:

  • Start with the link above, but install Chroma 0.4.15 (or an earlier 0.4.x release)
  • Once you have successfully upgraded from 0.3.x to 0.4.x, upgrade Chroma to 0.5.0 and try to access your DB

If the above seems too complex, you can instead export your data and re-import it; a CSV or similar format should do fine.
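The export/re-import route can be sketched with the standard library alone. The `collection.get()`/`collection.add()` calls are shown only as comments, since the exact Chroma calls depend on your version; the data below is placeholder content:

```python
# Hedged sketch: dump ids, documents, and metadatas to CSV, then re-add them
# to a freshly created collection after upgrading Chroma.
import csv, io, json

# shapes as returned by collection.get(include=["documents", "metadatas"])
ids = ["id1", "id2"]
documents = ["door seal spec", "window spec"]
metadatas = [{"finish": "Sealant-Coated"}, {"finish": ""}]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "document", "metadata_json"])
for i, d, m in zip(ids, documents, metadatas):
    writer.writerow([i, d, json.dumps(m)])  # JSON-encode metadata to keep one cell per record

# later, on the upgraded install:
rows = list(csv.reader(io.StringIO(buf.getvalue())))[1:]  # skip header
restored = [(r[0], r[1], json.loads(r[2])) for r in rows]
# new_collection.add(ids=[...], documents=[...], metadatas=[...])
```

Embeddings can be re-computed on import, or exported the same way if recomputation is expensive.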

tazarov avatar May 09 '24 11:05 tazarov

Thanks @tazarov, however I am just looking to start a new index from scratch, so no migration is necessary. The error happens regardless.

thomaspile avatar May 09 '24 11:05 thomaspile

I think the error you are encountering has to do with your storage medium.

Looking at `/local_disk0/.ephemeral_nfs`, I can assume this is some sort of block storage. Can you elaborate on how you are running Chroma (container, `chroma run`, etc.)?

tazarov avatar May 09 '24 11:05 tazarov

@tazarov I am running on Azure Databricks and importing Chroma via `from langchain_community.vectorstores import Chroma`.

thomaspile avatar May 09 '24 11:05 thomaspile

If my assumption is correct, you are using some sort of shared storage that relies on NFS. NFS is inherently not a good choice for Chroma workloads and will occasionally result in the I/O error above.

Can you point the persist directory to another location?
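For context, the stack trace above fails inside Chroma's SQLite backend, and SQLite depends on file locking that NFS mounts often don't provide reliably, which is a plausible source of the `disk I/O error` on commit. A stdlib-only sketch of the suggestion, using a temporary directory as a stand-in for node-local storage (paths are illustrative, not a Databricks-specific recommendation):

```python
# Sketch: persist to a locking-capable local filesystem instead of an NFS mount.
import os
import sqlite3
import tempfile

local_dir = tempfile.mkdtemp(prefix="chroma_")   # stand-in for a node-local path
db_path = os.path.join(local_dir, "chroma.sqlite3")

# Mimic the commit that fails in Chroma's SqliteDB.setup_migrations on NFS.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE IF NOT EXISTS migrations (dir TEXT, version INTEGER)")
conn.commit()   # succeeds on a local filesystem
conn.close()

# Chroma(persist_directory=local_dir, ...) would then use this location.
```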

tazarov avatar May 09 '24 12:05 tazarov

Ah right, I see. Unfortunately I can't use any other location, so I've decided to use Cosmos DB for now instead. Thanks for your help anyway, much appreciated.

thomaspile avatar May 25 '24 22:05 thomaspile