Chroma query with a unique metadata filter

Describe the problem

Here is the scenario in which I need to query for unique metadata.

I am ingesting large document texts into ChromaDB as embeddings. Because of the embedding model's token limit, I split each text into chunks of 512 tokens.
I generate an embedding for each chunk, but all chunks of a text belong to the same document, which is identified by a doc_id.
When I query and any chunk of a document matches, I do not want any other chunk from that document in the results. Once one chunk of a document has matched, there is no need to consider its other chunks, since they all point to the same document.
I am planning to store the doc_id as metadata on every chunk.
So I need a distinct query on the doc_id metadata. Currently I filter manually by keeping the doc_ids I have already seen in a set and checking each result against it, which is inefficient.

Describe the proposed solution

I tried something like collection.get(where={"$distinct": "doc_id"})

but this does not work. I also have not found any reference to distinct in the Chroma documentation.

Alternatives considered

Manually filtering the returned documents by checking whether each doc_id has already been seen.

Importance

would make my life easier

Additional Information

No response

Sep 27 '24 11:09 Mhsh

@Mhsh

Which chunk would get returned when you use this "$distinct" operator?

Here's a suggestion:

Add a metadata field such as order_in_doc. When you do your chunking/splitting, iterate through the chunks of each doc and set order_in_doc to an incrementing value (0, 1, 2, 3, 4, etc.). Then, when you search, you can include the filter order_in_doc: 0, which restricts the search to the first chunk of each doc, so at most one chunk per doc comes back. But, again, this depends on how you want "$distinct" to work.
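
A minimal sketch of that tagging step, in case it helps. The helper name build_chunk_records and the "<doc_id>-<index>" ID scheme are made up for illustration; only the doc_id and order_in_doc metadata keys come from this thread. The Chroma calls are left as comments so the snippet runs without a database.

```python
def build_chunk_records(doc_id, chunks):
    """Build parallel ids/documents/metadatas lists for one document's chunks.

    Hypothetical helper: the ID format is an assumption, not a Chroma requirement.
    """
    ids = [f"{doc_id}-{i}" for i in range(len(chunks))]
    # order_in_doc counts up from 0 within each document.
    metadatas = [{"doc_id": doc_id, "order_in_doc": i} for i in range(len(chunks))]
    return ids, list(chunks), metadatas


ids, documents, metadatas = build_chunk_records("doc-a", ["first chunk", "second chunk"])

# With an existing collection, the records could then be stored and queried like this:
# collection.add(ids=ids, documents=documents, metadatas=metadatas)
# collection.query(query_texts=["hello world"], n_results=5, where={"order_in_doc": 0})
```

Note the trade-off: with this filter only the first chunk of each doc is ever compared to the query, so a doc whose best match sits in a later chunk can be missed.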

Apr 18 '25 02:04 hesreallyhim

Hi, because the where clause is a pre-filter, allowing distinct there would take away the power of semantic search. For example, let's say that for this query, collection.query(query_texts=["hello world"], n_results=5, where={"key1": {"$eq": "value1"}}), the where filter narrows the IDs down to the 20 where key1 == value1. Chroma then runs the vector search against those 20 IDs, using the query text to order them by distance. If we were to allow $distinct, the filter would instead let through only 1 ID, and that one may not even be among the closest vectors to the query text.

What it sounds like you want is post-filtering: you run the vector search first, and then, of the n_results you get back, you filter down to just 1 where the ID matches your specific query. I will close this thread here and start a new one for post-query filtering.
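
Until something like that exists, post-filtering can be done on the client. Below is a rough sketch, not a Chroma feature: dedupe_by_doc_id is a hypothetical helper, and it assumes the result dict has the nested-list shape that collection.query returns ("ids", "metadatas", "distances", one inner list per query text), already ordered by distance. The sample result is hand-written data, not real output.

```python
def dedupe_by_doc_id(result, query_index=0, limit=None):
    """Keep only the closest chunk per doc_id from a query-style result dict."""
    ids = result["ids"][query_index]
    metadatas = result["metadatas"][query_index]
    distances = result["distances"][query_index]

    seen = set()
    kept = []
    # Results are assumed to be sorted by ascending distance, so the first
    # chunk seen for a doc_id is that document's best match.
    for chunk_id, meta, dist in zip(ids, metadatas, distances):
        doc_id = (meta or {}).get("doc_id")
        if doc_id in seen:
            continue
        seen.add(doc_id)
        kept.append((chunk_id, doc_id, dist))
        if limit is not None and len(kept) >= limit:
            break
    return kept


# Hand-written stand-in for collection.query(query_texts=[...], n_results=5)
sample = {
    "ids": [["a-3", "a-1", "b-0", "c-2", "b-5"]],
    "metadatas": [[{"doc_id": "a"}, {"doc_id": "a"}, {"doc_id": "b"},
                   {"doc_id": "c"}, {"doc_id": "b"}]],
    "distances": [[0.1, 0.2, 0.3, 0.4, 0.5]],
}
top = dedupe_by_doc_id(sample, limit=2)
```

Since duplicates are dropped after the search, it usually makes sense to request a larger n_results than the number of distinct documents you actually need.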

May 14 '25 21:05 jairad26