Query with unique metadata filter.
Describe the problem
Here is the use case where I need to query for unique metadata values.
- I am ingesting large document texts as embeddings into chromadb. Because of the embedding model's token limit, I split each text into chunks of 512 tokens.
- I generate an embedding for each chunk, but all chunks from one document share the same doc_id.
- When I query and any chunk of a document matches, I do not want any other chunk from the same document in the results. One matching chunk per document is enough.
- I am planning to store the doc_id as metadata for all chunks.
- So I need a distinct query on the doc_id metadata. Currently I filter manually by keeping the doc_ids I have seen in a set and checking each result against it, which is inefficient.
Describe the proposed solution
I tried something like collection.get(where={"$distinct": "doc_id"})
but this does not work. I also have not found any reference to distinct in the Chroma documentation.
Alternatives considered
Manually filtering the results after checking whether each chunk's doc_id metadata has already been seen.
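For reference, the manual filtering described above could look roughly like the sketch below. It works on a hand-written dictionary in the shape that collection.query returns (lists of lists under "ids", "metadatas" and "distances"), and it assumes the hits arrive sorted by ascending distance. The helper name dedupe_by_doc_id and the sample data are made up for illustration.

```python
from typing import Any


def dedupe_by_doc_id(result: dict[str, Any], key: str = "doc_id") -> list[dict[str, Any]]:
    """Keep only the first (closest) chunk per document for the first query in `result`."""
    ids = result["ids"][0]
    metadatas = result["metadatas"][0]
    distances = result["distances"][0]

    seen: set[Any] = set()
    hits: list[dict[str, Any]] = []
    # Assumption: hits are ordered by ascending distance, so the first chunk
    # seen for a doc_id is that document's best match.
    for chunk_id, meta, dist in zip(ids, metadatas, distances):
        doc_id = meta[key]
        if doc_id in seen:
            continue
        seen.add(doc_id)
        hits.append({"id": chunk_id, "doc_id": doc_id, "distance": dist})
    return hits


# Hand-written sample in the query-result shape, not real Chroma output.
sample = {
    "ids": [["a-0", "a-3", "b-1", "a-1", "c-0"]],
    "metadatas": [[
        {"doc_id": "a"}, {"doc_id": "a"}, {"doc_id": "b"},
        {"doc_id": "a"}, {"doc_id": "c"},
    ]],
    "distances": [[0.10, 0.12, 0.20, 0.31, 0.40]],
}
unique_hits = dedupe_by_doc_id(sample)
```

This is a single pass over the hits, so the cost is small. The real drawback is that duplicates use up n_results slots, so fewer distinct documents come back than were requested.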
Importance
would make my life easier
Additional Information
No response
@Mhsh
Which chunk would get returned when you use this "$distinct" operator?
Here's a suggestion:
- Add a metadata field for order_in_doc or something, and then when you do your chunking/splitting, iterate through the chunks from one doc and set order_in_doc to an incrementing value (0, 1, 2, 3, 4, etc.). Then when you are searching you can include a filter order_in_doc: 0, and this will include only the first chunk from any matching/similar doc. But, again, this depends on how you want "$distinct" to work.
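A minimal sketch of that suggestion, assuming chunk IDs of the form doc_id-index. The helper build_chunk_records and the sample document are hypothetical. It only prepares the lists; the comments show where they would be passed to a real collection.

```python
def build_chunk_records(
    doc_id: str, chunks: list[str]
) -> tuple[list[str], list[str], list[dict]]:
    """Build ids, documents and metadatas for one document's chunks."""
    ids = [f"{doc_id}-{i}" for i in range(len(chunks))]
    # Every chunk carries its parent doc_id and its position in the document.
    metadatas = [{"doc_id": doc_id, "order_in_doc": i} for i in range(len(chunks))]
    return ids, chunks, metadatas


ids, documents, metadatas = build_chunk_records(
    "report-1", ["intro text", "body text", "closing text"]
)

# With a real collection these lists would be passed as
#   collection.add(ids=ids, documents=documents, metadatas=metadatas)
# and the query would carry the filter below as its `where` argument.
first_chunk_filter = {"order_in_doc": 0}
```

Note that this filter only ever searches the first chunk of each document, so a document whose relevant text sits in a later chunk will not match.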
Hi, because the where clause is a pre-filter, allowing distinct would take away the power of semantic search. For example, let's say that for this query, collection.query(query_texts=["hello world"], n_results=5, where={"key1": {"$eq": "value1"}}), the where filter narrows the candidates down to the 20 IDs where key1 == value1. It then runs vector search against those 20 IDs, using the query text to order them by distance. If we were to allow $distinct, the filter would instead allow only 1 ID, and this may not even be one of the closest vectors to the query text.
What it sounds like you would like is post-filtering, where you run a vector search and then, of the n_results you get back, filter down to just 1 where the ID matches your specific query. I will close this thread here and start a new one for post-query filtering instead.
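To make the difference concrete, here is a toy illustration in plain Python with no Chroma involved. The 2-D vectors, chunk IDs and function names are all invented, and "pre-filter distinct" is read here as "keep the first stored chunk per doc_id before searching", which is only one possible interpretation of $distinct.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def first_per_doc(chunks):
    """Keep the first chunk encountered for each doc_id, preserving order."""
    seen, kept = set(), []
    for chunk in chunks:
        if chunk["doc_id"] not in seen:
            seen.add(chunk["doc_id"])
            kept.append(chunk)
    return kept


def search(chunks, query, n):
    """Brute-force nearest-neighbour search: the n closest chunks to `query`."""
    return sorted(chunks, key=lambda c: sq_dist(c["vec"], query))[:n]


# Invented data: document "a" has two chunks, and only the second is near the query.
chunks = [
    {"id": "a-0", "doc_id": "a", "vec": (0.9, 0.1)},
    {"id": "a-1", "doc_id": "a", "vec": (0.1, 0.9)},
    {"id": "b-0", "doc_id": "b", "vec": (0.5, 0.5)},
]
query = (0.0, 1.0)

# Pre-filter: deduplicate first, then search what is left.
pre = search(first_per_doc(chunks), query, n=2)

# Post-filter: search everything, then deduplicate the ranked hits.
post = first_per_doc(search(chunks, query, n=3))

pre_ids = [c["id"] for c in pre]
post_ids = [c["id"] for c in post]
```

In this toy setup the pre-filter discards a-1, the chunk closest to the query, before the search ever sees it, while the post-filter keeps it as document "a"'s best match. Post-filtering against a real collection would likely need n_results set higher than the number of documents wanted, since duplicates are dropped afterwards.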