chroma
Inconsistent behaviour when filtering by metadata using $or
Hi,
I'm running into problematic behaviour when querying chroma with $or on metadata from Python:
Given a collection with a set of embeddings where the metadata contains information like this:
{
"category": "all"
}
and a second set of embeddings with the metadata like
{
"category": "test"
}
where the embeddings in the second set have both higher and lower distances to the search pattern used
embeds = collection.query(
    query_texts=[search],
    where={"$or": [{"category": {"$eq": "test"}}, {"category": {"$eq": "all"}}]},
    n_results=3
)
and that the embeddings for category "all" were added to the collection before adding category "test"
Current behaviour: the query returns only data from category "all"
Expected behaviour: the query should return values from category "all" and from category "test"
When querying for category "test" on its own, the returned embeddings have both higher and lower distances than the ones returned for category "all". When the second category in the $or is a different one that does not exist, the embeddings for category "test" are returned correctly.
Any ideas what could be wrong?
How much data do you have, and what's the distribution across categories? I attempted to repro in this colab:
https://colab.research.google.com/drive/1glfVqsOOVxyLsJfmIq1GJBd9pazu1e-T#scrollTo=iWGs42guud0w
I am seeing results for both categories there, so I am not able to reproduce the issue. It is possible that your data is distributed in such a way that prefiltering HNSW is causing a pathological case.
@HammadB could you please clarify what you mean by prefiltering HNSW? My understanding is that when querying with an embedding and a set of metadata filters, the filters are applied first, and approximate nearest neighbour is done on the results that match the filter only.
Yes, that is correct. Applying the filters first is called "prefiltering". If we queried for nearest neighbors first and then filtered afterwards, it would be "post-filtering".
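To make the distinction concrete, here is a minimal sketch in plain Python. It uses an exact brute-force search over a handful of made-up records, not HNSW and not Chroma's actual implementation; the helper names (`prefilter_query`, `postfilter_query`, `matches`) and the sample data are hypothetical and only illustrate the order of operations.

```python
import math

def distance(a, b):
    # Euclidean distance between two vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def matches(metadata, allowed):
    # Stand-in for {"$or": [{"category": ...}, {"category": ...}]}.
    return metadata.get("category") in allowed

def prefilter_query(records, query, allowed, n_results):
    # Prefiltering: apply the metadata filter first, then rank what is left.
    candidates = [r for r in records if matches(r["metadata"], allowed)]
    candidates.sort(key=lambda r: distance(r["embedding"], query))
    return candidates[:n_results]

def postfilter_query(records, query, allowed, n_results):
    # Post-filtering: take the n nearest overall, then drop non-matching ones.
    nearest = sorted(records, key=lambda r: distance(r["embedding"], query))
    return [r for r in nearest[:n_results] if matches(r["metadata"], allowed)]

# Hypothetical data: two "other" records sit between the matching ones.
records = [
    {"id": "a1", "embedding": [0.0, 0.1], "metadata": {"category": "all"}},
    {"id": "o1", "embedding": [0.0, 0.2], "metadata": {"category": "other"}},
    {"id": "o2", "embedding": [0.0, 0.3], "metadata": {"category": "other"}},
    {"id": "t1", "embedding": [0.0, 0.4], "metadata": {"category": "test"}},
    {"id": "a2", "embedding": [0.0, 0.5], "metadata": {"category": "all"}},
]
query = [0.0, 0.0]
allowed = {"test", "all"}

pre = [r["id"] for r in prefilter_query(records, query, allowed, 3)]
post = [r["id"] for r in postfilter_query(records, query, allowed, 3)]
print(pre)   # ['a1', 't1', 'a2']
print(post)  # ['a1']
```

With prefiltering, all three requested results come from the matching categories. With post-filtering, the two "other" records use up result slots before the filter runs, so fewer than `n_results` come back.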
This is stale. Closing this issue for now. Happy to re-open.