category filter

Open gtsoukas opened this issue 2 years ago • 3 comments

Would it be in the spirit of this benchmark to add a second benchmark category for ANN in conjunction with categorical filters?

Most real-world applications of ANN require category filtering: when searching for clothes via ANN in an e-commerce scenario, for example, one might filter by gender (categorical) or availability (categorical).

There are several software products that allow combining ANN search with category filters, e.g. Apache Solr, Elasticsearch, Vertex AI Matching Engine, weaviate, qdrant. However, they differ from this benchmark mainly in that they are managed services, or at least services, rather than embeddable libraries.

In addition to recall vs. queries per second, there should be a view of filter fraction (the fraction of the data that passes the filter) vs. recall vs. queries per second. For the proprietary managed services, a cost dimension might also be useful.
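For concreteness, here is a rough sketch of how recall would have to be defined under such a filter: the ANN results returned under the filter are scored against a ground truth that is itself restricted to the points passing the filter. The helper below is just an illustration (the function name and arguments are not part of ann-benchmarks), using brute force on small data:

```python
import numpy as np

def filtered_ground_truth(train, queries, categories, wanted_category, k=10):
    """Brute-force k-NN restricted to corpus points of the wanted category.

    Recall of a filtered ANN search should be measured against this
    filtered ground truth, not against the unfiltered neighbours.
    """
    allowed = np.flatnonzero(categories == wanted_category)  # indices passing the filter
    # Euclidean distances from every query to the allowed points only
    dists = np.linalg.norm(queries[:, None, :] - train[allowed][None, :, :], axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :k]                # k nearest among allowed points
    return allowed[nearest]                                   # map back to original indices
```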

I have found the following blog articles covering the topic:

  • https://blog.vasnetsov.com/posts/categorical-hnsw/
  • https://towardsdatascience.com/effects-of-filtered-hnsw-searches-on-recall-and-latency-434becf8041c
  • https://towardsdatascience.com/using-approximate-nearest-neighbor-search-in-real-world-applications-a75c351445d

Given that this would be very useful for practical implementations, but also that it would significantly complicate the benchmark, I would be interested in your opinion and/or in how I could help with it. It would also be great to know whether someone has already done such benchmarks.

gtsoukas, Aug 07 '22 08:08

I think that would be interesting! I think the downsides are:

  1. It would make the benchmark more complex
  2. Not sure if there are any obvious public datasets for this?

erikbern, Aug 08 '22 09:08

I think that would be interesting! I think the downsides are:

  1. It would make the benchmark more complex

Fully agree, probably the key reason not to do it.

  2. Not sure if there are any obvious public datasets for this?

Datasets from the existing benchmark could be reused if an additional artificial, categorical random variable is introduced, allowing filtering down to any fraction of the original dataset between 0% and 100%. The approach is described here: https://towardsdatascience.com/effects-of-filtered-hnsw-searches-on-recall-and-latency-434becf8041c
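For illustration, a minimal sketch of that idea, assuming an existing ann-benchmarks HDF5 file such as glove-100-angular.hdf5 (the "train_categories" key and the probabilities are made up for this example); the category probabilities control what fraction of the corpus a single-category filter keeps:

```python
import h5py
import numpy as np

# Attach an artificial categorical attribute to an existing dataset.
# With these probabilities a single-category filter keeps 1%, 9%, 40% or 50%
# of the points, giving several filter fractions from the same data.
rng = np.random.default_rng(42)
category_probs = [0.01, 0.09, 0.40, 0.50]

with h5py.File("glove-100-angular.hdf5", "r+") as f:
    n_points = f["train"].shape[0]
    labels = rng.choice(len(category_probs), size=n_points, p=category_probs)
    f.create_dataset("train_categories", data=labels.astype(np.int32))
```

The pre-computed neighbours stored in the file would of course have to be recomputed per filter, since the true nearest neighbours change once part of the corpus is excluded.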

gtsoukas, Aug 08 '22 11:08

if an additional artificial, categorical random variable is introduced, allowing filtering down to any fraction of the original dataset between 0% and 100%

I think that makes sense, but it would be nice if there were some more natural way to do it. E.g. for the MNIST dataset, filtering by digit 0-9 could be nice.
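For illustration, a quick sketch of that idea using scikit-learn's OpenML copy of MNIST (just a sketch, not an ann-benchmarks integration); each digit label acts as a natural category covering roughly 10% of the points:

```python
import numpy as np
from sklearn.datasets import fetch_openml

# MNIST digit labels as natural categories: no artificial attribute is needed.
# X holds the 784-dimensional vectors that would be indexed; y the digit labels.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
y = y.astype(int)

for digit in range(10):
    frac = np.mean(y == digit)
    print(f"digit {digit}: the filter keeps {frac:.1%} of the points")
```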

erikbern, Aug 09 '22 10:08