ann-filtering-benchmark-datasets
Collection of datasets for benchmarking filtered vector similarity retrieval
ANN Filtered Retrieval Datasets
This repo contains a collection of datasets, inspired by ann-benchmarks, for benchmarking vector similarity search with additional filtering conditions.
Motivation
More and more applications are now using vector similarity search in their products. The task of approximate nearest neighbor (ANN) search has gone beyond the scope of academic research and the narrow circle of huge IT corporations.
As a result, combining vector search with application business logic is becoming increasingly relevant.
Examples and cases
- It is no longer enough to search for similar dishes by photo; you need to search only among restaurants that are in the delivery area.
- It is not enough to search for all items with a similar description; you also need to consider price ranges, stock availability, etc.
- It is not enough to find candidates for a job position based on similar skills; you also have to consider location, level of spoken language, and seniority.
- You name it.
Is it that different?
Classical approaches to ANN, and their implementations in many libraries, were usually tuned for benchmarks in which search speed across all vectors is the only comparison criterion.
Because of this, they had to sacrifice many features that are useful in other situations: the ability to quickly delete, insert, and modify stored values, as well as saving and metadata-based filtering.
Data
Description | Number of vectors | Dimension | Distance | Filters | Link |
---|---|---|---|---|---|
all-MiniLM-L6-v2 ArXiv titles | 2 138 591 | 384 | Cosine | match keyword / range | link |
Efficientnet encoded H&M Clothes | 105 100 | 2048 | Cosine | match keyword | link |
LAION Sample encoded with CLIP | 100 000 | 512 | Cosine | range | link |
Random vectors \ random payload | 1 000 000 | 100 | Cosine | match keyword | link |
Random vectors \ random payload | 1 000 000 | 100 | Cosine | match int | link |
Random vectors \ random payload | 1 000 000 | 100 | Cosine | range | link |
Random vectors \ random payload | 1 000 000 | 100 | Cosine | geo-radius | link |
Random vectors \ random payload | 100 000 | 2048 | Cosine | match keyword | link |
Random vectors \ random payload | 100 000 | 2048 | Cosine | match int | link |
Random vectors \ random payload | 100 000 | 2048 | Cosine | range | link |
Random vectors \ random payload | 100 000 | 2048 | Cosine | geo-radius | link |
Data Format
Each dataset consists of the following files:
- vectors.npy: NumPy matrix of vectors, with shape num_vectors x dim.
- payloads.jsonl: payload values associated with the vectors. The number of lines equals num_vectors.
- tests.jsonl: collection of queries with filtering conditions and expected results. Each query contains the fields:
  - query: vector to be used for the similarity search
  - conditions: filtering conditions of 3 possible types: match, range, and geo
  - closest_ids: IDs of the records expected to be found for the given query
  - closest_scores: similarity scores of the associated IDs
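The three files can be read with NumPy and the standard json module. The snippet below is a minimal sketch, not part of this repo: the helper names read_jsonl and load_dataset are hypothetical, and it assumes that one dataset's files sit together in a single directory.

```python
import json
from pathlib import Path

import numpy as np


def read_jsonl(path):
    """Read a JSON Lines file into a list of parsed objects, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_dataset(dataset_dir):
    """Load vectors, payloads, and test queries from one dataset directory.

    Hypothetical helper: assumes the directory holds vectors.npy,
    payloads.jsonl, and tests.jsonl as described in the format above.
    """
    dataset_dir = Path(dataset_dir)
    vectors = np.load(dataset_dir / "vectors.npy")         # shape: (num_vectors, dim)
    payloads = read_jsonl(dataset_dir / "payloads.jsonl")  # one JSON object per line
    tests = read_jsonl(dataset_dir / "tests.jsonl")        # queries with expected results

    # The format states that payloads.jsonl has num_vectors lines; fail early if not.
    if len(payloads) != vectors.shape[0]:
        raise ValueError(
            f"{len(payloads)} payload lines but {vectors.shape[0]} vectors"
        )
    return vectors, payloads, tests
```

A call such as load_dataset("path/to/dataset") would then return the matrix, the list of payloads, and the list of test queries. For the larger datasets, np.load also accepts mmap_mode="r" to avoid reading the whole matrix into memory.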
Example queries
{
  "query": [-0.034, -0.185, -0.21, ...],
  "conditions": {
    "and": [
      {
        "department_name": {
          "match": {
            "value": "Divided Shoes"
          }
        }
      }
    ]
  },
  "closest_ids": [565, 15631, 100747, ....],
  "closest_scores": [0.734, 0.698, 0.697, 0.689, ...]
}
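To show how a query like this can be checked against the expected results, here is a rough sketch of an exact (brute-force) filtered search in Python. It is an illustration under stated assumptions, not the procedure used to build these datasets. The function names matches, filtered_search, and recall are hypothetical. It assumes that row i of vectors.npy, line i of payloads.jsonl, and ID i refer to the same record, and that payload values are scalars compared by equality. Only the match clause shown above is handled, because the layout of range and geo clauses does not appear in the example.

```python
import numpy as np


def matches(payload, conditions):
    """Return True if the payload satisfies every clause listed under "and".

    Only the "match" clause shape from the example query is supported;
    "range" and "geo" clauses are rejected because their layout is not shown.
    """
    for clause in conditions.get("and", []):
        for field, condition in clause.items():
            if "match" not in condition:
                raise NotImplementedError(f"unsupported condition: {condition}")
            if payload.get(field) != condition["match"]["value"]:
                return False
    return True


def filtered_search(vectors, payloads, query, conditions, top=10):
    """Exact cosine search restricted to records whose payload passes the filter."""
    # Assumption: the position of a record in vectors/payloads is its ID.
    candidate_ids = np.array(
        [i for i, payload in enumerate(payloads) if matches(payload, conditions)],
        dtype=np.int64,
    )
    if candidate_ids.size == 0:
        return [], []
    candidates = np.asarray(vectors)[candidate_ids].astype(np.float64)
    query = np.asarray(query, dtype=np.float64)
    # Cosine similarity: dot product divided by the product of the norms.
    # Zero-length vectors are not handled in this sketch.
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    scores = candidates @ query / norms
    order = np.argsort(-scores)[:top]  # highest similarity first
    return candidate_ids[order].tolist(), scores[order].tolist()


def recall(found_ids, expected_ids):
    """Fraction of the expected IDs that appear among the found IDs."""
    if not expected_ids:
        return 1.0
    return len(set(found_ids) & set(expected_ids)) / len(expected_ids)
```

For a test entry t, the IDs returned by filtered_search(vectors, payloads, t["query"], t["conditions"]) can be compared with t["closest_ids"] through recall. The same recall function applies to results from an approximate index.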